Efficient And Robust Video Understanding For Human-Robot Interaction And Detection