Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention
Automatic sign language recognition lies at the intersection of natural language processing (NLP) and computer vision. The highly successful transformer architectures, based on multi-head attention, originate from the field of NLP. The Video Transformer Network (VTN) is an adaptation of this concept for tasks that require video understanding, e.g., action recognition. However, due to the limited amount of labeled data that is commonly available for training automatic sign (language) recognition, the VTN cannot reach its full potential in this domain. In this work, we reduce the impact of this data limitation by automatically pre-extracting useful information from the sign language videos. In our approach, different types of information are offered to a VTN in a multi-modal setup. It includes per-frame human pose keypoints (extracted by OpenPose) to capture the body movement and hand crops to capture the (evolution of) hand shapes. We evaluate our method on the recently released AUTSL dataset for isolated sign recognition and obtain 92.92% accuracy on the test set using only RGB data. For comparison: the VTN architecture without hand crops and pose flow achieved 82% accuracy. A qualitative inspection of our model hints at further potential of multi-modal multi-head attention in a sign language recognition context.
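To make the pose-based input more concrete, the following is a minimal sketch of how a pose flow feature could be derived from per-frame keypoints. The abstract does not define pose flow, so the sketch assumes it is the frame-to-frame displacement of each keypoint. The function name `pose_flow`, the type aliases, and the toy coordinates are hypothetical; in practice the keypoints would come from a pose estimator such as OpenPose, which is not called here.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]  # (x, y) image coordinates of one keypoint
Frame = List[Keypoint]          # all keypoints detected in one video frame


def pose_flow(frames: List[Frame]) -> List[Frame]:
    """Return the per-keypoint displacement between consecutive frames.

    Assumption: pose flow is the difference of keypoint coordinates
    between frame t-1 and frame t. The first frame gets zero flow so
    that the output has the same length as the input clip.
    """
    if not frames:
        return []
    flow = [[(0.0, 0.0) for _ in frames[0]]]
    for prev, curr in zip(frames, frames[1:]):
        flow.append(
            [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)]
        )
    return flow


# Toy clip with 3 frames and 2 keypoints per frame (made-up values).
clip = [
    [(10.0, 20.0), (30.0, 40.0)],
    [(12.0, 20.0), (30.0, 43.0)],
    [(15.0, 19.0), (31.0, 43.0)],
]
flow = pose_flow(clip)
```

In a multi-modal setup like the one described above, such a per-frame flow vector would be one input stream alongside the hand-crop features; how the streams are combined is not specified in the abstract and is left out of this sketch.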