Abstract: Recently, video recognition is emerging with the help of multi-modal learning, which focuses on integrating distinct modalities to improve the performance or robustness of the model.