junchen14 / Multi-Modal-Transformer
The repository collects a wide range of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also gathers useful tutorials and tools from these related domains.
☆230 · Updated 3 years ago
Alternatives and similar repositories for Multi-Modal-Transformer
Users interested in Multi-Modal-Transformer are comparing it to the libraries listed below.
- Code release for "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition" (CVPR 2022) ☆151 · Updated 2 years ago
- Recent Advances in Vision and Language Pre-training (VLP) ☆294 · Updated 2 years ago
- [CVPR 2022] Official code for "Unified Contrastive Learning in Image-Text-Label Space" ☆402 · Updated last year
- Video Contrastive Learning with Global Context (ICCVW 2021) ☆160 · Updated 3 years ago
- Official implementation of "Sports Video Analysis on Large-scale Data" (https://arxiv.org/abs/2208.04897) ☆78 · Updated 2 years ago
- PyTorch implementation of BEVT (CVPR 2022), https://arxiv.org/abs/2112.01529 ☆164 · Updated 3 years ago
- ☆57 · Updated 3 years ago
- ☆179 · Updated 3 years ago
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts ☆187 · Updated 6 months ago
- Official implementation of "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation" ☆362 · Updated last year
- Official implementation of "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval" (CVPR 2022) ☆112 · Updated 3 years ago
- ☆191 · Updated 3 years ago
- Code release for "Anticipative Video Transformer" (ICCV 2021) ☆153 · Updated 3 years ago
- PyTorch implementation of a collection of scalable video transformer benchmarks ☆303 · Updated 3 years ago
- PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR 2022) ☆207 · Updated 2 years ago
- [CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers ☆189 · Updated 2 years ago
- Multi-Modal Transformer for Video Retrieval ☆262 · Updated last year
- End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021) ☆224 · Updated last year
- Code for "TCL: Vision-Language Pre-Training with Triple Contrastive Learning" (CVPR 2022) ☆266 · Updated last year
- Official repository for "Self-Supervised Video Transformer" (CVPR'22) ☆107 · Updated last year
- Official PyTorch implementation of "VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples" (CVP… ☆147 · Updated 4 years ago
- Official PyTorch implementation of "Probabilistic Cross-Modal Embedding" (CVPR 2021) ☆133 · Updated last year
- A survey on multimodal learning research ☆333 · Updated 2 years ago
- Using VideoBERT to tackle video prediction ☆132 · Updated 4 years ago
- [ACL 2023] Official PyTorch code for the Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning" ☆136 · Updated 2 years ago
- MixGen: A New Multi-Modal Data Augmentation ☆126 · Updated 2 years ago
- METER: A Multimodal End-to-end TransformER Framework ☆373 · Updated 2 years ago
- [ICLR 2022] Code for "How Much Can CLIP Benefit Vision-and-Language Tasks?" (https://arxiv.org/abs/2107.06383) ☆416 · Updated 3 years ago
- Unofficial implementation of TubeViT from "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning" ☆92 · Updated last year
- ☆70 · Updated 4 years ago