The repository collects a variety of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects useful tutorials and tools in these related domains.
☆234 · Aug 27, 2022 · Updated 3 years ago
Alternatives and similar repositories for Multi-Modal-Transformer
Users that are interested in Multi-Modal-Transformer are comparing it to the libraries listed below.
- A curated list of Survey Papers on Deep Learning. ☆11 · Sep 5, 2023 · Updated 2 years ago
- Recent Transformer-based CV and related works. ☆1,338 · Aug 22, 2023 · Updated 2 years ago
- A curated resources on what's happening in multimodal learning. Features recent papers, books, related lectures, and other relevant resou… ☆16 · Apr 28, 2023 · Updated 2 years ago
- LoMaR (Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction) ☆67 · Apr 3, 2025 · Updated 11 months ago
- Reading list for research topics in Masked Image Modeling ☆335 · Dec 3, 2024 · Updated last year
- Awesome list for research on CLIP (Contrastive Language-Image Pre-Training). ☆1,230 · Jun 28, 2024 · Updated last year
- A curated list of deep learning resources for video-text retrieval. ☆645 · Oct 20, 2023 · Updated 2 years ago
- Recent Advances in Vision and Language PreTrained Models (VL-PTMs) ☆1,155 · Aug 19, 2022 · Updated 3 years ago
- Multi-Modal Transformer for Video Retrieval ☆265 · Oct 9, 2024 · Updated last year
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. ☆1,710 · Updated this week
- ECCV 2022, Bootstrapped Masked Autoencoders for Vision BERT Pretraining ☆97 · Nov 2, 2022 · Updated 3 years ago
- Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" ☆1,528 · Apr 3, 2024 · Updated last year
- Papers of "A Survey on Multimodal LLMs from the Perspective of Input-Output Space Extension" ☆17 · Feb 4, 2026 · Updated last month
- Reading list for research topics in multimodal machine learning ☆6,845 · Aug 20, 2024 · Updated last year
- Course repository for the Spring 2023 COMP664 course "Deep Learning" at UNC ☆14 · Apr 17, 2023 · Updated 2 years ago
- A curated list of awesome vision and language resources (still under construction... stay tuned!) ☆559 · Nov 4, 2024 · Updated last year
- [NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training ☆1,705 · Dec 8, 2023 · Updated 2 years ago
- An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites ☆5,026 · Jul 30, 2024 · Updated last year
- ☆25 · Nov 23, 2021 · Updated 4 years ago
- Keras implementation of "DFNet: Discriminative feature extraction and integration network for salient object detection" ☆23 · Jan 5, 2021 · Updated 5 years ago
- ☆29 · Oct 4, 2023 · Updated 2 years ago
- Code for EMNLP 2020 paper CoDIR ☆41 · Oct 4, 2022 · Updated 3 years ago
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts ☆188 · May 1, 2025 · Updated 10 months ago
- Code of Decomposition and Completion Network for Salient Object Detection, TIP 2021. ☆10 · Mar 30, 2023 · Updated 3 years ago
- ☆279 · Mar 22, 2021 · Updated 5 years ago
- [CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning… ☆730 · Aug 8, 2023 · Updated 2 years ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024. ☆16 · Oct 25, 2024 · Updated last year
- Includes PyTorch -> Keras model porting code for DeiT models with fine-tuning and inference notebooks. ☆41 · Apr 30, 2022 · Updated 3 years ago
- [2025 CVPR] Towards Open-Vocabulary Audio-Visual Event Localization ☆43 · Mar 7, 2025 · Updated last year
- Codebase for CVPR 2020 paper "Spatio-Temporal Graph for Video Captioning with Knowledge Distillation" ☆23 · Mar 4, 2020 · Updated 6 years ago
- Adaptive Multimodal Reasoning via Reinforcement Learning ☆23 · Jan 11, 2026 · Updated 2 months ago
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning ☆291 · Sep 6, 2022 · Updated 3 years ago
- A simple and effective feature extractor for untrimmed videos ☆13 · Sep 1, 2022 · Updated 3 years ago
- Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021) ☆60 · Sep 4, 2022 · Updated 3 years ago
- Code and benchmarks for the Semantic Video Retrieval Task ☆53 · Oct 18, 2022 · Updated 3 years ago
- Official repository for "Self-Supervised Video Transformer" (CVPR'22) ☆108 · Jun 26, 2024 · Updated last year
- Enhancing Surgical Instrument Segmentation: Integrating Vision Transformer Insights with Adapter ☆12 · Mar 21, 2024 · Updated 2 years ago
- Easy to use video deep features extractor ☆322 · Jul 5, 2020 · Updated 5 years ago
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral) ☆50 · Jan 30, 2024 · Updated 2 years ago