The repository collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and self-supervised learning models. Additionally, it also collects many useful tutorials and tools in these related domains.
☆235Aug 27, 2022Updated 3 years ago
Alternatives and similar repositories for Multi-Modal-Transformer
Users that are interested in Multi-Modal-Transformer are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- A curated list of Survey Papers on Deep Learning.☆12Sep 5, 2023Updated 2 years ago
- Recent Transformer-based CV and related works.☆1,344Aug 22, 2023Updated 2 years ago
- A curated resources on what's happening in multimodal learning. Features recent papers, books, related lectures, and other relevant resou…☆16Apr 28, 2023Updated 3 years ago
- LoMaR (Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction)☆68Apr 3, 2025Updated last year
- Reading list for research topics in Masked Image Modeling☆334Dec 3, 2024Updated last year
- Managed Kubernetes at scale on DigitalOcean • AdDigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
- A curated list of deep learning resources for video-text retrieval.☆644Oct 20, 2023Updated 2 years ago
- Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).☆1,231Jun 28, 2024Updated last year
- Recent Advances in Vision and Language PreTrained Models (VL-PTMs)☆1,157Aug 19, 2022Updated 3 years ago
- A summarization of Transformer-based architectures for CV tasks, including image classification, object detection, segmentation, and Few-…☆116Jun 17, 2022Updated 3 years ago
- Multi-Modal Transformer for Video Retrieval☆265Oct 9, 2024Updated last year
- ECCV2022,Bootstrapped Masked Autoencoders for Vision BERT Pretraining☆97Nov 2, 2022Updated 3 years ago
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.☆1,719May 18, 2026Updated last week
- Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"☆1,533Apr 3, 2024Updated 2 years ago
- Reading list for research topics in multimodal machine learning☆6,874Aug 20, 2024Updated last year
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- A curated list of awesome vision and language resources (still under construction... stay tuned!)☆562Nov 4, 2024Updated last year
- [NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training☆1,745Dec 8, 2023Updated 2 years ago
- An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites☆5,035Jul 30, 2024Updated last year
- ☆25Nov 23, 2021Updated 4 years ago
- ☆29Oct 4, 2023Updated 2 years ago
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts☆188May 1, 2025Updated last year
- ☆280Mar 22, 2021Updated 5 years ago
- [CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning…☆730Aug 8, 2023Updated 2 years ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.☆16Oct 25, 2024Updated last year
- Managed Database hosting by DigitalOcean • AdPostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
- Includes PyTorch -> Keras model porting code for DeiT models with fine-tuning and inference notebooks.☆41Apr 30, 2022Updated 4 years ago
- [2025 CVPR] Towards Open-Vocabulary Audio-Visual Event Localization☆44Mar 7, 2025Updated last year
- Adaptive Multimodal Reasoning via Reinforcement Learning☆23Jan 11, 2026Updated 4 months ago
- Codebase for CVPR 2020 paper "Spatio-Temporal Graph for Video Captioning with Knowledge Distillation"☆23Mar 4, 2020Updated 6 years ago
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning☆291Sep 6, 2022Updated 3 years ago
- A simple and effective feature extractor for untrimmed videos☆13Sep 1, 2022Updated 3 years ago
- Deep Multi-Speech model☆11Jul 25, 2018Updated 7 years ago
- Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)☆61Sep 4, 2022Updated 3 years ago
- Code and benchmarks for the Semantic Video Retrieval Task☆53Oct 18, 2022Updated 3 years ago
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- Official repository for "Self-Supervised Video Transformer" (CVPR'22)☆108Jun 26, 2024Updated last year
- Easy to use video deep features extractor☆322Jul 5, 2020Updated 5 years ago
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral)☆50Jan 30, 2024Updated 2 years ago
- Code used in the paper "Learning to Learn from Web Data through Deep Semantic Embeddings" ECCV 2018 MULA Workshop☆11Aug 1, 2018Updated 7 years ago
- A curated list of Multimodal Related Research.☆1,394Aug 5, 2023Updated 2 years ago
- Code Release for MeMViT Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, CVPR 2022☆155Nov 30, 2022Updated 3 years ago
- Repository for research works and resources related to model reprogramming <https://arxiv.org/abs/2202.10629>☆64Sep 17, 2025Updated 8 months ago