The repository collects a variety of multi-modal transformer architectures, including image transformers, video transformers, image-language transformers, video-language transformers, and self-supervised learning models. It also collects useful tutorials and tools in these domains.
☆235 · Aug 27, 2022 · Updated 3 years ago
Alternatives and similar repositories for Multi-Modal-Transformer
Users who are interested in Multi-Modal-Transformer are comparing it to the repositories listed below.
- A curated list of Survey Papers on Deep Learning. ☆12 · Sep 5, 2023 · Updated 2 years ago
- Recent Transformer-based CV and related works. ☆1,343 · Aug 22, 2023 · Updated 2 years ago
- Curated resources on what's happening in multimodal learning. Features recent papers, books, related lectures, and other relevant resou… ☆16 · Apr 28, 2023 · Updated 3 years ago
- LoMaR (Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction) ☆68 · Apr 3, 2025 · Updated last year
- Reading list for research topics in Masked Image Modeling ☆334 · Dec 3, 2024 · Updated last year
- A curated list of deep learning resources for video-text retrieval. ☆645 · Oct 20, 2023 · Updated 2 years ago
- Recent Advances in Vision and Language PreTrained Models (VL-PTMs) ☆1,157 · Aug 19, 2022 · Updated 3 years ago
- Multi-Modal Transformer for Video Retrieval ☆265 · Oct 9, 2024 · Updated last year
- ECCV 2022, Bootstrapped Masked Autoencoders for Vision BERT Pretraining ☆97 · Nov 2, 2022 · Updated 3 years ago
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. ☆1,723 · Jun 8, 2026 · Updated last week
- Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" ☆1,535 · Apr 3, 2024 · Updated 2 years ago
- Papers of "A Survey on Multimodal LLMs from the Perspective of Input-Output Space Extension" ☆19 · Feb 4, 2026 · Updated 4 months ago
- Reading list for research topics in multimodal machine learning ☆6,880 · Aug 20, 2024 · Updated last year
- A curated list of awesome vision and language resources (still under construction... stay tuned!) ☆562 · Nov 4, 2024 · Updated last year
- [NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training ☆1,759 · Dec 8, 2023 · Updated 2 years ago
- An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites ☆5,040 · Jul 30, 2024 · Updated last year
- ☆25 · Nov 23, 2021 · Updated 4 years ago
- Keras implementation of "DFNet: Discriminative feature extraction and integration network for salient object detection" ☆23 · Jan 5, 2021 · Updated 5 years ago
- ☆29 · Oct 4, 2023 · Updated 2 years ago
- Code for EMNLP 2020 paper CoDIR ☆41 · Oct 4, 2022 · Updated 3 years ago
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts ☆188 · May 1, 2025 · Updated last year
- Code of Decomposition and Completion Network for Salient Object Detection, TIP 2021. ☆10 · Mar 30, 2023 · Updated 3 years ago
- ☆280 · Mar 22, 2021 · Updated 5 years ago
- [CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning… ☆730 · Aug 8, 2023 · Updated 2 years ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024. ☆16 · Oct 25, 2024 · Updated last year
- [NeurIPS 2025] Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM ☆27 · Feb 10, 2026 · Updated 4 months ago
- [2025 CVPR] Towards Open-Vocabulary Audio-Visual Event Localization ☆45 · Mar 7, 2025 · Updated last year
- Adaptive Multimodal Reasoning via Reinforcement Learning ☆23 · Jan 11, 2026 · Updated 5 months ago
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning ☆291 · Sep 6, 2022 · Updated 3 years ago
- A simple and effective feature extractor for untrimmed videos ☆13 · Sep 1, 2022 · Updated 3 years ago
- Deep Multi-Speech model ☆11 · Jul 25, 2018 · Updated 7 years ago
- Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021) ☆61 · Sep 4, 2022 · Updated 3 years ago
- Code and benchmarks for the Semantic Video Retrieval Task ☆53 · Oct 18, 2022 · Updated 3 years ago
- Official repository for "Self-Supervised Video Transformer" (CVPR'22) ☆109 · Jun 26, 2024 · Updated last year
- Easy to use video deep features extractor ☆322 · Jul 5, 2020 · Updated 5 years ago
- Code used in the paper "Learning to Learn from Web Data through Deep Semantic Embeddings" ECCV 2018 MULA Workshop ☆11 · Aug 1, 2018 · Updated 7 years ago
- A curated list of Multimodal Related Research. ☆1,391 · Aug 5, 2023 · Updated 2 years ago
- Code Release for MeMViT Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, CVPR 2022 ☆155 · Nov 30, 2022 · Updated 3 years ago
- [CVPR 2022] OCSampler: Compressing Videos to One Clip with Single-step Sampling ☆17 · Jun 21, 2022 · Updated 3 years ago