The repository collects many various multi-modal transformer architectures, including image transformer, video transformer, image-language transformer, video-language transformer and self-supervised learning models. Additionally, it also collects many useful tutorials and tools in these related domains.
☆235Aug 27, 2022Updated 3 years ago
Alternatives and similar repositories for Multi-Modal-Transformer
Users that are interested in Multi-Modal-Transformer are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- A curated list of Survey Papers on Deep Learning.☆12Sep 5, 2023Updated 2 years ago
- Recent Transformer-based CV and related works.☆1,343Aug 22, 2023Updated 2 years ago
- A curated resources on what's happening in multimodal learning. Features recent papers, books, related lectures, and other relevant resou…☆16Apr 28, 2023Updated 3 years ago
- LoMaR (Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction)☆68Apr 3, 2025Updated last year
- Reading list for research topics in Masked Image Modeling☆335Dec 3, 2024Updated last year
- Managed Kubernetes at scale on DigitalOcean • AdDigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
- Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).☆1,229Jun 28, 2024Updated last year
- A curated list of deep learning resources for video-text retrieval.☆645Oct 20, 2023Updated 2 years ago
- Recent Advances in Vision and Language PreTrained Models (VL-PTMs)☆1,157Aug 19, 2022Updated 3 years ago
- A summarization of Transformer-based architectures for CV tasks, including image classification, object detection, segmentation, and Few-…☆116Jun 17, 2022Updated 3 years ago
- Multi-Modal Transformer for Video Retrieval☆265Oct 9, 2024Updated last year
- ECCV2022,Bootstrapped Masked Autoencoders for Vision BERT Pretraining☆97Nov 2, 2022Updated 3 years ago
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.☆1,716Updated this week
- Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"☆1,531Apr 3, 2024Updated 2 years ago
- Reading list for research topics in multimodal machine learning☆6,865Aug 20, 2024Updated last year
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- Course repository for the Spring 2023 COMP664 course "Deep Learning" at UNC☆14Apr 17, 2023Updated 3 years ago
- A curated list of awesome vision and language resources (still under construction... stay tuned!)☆560Nov 4, 2024Updated last year
- [NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training☆1,725Dec 8, 2023Updated 2 years ago
- An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites☆5,035Jul 30, 2024Updated last year
- ☆25Nov 23, 2021Updated 4 years ago
- Keras implementation of "DFNet: Discriminative feature extraction and integration network for salient object detection"☆23Jan 5, 2021Updated 5 years ago
- ☆29Oct 4, 2023Updated 2 years ago
- Code for EMNLP 2020 paper CoDIR☆41Oct 4, 2022Updated 3 years ago
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts☆188May 1, 2025Updated last year
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- Code of Decomposition and Completion Network for Salient Object Detection, TIP 2021.☆10Mar 30, 2023Updated 3 years ago
- ☆280Mar 22, 2021Updated 5 years ago
- [CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning…☆730Aug 8, 2023Updated 2 years ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.☆16Oct 25, 2024Updated last year
- Includes PyTorch -> Keras model porting code for DeiT models with fine-tuning and inference notebooks.☆41Apr 30, 2022Updated 4 years ago
- Codebase for CVPR 2020 paper "Spatio-Temporal Graph for Video Captioning with Knowledge Distillation"☆23Mar 4, 2020Updated 6 years ago
- Adaptive Multimodal Reasoning via Reinforcement Learning☆23Jan 11, 2026Updated 3 months ago
- COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning☆291Sep 6, 2022Updated 3 years ago
- A simple and effective feature extractor for untrimmed videos☆13Sep 1, 2022Updated 3 years ago
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- Deep Multi-Speech model☆11Jul 25, 2018Updated 7 years ago
- Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)☆61Sep 4, 2022Updated 3 years ago
- Code and benchmarks for the Semantic Video Retrieval Task☆53Oct 18, 2022Updated 3 years ago
- Official repository for "Self-Supervised Video Transformer" (CVPR'22)☆108Jun 26, 2024Updated last year
- Easy to use video deep features extractor☆322Jul 5, 2020Updated 5 years ago
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral)☆50Jan 30, 2024Updated 2 years ago
- Code used in the paper "Learning to Learn from Web Data through Deep Semantic Embeddings" ECCV 2018 MULA Workshop☆11Aug 1, 2018Updated 7 years ago