YehLi / xmodalerLinks

X-modaler is a versatile and high-performance codebase for cross-modal analytics(e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

☆968

Alternatives and similar repositories for xmodaler

Users that are interested in xmodaler are comparing it to the libraries listed below

Sorting:

starmemda / CAMoE
☆101Updated 4 years ago
njustkmg / OMML
Multi-Modal learning toolkit based on PaddlePaddle and PyTorch, supporting multiple applications such as multi-modal classification, cros…
☆478Updated 2 years ago
jayleicn / ClipBERT
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning…
☆724Updated 2 years ago
BruceW91 / CVSE
The official source code for the paper Consensus-Aware Visual-Semantic Embedding for Image-Text Matching (ECCV 2020)
☆167Updated 3 years ago
Sense-X / X-Temporal
A general video understanding codebase from SenseTime X-Lab
☆445Updated 4 years ago
forence / Awesome-Visual-Captioning
This repository focus on Image Captioning & Video Captioning & Seq-to-Seq Learning & NLP
☆414Updated 3 years ago
microsoft / VideoX
VideoX: a collection of video cross-modal models
☆1,047Updated last year
JDAI-CV / image-captioning
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
☆275Updated 4 years ago
whwu95 / Cap4Video
【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
☆248Updated last year
whwu95 / MVFNet
【AAAI'2021】MVFNet: Multi-View Fusion Network for Efficient Video Recognition
☆134Updated 3 years ago
danieljf24 / awesome-video-text-retrieval
A curated list of deep learning resources for video-text retrieval.
☆638Updated 2 years ago
microsoft / UniVL
An official implementation for " UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
☆362Updated last year
FingerRec / BE
[CVPR2021] The source code for our paper 《Removing the Background by Adding the Background: Towards Background Robust Self-supervised Vid…
☆135Updated 4 years ago
MILVLG / bottom-up-attention.pytorch
A PyTorch reimplementation of bottom-up-attention models
☆304Updated 3 years ago
gabeur / mmt
Multi-Modal Transformer for Video Retrieval
☆264Updated last year
OpenNLPLab / AVSBench
[ECCV 2022] & [IJCV 2024] Official implementation of the paper: Audio-Visual Segmentation (with Semantics)
☆406Updated last year
jackroos / VL-BERT
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
☆746Updated 2 years ago
zdou0830 / METER
METER: A Multimodal End-to-end TransformER Framework
☆374Updated 3 years ago
MasterBin-IIAU / UNINEXT
[CVPR'23] Universal Instance Perception as Object Discovery and Retrieval
☆1,280Updated 2 years ago
ArrowLuo / CLIP4Clip
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
☆1,006Updated last year
microsoft / Oscar
Oscar and VinVL
☆1,051Updated 2 years ago
albanie / collaborative-experts
Video embeddings for retrieval with natural language queries
☆342Updated 2 years ago
cshizhe / hgr_v2t
Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".
☆212Updated 5 years ago
uta-smile / TCL
code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
☆267Updated last year
microsoft / SwinBERT
Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
☆246Updated 3 years ago
aimagelab / meshed-memory-transformer
Meshed-Memory Transformer for Image Captioning. CVPR 2020
☆540Updated 2 years ago
linjieli222 / HERO
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
☆235Updated 4 years ago
zengyan-97 / X-VLM
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
☆487Updated 3 years ago
ChenRocks / UNITER
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
☆797Updated 4 years ago
Paranioar / Awesome_Matching_Pretraining_Transfering
The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretr…
☆434Updated 2 months ago