iOPENCap / awesome-unimodal-trainingLinks

text-only training or language-free training for multimodal tasks (image/audio/video caption, retrieval, text2image)

☆11

Alternatives and similar repositories for awesome-unimodal-training

Users that are interested in awesome-unimodal-training are comparing it to the libraries listed below

Sorting:

assafbk / mocha_code
Mitigating Open-Vocabulary Caption Hallucinations (EMNLP 2024)
☆17Updated last year
THUNLP-MT / Brote
☆11Updated 8 months ago
adobe-research / llava-score
☆11Updated last year
Kamichanw / ICLTestbed
An in-context learning research testbed
☆19Updated 7 months ago
THUNLP-MT / CODIS
Repo for paper "CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models".
☆12Updated last year
junyangwang0410 / Attention-LLaVA
A hot-pluggable tool for visualizing LLaVA's attention.
☆23Updated last year
tingyu215 / TS-LLaVA
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
☆16Updated 9 months ago
zhiyuanhubj / Long_form_VideoQA
[EMNLP’24 Main] Encoding and Controlling Global Semantics for Long-form Video Question Answering
☆18Updated last year
michelecafagna26 / cider
Pythonic wrappers for Cider/CiderD evaluation metrics. Provides CIDEr as well as CIDEr-D (CIDEr Defended) which is more robust to gaming …
☆13Updated last year
deepglint / Victor
ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
☆25Updated 2 months ago
SCZwangxiao / video-ReTaKe
Official implementation of paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
☆37Updated 7 months ago
lyan62 / FoodieQA
Official Repo for FoodieQA paper (EMNLP 2024)
☆16Updated 3 months ago
Yaojie-Shen / CoCap
[ICCV 2023] Accurate and Fast Compressed Video Captioning
☆47Updated 2 months ago
MrZilinXiao / AutoVER
[ECCV'24] Official Implementation of Autoregressive Visual Entity Recognizer.
☆14Updated last year
Jiaxuan-Li / EVCap
[CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
☆56Updated last year
jpthu17 / DiCoSA
[IJCAI 2023] Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
☆53Updated last year
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆31Updated 3 weeks ago
ForJadeForest / Lever-LM
The Code for Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
☆16Updated last year
yuezih / less-is-more
Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (ACL 2024)
☆54Updated 11 months ago
1zhou-Wang / MemVR
[ICML 2025] Official implementation of paper 'Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in…
☆158Updated 3 weeks ago
ailab-kyunghee / CM2_DVC
[CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval
☆59Updated last year
nguyentthong / video-language-understanding
[ACL’24 Findings] Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
☆42Updated 3 months ago
AV-Odyssey / AV-Odyssey
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
☆30Updated 9 months ago
BUAADreamer / SPN4CIR
[ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
☆39Updated last month
Lilidamowang / T2VIndexer-generativeSearch
☆12Updated last year
minjoong507 / BM-DETR
[WACV 2025] Official Pytorch code for "Background-aware Moment Detection for Video Moment Retrieval"
☆16Updated 7 months ago
aiming-lab / ReAgent-V
[NeurIPS'25] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
☆37Updated 3 weeks ago
GeWu-Lab / TSPM
Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024.
☆17Updated 11 months ago
RERV / UniAdapter
[ICLR2024] The official implementation of paper "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling", by …
☆76Updated last year
alipay / PC2-NoiseofWeb
Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text …
☆15Updated 2 months ago