iOPENCap / awesome-unimodal-training
text-only (language-free) training for multimodal tasks (image/audio/video captioning, retrieval, text-to-image)
☆11 · Updated 3 months ago
Alternatives and similar repositories for awesome-unimodal-training:
Users interested in awesome-unimodal-training are comparing it to the repositories listed below
- [ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling ☆63 · Updated last week
- 🔥 Omni large models and datasets for understanding and generating multi-modalities. ☆13 · Updated 3 months ago
- This is the first released survey paper on hallucinations of large vision-language models (LVLMs). To keep track of this field and contin… ☆60 · Updated 6 months ago
- ☆11 · Updated last week
- Contrastive Video Question Answering via Video Graph Transformer (IEEE T-PAMI'23) ☆19 · Updated 10 months ago
- ☆9 · Updated 8 months ago
- [CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension ☆44 · Updated 9 months ago
- [ECCV'24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario… ☆48 · Updated 4 months ago
- PyTorch code for "Unified Coarse-to-Fine Alignment for Video-Text Retrieval" (ICCV 2023) ☆62 · Updated 7 months ago
- ☆11 · Updated last year
- [CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval ☆48 · Updated 7 months ago
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (EMNLP 2022) ☆87 · Updated last year
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval (TIP 2024) ☆28 · Updated 10 months ago
- Official implementation of the paper "ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding" ☆19 · Updated 3 weeks ago
- Code for the ICML 2024 paper "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition" ☆84 · Updated last month
- Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos", accepted by CVPR 2024. ☆56 · Updated 4 months ago
- Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective (ACL 2024) ☆41 · Updated 3 months ago
- DEEM: Official implementation of "Diffusion models serve as the eyes of large language models for image perception" (ICLR 2025) ☆18 · Updated last month
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ☆45 · Updated 7 months ago
- Can I Trust Your Answer? Visually Grounded Video Question Answering (CVPR'24, Highlight) ☆63 · Updated 6 months ago
- [Paper] [AAAI 2024] Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations ☆127 · Updated 7 months ago
- [EMNLP'23] The official GitHub page for "Evaluating Object Hallucination in Large Vision-Language Models" ☆77 · Updated 10 months ago
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21) ☆141 · Updated 6 months ago
- [ICLR 2024] The official implementation of the paper "UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling", by … ☆72 · Updated last year
- This repo holds the official code and data for "Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with H… ☆17 · Updated 8 months ago
- [AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding ☆86 · Updated last month
- Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023) ☆74 · Updated 6 months ago
- [ICLR 2024] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models ☆141 · Updated 9 months ago
- Video Graph Transformer for Video Question Answering (ECCV'22) ☆46 · Updated last year
- Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models (AAAI 2024) ☆67 · Updated 3 weeks ago