XMUDeepLIT / UME-R1
☆30 · Updated this week
Alternatives and similar repositories for UME-R1
Users interested in UME-R1 are comparing it to the libraries listed below.
- [ACM MM 2025] The official code of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs" ☆96 · Updated last week
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning ☆73 · Updated 6 months ago
- [ACM MM 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives ☆39 · Updated 3 months ago
- [ICLR 2023] This is the code repo for our ICLR '23 paper "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Spa… ☆53 · Updated last year
- [Paper] [AAAI 2024] Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations ☆153 · Updated last year
- Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original c… ☆45 · Updated last year
- [CVPR 2024] Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension ☆60 · Updated last year
- [CVPR 2025] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant ☆171 · Updated 5 months ago
- [SIGIR 2024] Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval ☆43 · Updated last year
- Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos ☆26 · Updated last year
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections (EMNLP 2022) ☆97 · Updated 2 years ago
- Code for DeCo: Decoupling token compression from semantic abstraction in multimodal large language models ☆75 · Updated 5 months ago
- All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023) ☆166 · Updated last year
- Narrative movie understanding benchmark ☆77 · Updated 6 months ago
- [IEEE TMM 2025 & ACL 2024 Findings] LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition ☆35 · Updated 5 months ago
- Official implementation of the paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding ☆38 · Updated 9 months ago
- [EMNLP 2023] InfoSeek: A New VQA Benchmark focused on Visual Info-Seeking Questions ☆25 · Updated last year
- [CVPR 2024] LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge ☆153 · Updated 3 months ago
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (AAAI 2024) ☆72 · Updated last year
- (CVPR 2024) MeaCap: Memory-Augmented Zero-shot Image Captioning ☆54 · Updated last year
- [CVPR 2024] Official Code for the Paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" ☆144 · Updated last year
- Official repository of the MMDU dataset ☆98 · Updated last year
- [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics ☆37 · Updated 3 months ago
- [CVPR 2023] VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval ☆38 · Updated 2 years ago
- A benchmark for evaluating the capabilities of large vision-language models (LVLMs) ☆46 · Updated 2 years ago
- A Survey on Benchmarks of Multimodal Large Language Models ☆145 · Updated 5 months ago
- 【ICLR 2024, Spotlight】 Sentence-level Prompts Benefit Composed Image Retrieval ☆91 · Updated last year