aimagelab / ReT
[CVPR 2025] Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
☆23 Updated last week
Alternatives and similar repositories for ReT
Users who are interested in ReT are comparing it to the libraries listed below:
- [NeurIPS2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆192 Updated 2 months ago
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos ☆83 Updated 5 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆166 Updated 11 months ago
- Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024) ☆91 Updated 11 months ago
- [ICCV 2025] VisRL: Intention-Driven Visual Perception via Reinforced Reasoning ☆39 Updated 3 months ago
- [CVPR 2025] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering ☆46 Updated 2 months ago
- [CVPR 2025] RAP: Retrieval-Augmented Personalization ☆69 Updated last month
- Visual self-questioning for large vision-language assistant. ☆43 Updated last month
- [NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models ☆72 Updated 4 months ago
- [CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding ☆103 Updated 3 weeks ago
- [CVPR 2024] Official Code for the Paper "Compositional Chain-of-Thought Prompting for Large Multimodal Models" ☆135 Updated last year
- [CVPR 2024] Improving language-visual pretraining efficiency by performing cluster-based masking on images. ☆29 Updated last year
- [ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models ☆102 Updated 11 months ago
- Official code for NeurIPS 2025 paper "GRIT: Teaching MLLMs to Think with Images" ☆137 Updated last month
- ☆70 Updated last year
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding ☆45 Updated 8 months ago
- [ECCV 2024] ControlCap: Controllable Region-level Captioning ☆79 Updated 10 months ago
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" ☆83 Updated last year
- [ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions ☆136 Updated 4 months ago
- [CVPR2025] Code Release of F-LMM: Grounding Frozen Large Multimodal Models ☆103 Updated 3 months ago
- [CVPR 2025] Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention ☆46 Updated last year
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model ☆189 Updated last year
- [BMVC 2024 Oral ✨] Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization ☆18 Updated last year
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆189 Updated 3 months ago
- PyTorch code for "Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training" ☆36 Updated last year
- [NeurIPS 2024] Mitigating Object Hallucination via Concentric Causal Attention ☆61 Updated 3 weeks ago
- [ICCVW 25] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning ☆150 Updated last month
- 【NeurIPS 2024】 Dense Connector for MLLMs ☆175 Updated 11 months ago
- [ECCV 2024] Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs ☆140 Updated 10 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection ☆117 Updated last month