xiaoxing2001 / DeGLALinks
[ACM MM25] Official Pytorch implementation of [Decoupled Global-Local Alignment for Improving Compositional Understanding]
☆13Updated 2 months ago
Alternatives and similar repositories for DeGLA
Users that are interested in DeGLA are comparing it to the libraries listed below
Sorting:
- [ICCV 2025] Dynamic-VLM☆25Updated 9 months ago
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"☆32Updated 6 months ago
- [ACM MM25] The official code of "Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs"☆92Updated 2 months ago
- [ACM MM2025] The official repository for the RealSyn dataset☆37Updated 3 months ago
- SFT+RL boosts multimodal reasoning☆34Updated 3 months ago
- ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs☆25Updated last month
- \infty-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation☆16Updated 7 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration☆24Updated 11 months ago
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs☆30Updated last week
- MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models☆41Updated 6 months ago
- [ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding☆57Updated 9 months ago
- Official implementation of paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding☆37Updated 6 months ago
- (ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"☆40Updated 3 months ago
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". …☆58Updated 11 months ago
- The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025]☆19Updated 7 months ago
- [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment☆57Updated last year
- ☆33Updated 6 months ago
- A Simple Framework of Small-scale LMMs for Video Understanding☆94Updated 4 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆60Updated 2 months ago
- Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning☆37Updated 3 months ago
- [MM2024, oral] "Self-Supervised Visual Preference Alignment" https://arxiv.org/abs/2404.10501☆56Updated last year
- [EMNLP25 Main]The official code of "Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval"☆16Updated 3 weeks ago
- On Path to Multimodal Generalist: General-Level and General-Bench☆19Updated 3 months ago
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model☆22Updated last year
- Official implement of MIA-DPO☆66Updated 8 months ago
- [ICLR2025] γ -MOD: Mixture-of-Depth Adaptation for Multimodal Large Language Models☆39Updated 7 months ago
- [CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts☆18Updated 6 months ago
- [NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun…☆37Updated 7 months ago
- Margin-based Vision Transformer☆41Updated 2 weeks ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo…☆29Updated last year