Mid-Push / SmartCLIP
SmartCLIP: A training method to improve CLIP with both short and long texts
☆24Updated 4 months ago
Alternatives and similar repositories for SmartCLIP
Users that are interested in SmartCLIP are comparing it to the libraries listed below
- [CVPR2025] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression☆54Updated 3 weeks ago
- [NeurIPS 2025] Official repository for “FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models”☆24Updated last month
- [CVPR 2025] Official PyTorch Code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: P…☆81Updated 4 months ago
- Official code repository of Shuffle-R1☆24Updated 2 months ago
- Code of LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents☆17Updated 4 months ago
- [ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs☆107Updated 2 months ago
- [NeurIPS2024] - SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion☆94Updated this week
- ☆12Updated 9 months ago
- 🚀 Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models☆36Updated last week
- 🚀 Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models☆33Updated 3 months ago
- [AAAI2025 selected as oral] - Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints☆40Updated 4 months ago
- TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning☆105Updated 5 months ago
- [CVPR2025] Official implementation of the paper "Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practi…☆37Updated this week
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding☆75Updated 3 months ago
- [AAAI 2025] CoPEFT: Fast Adaptation Framework for Multi-Agent Collaborative Perception with Parameter-Efficient Fine-Tuning☆23Updated 6 months ago
- [CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding☆120Updated 2 months ago
- [ICML2025] Official Code of From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection☆24Updated 4 months ago
- ☆12Updated 10 months ago
- Some experiences for new researchers to grow up☆43Updated 2 years ago
- Official implementation of ResCLIP: Residual Attention for Training-free Dense Vision-language Inference☆49Updated last week
- [LLaVA-Video-R1]✨First Adaptation of R1 to LLaVA-Video (2025-03-18)☆34Updated 5 months ago
- [NeurIPS-2024] The official Implementation of "Instruction-Guided Visual Masking"☆39Updated 11 months ago
- [ICASSP 2024] VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders☆16Updated 8 months ago
- [NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models☆72Updated 5 months ago
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model☆19Updated last year
- [AAAI 2025] Official implementation of the paper "EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation"☆32Updated 10 months ago
- [NeurIPS2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models"☆195Updated 3 months ago
- [ICCV2023] CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection☆17Updated 6 months ago
- Official Implementation of OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation☆33Updated 3 months ago
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning"☆84Updated last year