Mid-Push / SmartCLIP
SmartCLIP: A training method to improve CLIP with both short and long texts
☆23 · Updated 3 months ago
Alternatives and similar repositories for SmartCLIP
Users interested in SmartCLIP are comparing it to the repositories listed below.
- Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models ☆33 · Updated 2 months ago
- [NeurIPS 2025] Official repository for "FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models" ☆23 · Updated 3 weeks ago
- [CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding ☆108 · Updated last month
- [ICCV 2025] The official PyTorch implementation of "LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs" ☆17 · Updated 3 months ago
- [NeurIPS 2024] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion ☆94 · Updated 4 months ago
- [ICCV 2025 Oral] Token Activation Map to Visually Explain Multimodal LLMs ☆82 · Updated 2 months ago
- [CVPR 2025] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression ☆50 · Updated this week
- TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning ☆103 · Updated 4 months ago
- [CVPR 2025] Official implementation of the paper "Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practi…" ☆36 · Updated 3 months ago
- [CVPR 2025] Official PyTorch code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: P…" ☆76 · Updated 3 months ago
- Visual Grounding with Multi-modal Conditional Adaptation (ACM MM 2024 Oral) ☆24 · Updated 4 months ago
- Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models ☆35 · Updated last month
- [LLaVA-Video-R1] First Adaptation of R1 to LLaVA-Video (2025-03-18) ☆32 · Updated 5 months ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆67 · Updated 3 months ago
- [AAAI 2025 Oral] Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints ☆38 · Updated 3 months ago
- [ICCV 2025] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay ☆43 · Updated 3 months ago
- [CVPR 2025] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering ☆38 · Updated 3 months ago
- [ICCV 2023] CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection ☆17 · Updated 5 months ago
- [NeurIPS 2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆193 · Updated 2 months ago
- [ICCV 2025] UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding ☆43 · Updated this week
- [ECCV 2024] VISA: Reasoning Video Object Segmentation via Large Language Model ☆19 · Updated last year
- [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models ☆47 · Updated 4 months ago
- [CVPR 2024] Code for HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation ☆74 · Updated last year
- ☆19 · Updated 5 months ago
- ☆33 · Updated 2 months ago
- Survey: https://arxiv.org/pdf/2507.20198 ☆157 · Updated last month
- ☆12 · Updated 8 months ago
- Official code repository of Shuffle-R1 ☆24 · Updated last month
- [ICME 2024 Oral] DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding ☆23 · Updated 7 months ago
- [NeurIPS 2023] The official implementation of SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation ☆33 · Updated last year