Mid-Push / SmartCLIP
SmartCLIP: A training method to improve CLIP with both short and long texts
☆23 · Updated 3 months ago
Alternatives and similar repositories for SmartCLIP
Users interested in SmartCLIP are comparing it to the repositories listed below.
- Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models ☆33 · Updated 2 months ago
- [NeurIPS 2025] Official repository for "FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models" ☆23 · Updated 3 weeks ago
- [CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding ☆108 · Updated last month
- [ICCV 2025] The official PyTorch implementation of "LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs" ☆17 · Updated 3 months ago
- [NeurIPS 2024] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion ☆94 · Updated 4 months ago
- [ICCV 2025 Oral] Token Activation Map to Visually Explain Multimodal LLMs ☆82 · Updated 2 months ago
- [CVPR 2025] FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression ☆50 · Updated this week
- TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning ☆103 · Updated 4 months ago
- [CVPR 2025] Official implementation of the paper "Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practi…" ☆36 · Updated 3 months ago
- [CVPR 2025] Official PyTorch code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: P…" ☆76 · Updated 3 months ago
- Visual Grounding with Multi-modal Conditional Adaptation (ACM MM 2024 Oral) ☆24 · Updated 4 months ago
- Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models ☆35 · Updated last month
- [LLaVA-Video-R1] First Adaptation of R1 to LLaVA-Video (2025-03-18) ☆32 · Updated 5 months ago
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆67 · Updated 3 months ago
- [AAAI 2025 Oral] Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints ☆38 · Updated 3 months ago
- [ICCV 2025] p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay ☆43 · Updated 3 months ago
- [CVPR 2025] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering ☆38 · Updated 3 months ago
- [ICCV 2023] CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection ☆17 · Updated 5 months ago
- [NeurIPS 2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆193 · Updated 2 months ago
- [ICCV 2025] UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding ☆43 · Updated this week
- [ECCV 2024] VISA: Reasoning Video Object Segmentation via Large Language Model ☆19 · Updated last year
- [CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models ☆47 · Updated 4 months ago
- [CVPR 2024] Code for HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation ☆74 · Updated last year
- ☆19 · Updated 5 months ago
- ☆33 · Updated 2 months ago
- Survey: https://arxiv.org/pdf/2507.20198 ☆157 · Updated last month
- ☆12 · Updated 8 months ago
- Official code repository of Shuffle-R1 ☆24 · Updated last month
- [ICME 2024 Oral] DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding ☆23 · Updated 7 months ago
- [NeurIPS 2023] The official implementation of SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation ☆33 · Updated last year