OpenGVLab / FluxViT
Make Your Training Flexible: Towards Deployment-Efficient Video Models
☆30 Updated 2 months ago
Alternatives and similar repositories for FluxViT
Users who are interested in FluxViT are comparing it to the libraries listed below.
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆129 Updated 5 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆56 Updated last month
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 Updated 10 months ago
- An open source implementation of CLIP (With TULIP Support) ☆162 Updated 3 months ago
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model ☆108 Updated 3 weeks ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆32 Updated 2 months ago
- ☆182 Updated 10 months ago
- Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interact… ☆34 Updated 6 months ago
- [EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ☆121 Updated last week
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆222 Updated 5 months ago
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆47 Updated 7 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆69 Updated 7 months ago
- ☆87 Updated 2 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆60 Updated 6 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding ☆51 Updated last year
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ☆179 Updated 2 weeks ago
- ☆78 Updated 5 months ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs" ☆49 Updated 5 months ago
- [ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆72 Updated 6 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆83 Updated last month
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆49 Updated 2 months ago
- Code for CVPR25 paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆134 Updated 2 months ago
- ☆139 Updated 11 months ago
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning ☆148 Updated 3 weeks ago
- ☆35 Updated 11 months ago
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ☆150 Updated 11 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆210 Updated 7 months ago
- [ECCV'24 Workshops Oral] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling ☆31 Updated 9 months ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning ☆109 Updated last week
- [ICCV'25] HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics ☆33 Updated 2 weeks ago