OpenGVLab / FluxViT
Make Your Training Flexible: Towards Deployment-Efficient Video Models
☆30 Updated 2 months ago
Alternatives and similar repositories for FluxViT
Users who are interested in FluxViT are comparing it to the repositories listed below.
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 Updated 9 months ago
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆126 Updated 4 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆54 Updated 3 weeks ago
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model ☆105 Updated this week
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆68 Updated 6 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆28 Updated last month
- An open source implementation of CLIP (With TULIP Support) ☆162 Updated 2 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆60 Updated 5 months ago
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆45 Updated 6 months ago
- ☆180 Updated 9 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆222 Updated 4 months ago
- Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interact… ☆33 Updated 6 months ago
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning ☆169 Updated 2 months ago
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ☆119 Updated 4 months ago
- Official implementation of Add-SD: Rational Generation without Manual Reference. ☆27 Updated 11 months ago
- [ECCV'24 Workshops Oral] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling ☆31 Updated 9 months ago
- ☆87 Updated last month
- [T-PAMI 2025] EMOv2: Pushing 5M Vision Model Frontier ☆46 Updated 7 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆82 Updated 2 weeks ago
- ☆75 Updated 5 months ago
- Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing" ☆47 Updated last month
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding ☆51 Updated last year
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆48 Updated last month
- Video-LlaVA fine-tune for CinePile evaluation ☆51 Updated last year
- [CVPR'24 Highlight] PyTorch Implementation of Object Recognition as Next Token Prediction ☆180 Updated 3 months ago
- Official implementation of "TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models" ☆40 Updated 3 weeks ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆104 Updated 2 weeks ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 Updated last year
- Code for "Scaling Language-Free Visual Representation Learning" paper (Web-SSL). ☆173 Updated 3 months ago
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆180 Updated last month