OpenGVLab / FluxViT

Make Your Training Flexible: Towards Deployment-Efficient Video Models

☆34

Alternatives and similar repositories for FluxViT

Users who are interested in FluxViT are comparing it to the repositories listed below.

OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62 Updated 4 months ago
Mark12Ding / Dispider
[CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
☆145 Updated 8 months ago
Amshaker / Mobile-VideoGPT
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
☆127 Updated 3 months ago
Hao840 / ADEM-VL
PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"
☆20 Updated last year
HaroldChen19 / VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆36 Updated 5 months ago
tulip-berkeley / open_clip
An open source implementation of CLIP (With TULIP Support)
☆163 Updated 6 months ago
TencentARC / SEED-Bench-R1
☆94 Updated 5 months ago
SHI-Labs / VisPer-LM
[NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024
☆64 Updated last month
NVlabs / LITA
☆189 Updated last year
OpenGVLab / vinci
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
☆76 Updated 10 months ago
HumanMLLM / ViSpeak
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
☆40 Updated 4 months ago
RenShuhuai-Andy / TESTA
[EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
☆49 Updated last year
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆54 Updated 4 months ago
rccchoudhury / rlt
Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers".
☆228 Updated 7 months ago
xjtupanda / Sparrow
Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"
☆48 Updated 2 months ago
kkyuhun94 / dalda
[ECCV'24 Workshops Oral] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling
☆30 Updated last year
bigai-nlco / VideoLLaMB
[ICCV 2025] Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
☆78 Updated 8 months ago
marinero4972 / Open-o3-Video
Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
☆116 Updated last week
DAMO-NLP-SG / DiGIT
[NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
☆72 Updated last year
yellow-binary-tree / MMDuet
Official implementation of paper VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interact…
☆36 Updated 9 months ago
OpenGVLab / VideoChat-R1
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
☆227 Updated last month
Yxxxb / VoCo-LLaMA
[CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
☆194 Updated 5 months ago
WeihuangLin / INF-LLaVA
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
☆42 Updated last year
KD-TAO / OmniZip
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
☆22 Updated this week
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆131 Updated 5 months ago
geshang777 / pix2cap
Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"
☆23 Updated last month
SCZwangxiao / video-FlexReduc
Official implementation of paper AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
☆88 Updated 7 months ago
ylingfeng / Add-SD
Official implementation of Add-SD: Rational Generation without Manual Reference.
☆28 Updated last year
WHB139426 / Grounded-Video-LLM
[EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
☆135 Updated 3 months ago
shaochenze / EAR
☆35 Updated 6 months ago