keshik6 / HourVideo

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

☆150

Alternatives and similar repositories for HourVideo

Users who are interested in HourVideo are comparing it to the repositories listed below.

showlab / videollm-online
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
☆499 Updated 2 months ago
Vchitect / RAPO
[CVPR 2025] The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
☆72 Updated 3 weeks ago
hustvl / OmniMamba
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
☆132 Updated 2 months ago
jingyi0000 / R1-VL
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
☆403 Updated 2 weeks ago
vaew / Awesome-spatial-visual-reasoning-MLLMs
Repository for awesome spatial/visual reasoning MLLMs. (focus more on embodied applications)
☆57 Updated 2 weeks ago
mashijie1028 / GenHancer
(ICCV 2025) Enhance CLIP and MLLM's fine-grained visual representations with generative models.
☆68 Updated 3 weeks ago
bytedance / Multi-Reward-Editing
Multi-Reward as Condition for Instruction-Based Image Editing
☆52 Updated 3 months ago
Wiselnn570 / VideoRoPE
[ICML 2025 Oral] An official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
☆170 Updated last week
MCG-NJU / AWT
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
☆103 Updated 9 months ago
dongyh20 / Chain-of-Spot
Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models
☆96 Updated last year
amap-cvlab / MV-Painter
☆244 Updated 2 weeks ago
GWxuan / TSP3D
[CVPR 2025, All Strong Accept] TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
☆210 Updated last month
Alpha-Innovator / OmniCaptioner
Official Repository of OmniCaptioner
☆152 Updated 2 months ago
lwpyh / CoS_codes
CoS: Chain-of-Shot Prompting for Long Video Understanding
☆48 Updated 5 months ago
gordonhu608 / MQT-LLaVA
[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
☆110 Updated last year
VITA-MLLM / Long-VITA
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
☆290 Updated 2 months ago
GalaxyGeneralRobotics / OpenWBT
Official implementation of OpenWBT.
☆654 Updated 3 weeks ago
yongliu20 / UniLSeg
[CVPR 2024] Official implementation of "Universal Segmentation at Arbitrary Granularity with Language Instruction"
☆209 Updated last year
moxin-org / Moxin-LLM
Moxin is a family of fully open-source and reproducible LLMs
☆598 Updated 2 weeks ago
steven-ccq / ViLAMP
[ICML 2025] Official repository for paper "Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation"
☆159 Updated 2 months ago
FanbinLu / STEVE-R1
R1-like Computer-use Agent
☆77 Updated 3 months ago
zhaohengyuan1 / Genixer
(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator
☆111 Updated 3 months ago
Perceive-Anything / PAM
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
☆240 Updated 2 weeks ago
jqtangust / hawk
🔥 🔥 🔥 [NeurIPS 2024] Official Implementation of Hawk: Learning to Understand Open-World Video Anomalies
☆211 Updated 3 months ago
fletcherjiang / LLMEPET
[MM'24 Oral] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval
☆127 Updated 10 months ago
FoundationVision / Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
☆574 Updated last year
Baiqi-Li / NaturalBench
🚀 [NeurIPS24] Make Vision Matter in Visual-Question-Answering (VQA)! Introducing NaturalBench, a vision-centric VQA benchmark (NeurIPS'2…
☆84 Updated 3 weeks ago
ZrrSkywalker / MathVerse
[ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
☆165 Updated 2 months ago
whwu95 / GPT4Vis
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
☆186 Updated last year
DAMO-NLP-SG / VideoRefer
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
☆235 Updated 3 weeks ago