yliu-cs / PiTeLinks

[ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model

☆17

Alternatives and similar repositories for PiTe

Users that are interested in PiTe are comparing it to the libraries listed below

Sorting:

mlvlab / DeepVideoR1
[NeurIPS25] Official Implementation (Pytorch) of "DeepVideo-R1"
☆28Updated last week
Yu-xm / Unicorn
Text-Only Data Synthesis for Vision Language Model Training
☆22Updated 5 months ago
Andy-Cheng / TEMPURA
TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of u…
☆23Updated 5 months ago
path2generalist / General-Level
On Path to Multimodal Generalist: General-Level and General-Bench
☆19Updated 4 months ago
weixi-feng / TC-Bench
☆25Updated last year
Vision-CAIR / Infinibench
Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
☆18Updated 2 weeks ago
TIGER-AI-Lab / VISTA
The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025]
☆20Updated 8 months ago
qishisuren123 / AnyCap
A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…
☆52Updated 3 months ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆51Updated 3 months ago
RenShuhuai-Andy / NBP
Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
☆39Updated 9 months ago
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆40Updated 8 months ago
Gen-Verse / HermesFlow
[NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
☆71Updated 2 months ago
zehanwang01 / OmniBind
☆33Updated 7 months ago
HumanMLLM / ViSpeak
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
☆40Updated 4 months ago
showlab / FQGAN
FQGAN: Factorized Visual Tokenization and Generation
☆54Updated 7 months ago
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆26Updated 11 months ago
TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆77Updated 4 months ago
Pepper-lll / LMforImageGeneration
Codebase for the paper-Elucidating the design space of language models for image generation
☆46Updated last year
TIGER-AI-Lab / Vamba
Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025]
☆94Updated 3 months ago
TencentARC / MindOmni
☆132Updated last month
AV-Odyssey / AV-Odyssey
This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"
☆30Updated 10 months ago
zhijie-group / UniCMs
☆39Updated 6 months ago
Yui010206 / CREMA
[ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
☆54Updated 4 months ago
waltonfuture / RL-with-Cold-Start
SFT+RL boosts multimodal reasoning
☆37Updated 4 months ago
daeunni / VideoRepair
Code for "VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement"
☆50Updated 11 months ago
OmniMMI / OmniMMI
[CVPR 2025] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
☆19Updated 7 months ago
DAMO-NLP-SG / DiGIT
[NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
☆72Updated last year
NVlabs / QLIP
[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
☆93Updated 8 months ago
showlab / EvolveDirector
[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.
☆50Updated last year
SilentView / LVD-2M
[NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"
☆73Updated last year