ali-vilab / CAPabilityLinks

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

☆25

Alternatives and similar repositories for CAPability

Users that are interested in CAPability are comparing it to the libraries listed below

Sorting:

TencentARC / TimeLens
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
☆81Updated 3 weeks ago
CSU-JPG / TextAtlas
A Large-scale Dataset for training and evaluating model's ability on Dense Text Image Generation
☆86Updated 3 months ago
TimeMarker-LLM / TimeMarker
A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
☆104Updated last year
yongliang-wu / NumPro
[CVPR2025] Number it: Temporal Grounding Videos like Flipping Manga
☆139Updated last week
yeliudev / R2-Tuning
🌀 R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
☆90Updated last year
zjuruizhechen / TVG-R1
[EMNLP 2025 Industry] Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning
☆35Updated 2 months ago
SCZwangxiao / video-FlexReduc
Official implementation of paper AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
☆88Updated 8 months ago
gyxxyg / TRACE
[ICLR 2025] TRACE: Temporal Grounding Video LLM via Casual Event Modeling
☆143Updated 4 months ago
baaivision / DIVA
[ICLR 2025] Diffusion Feedback Helps CLIP See Better
☆300Updated 11 months ago
showlab / VideoLISA
[NeurlPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
☆143Updated last year
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆137Updated 7 months ago
WHB139426 / Grounded-Video-LLM
[EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
☆139Updated 4 months ago
hzphzp / WeGen
☆27Updated 8 months ago
PKU-YuanGroup / WISE
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
☆176Updated 2 months ago
TencentARC / TokLIP
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
☆235Updated 4 months ago
zhang9302002 / ThinkingWithVideos
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
☆75Updated 2 months ago
EvolvingLMMs-Lab / LongVT
Incentivizing "Thinking with Long Videos" via Native Tool Calling
☆166Updated this week
Zhuo-Cao / FlashVTG
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding. (WACV2025)
☆33Updated 8 months ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆131Updated 5 months ago
Wang-Xiaodong1899 / CVPR25-MLLM-Paper-List
🔥CVPR 2025 Multimodal Large Language Models Paper List
☆154Updated 9 months ago
MengLcool / SliMM
☆24Updated last year
PhoenixZ810 / RISEBench
[NIPS 2025 DB Oral] Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
☆134Updated last week
appletea233 / Temporal-R1
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
☆60Updated 7 months ago
OpenGVLab / VideoChat-R1
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
☆252Updated 2 months ago
zai-org / LVBench
[ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark
☆131Updated 6 months ago
Visual-AI / PruneVid
[ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models
☆63Updated 7 months ago
mutonix / Vript
☆158Updated 11 months ago
qiujihao19 / Artemis
☆27Updated 9 months ago
hanghuacs / FineCaption
☆37Updated 6 months ago
ncTimTang / AKS
[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
☆157Updated 3 weeks ago