[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
☆181 · Dec 19, 2025 · Updated 2 months ago
Alternatives and similar repositories for AKS
Users who are interested in AKS are comparing it to the repositories listed below.
- TStar is a unified temporal search framework for long-form video question answering ☆87 · Sep 2, 2025 · Updated 6 months ago
- [NeurIPS 2025 Spotlight] Official PyTorch implementation of Vgent ☆40 · Nov 30, 2025 · Updated 3 months ago
- [NeurIPS 2024] Artemis: Towards Referential Understanding in Complex Videos ☆27 · Apr 8, 2025 · Updated 10 months ago
- Agentic Keyframe Search for Video Question Answering ☆16 · Apr 7, 2025 · Updated 10 months ago
- ☆29 · Jul 25, 2025 · Updated 7 months ago
- [CVPR 2025] DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution ☆58 · Mar 4, 2025 · Updated 11 months ago
- [AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues ☆61 · May 2, 2025 · Updated 10 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model ☆17 · Feb 13, 2025 · Updated last year
- [NeurIPS 2025] FastVID: Dynamic Density Pruning for Fast Video Large Language Models ☆26 · Nov 10, 2025 · Updated 3 months ago
- [ICCV'25] Official code for the paper "Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models" ☆70 · Jan 13, 2026 · Updated last month
- ☆31 · Sep 24, 2024 · Updated last year
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency ☆60 · Jun 6, 2025 · Updated 8 months ago
- [CVPR 2025] DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models ☆101 · Nov 22, 2025 · Updated 3 months ago
- [AAAI'25] Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP ☆19 · Aug 5, 2025 · Updated 6 months ago
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆41 · Jan 26, 2026 · Updated last month
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning (CVPR 2026) ☆55 · Dec 30, 2025 · Updated 2 months ago
- Implementation of the paper "CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis" ☆26 · Dec 19, 2025 · Updated 2 months ago
- (ICCV2025) Official repository of the paper "ViSpeak: Visual Instruction Feedback in Streaming Videos" ☆45 · Jul 1, 2025 · Updated 8 months ago
- Implementation of the model "MC-ViT" from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆27 · Jan 17, 2026 · Updated last month
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model ☆19 · Jul 20, 2024 · Updated last year
- ☆22 · Dec 11, 2025 · Updated 2 months ago
- [AAAI 26 Demo] Official repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal P… ☆64 · Jan 27, 2026 · Updated last month
- MR. Video: MapReduce is the Principle for Long Video Understanding ☆30 · Apr 23, 2025 · Updated 10 months ago
- PyTorch code for "Unified Coarse-to-Fine Alignment for Video-Text Retrieval" (ICCV 2023) ☆66 · Jun 7, 2024 · Updated last year
- ☆13 · May 17, 2025 · Updated 9 months ago
- [CVPR 2025] Code for "Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering" ☆20 · Jun 16, 2025 · Updated 8 months ago
- WBSR: Rethinking Imbalance in Image Super-Resolution for Efficient Inference ☆13 · Oct 8, 2024 · Updated last year
- TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs ☆23 · Sep 21, 2025 · Updated 5 months ago
- Long Context Research ☆26 · Jan 26, 2026 · Updated last month
- 🔥🔥 First-ever hour-scale video understanding models ☆611 · Jul 14, 2025 · Updated 7 months ago
- Official implementation (PyTorch) of the "VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Capti… ☆23 · Jan 26, 2025 · Updated last year
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models ☆55 · Feb 1, 2026 · Updated last month
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding ☆50 · Jan 14, 2025 · Updated last year
- [ICCV 2025] LVBench: An Extreme Long Video Understanding Benchmark ☆137 · Jul 9, 2025 · Updated 7 months ago
- [ICLR2023] Video Scene Graph Generation from Single-Frame Weak Supervision ☆12 · Sep 17, 2023 · Updated 2 years ago
- [ICCV 2025] MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation ☆20 · Sep 5, 2025 · Updated 5 months ago
- Collection of papers about video-audio understanding ☆22 · Dec 26, 2025 · Updated 2 months ago
- [ICLR 2026] Official repo for "Spotlight on Token Perception for Multimodal Reinforcement Learning" ☆49 · Jan 30, 2026 · Updated last month
- [ICCV 2025] Official code for "AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning" ☆55 · Oct 9, 2025 · Updated 4 months ago