OpenGVLab / PVC
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
⭐ 25 · Updated 2 months ago
Alternatives and similar repositories for PVC:
Users who are interested in PVC are comparing it to the repositories listed below.
- ⭐ 26 · Updated 6 months ago
- Learning 1D Causal Visual Representation with De-focus Attention Networks · ⭐ 32 · Updated 8 months ago
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" · ⭐ 32 · Updated 8 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM · ⭐ 19 · Updated 2 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… · ⭐ 35 · Updated 8 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark · ⭐ 75 · Updated 3 weeks ago
- [NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" · ⭐ 33 · Updated 2 months ago
- Retrieval-Augmented Personalization · ⭐ 13 · Updated 2 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" · ⭐ 45 · Updated 4 months ago
- [AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Visio… · ⭐ 23 · Updated 2 weeks ago
- [ECCV 2024] This is the official implementation of "Stitched ViTs are Flexible Vision Backbones". · ⭐ 27 · Updated last year
- [ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding · ⭐ 29 · Updated 2 months ago
- The official repository for paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models". · ⭐ 29 · Updated this week
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression · ⭐ 51 · Updated last week
- This is the official PyTorch implementation of "ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality" · ⭐ 45 · Updated last month
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences · ⭐ 35 · Updated last month
- [ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model · ⭐ 27 · Updated 2 months ago
- ⭐ 58 · Updated last year
- Code release for "SegLLM: Multi-round Reasoning Segmentation" · ⭐ 66 · Updated 3 weeks ago
- Code Release of F-LMM: Grounding Frozen Large Multimodal Models · ⭐ 62 · Updated 6 months ago
- [ECCV 2024] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation · ⭐ 33 · Updated 5 months ago
- ⭐ 23 · Updated last month
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision · ⭐ 29 · Updated 3 months ago
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction". · ⭐ 52 · Updated last month
- Official Pytorch implementation for LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (ICLR 2025 Oral). · ⭐ 50 · Updated last week