CircleRadon / TokenPacker
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
☆211 · Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for TokenPacker
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ☆557 · Updated 5 months ago
- OmniTokenizer: one model and one weight for image-video joint tokenization. ☆255 · Updated 4 months ago
- Mathematical Visual Instruction Tuning for Multi-modal Large Language Models ☆109 · Updated 3 months ago
- An open-source implementation for training LLaVA-NeXT. ☆386 · Updated 2 weeks ago
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ☆236 · Updated this week
- MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ☆286 · Updated this week
- Official implementation of "Towards Efficient Visual Adaption via Structural Re-parameterization". ☆197 · Updated 6 months ago
- [ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? ☆148 · Updated last month
- 【NeurIPS 2024】Dense Connector for MLLMs ☆133 · Updated 3 weeks ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models ☆86 · Updated 7 months ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆227 · Updated last month
- (ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator ☆108 · Updated 3 weeks ago
- Official implementation of the Law of Vision Representation in MLLMs ☆128 · Updated 2 months ago
- The official implementation of RAR ☆72 · Updated 7 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! ☆117 · Updated 10 months ago
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ☆98 · Updated 5 months ago
- 【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? ☆238 · Updated last month
- (AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions ☆269 · Updated 6 months ago
- [NeurIPS2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆88 · Updated last month
- Code release for "UniVS: Unified and Universal Video Segmentation with Prompts as Queries" (CVPR2024) ☆164 · Updated 4 months ago
- Multi-granularity Correspondence Learning from Long-term Noisy Videos [ICLR 2024, Oral] ☆108 · Updated 6 months ago
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ☆145 · Updated last month
- Evaluating text-to-image/video/3D models with VQAScore ☆220 · Updated 2 months ago
- [ICCV 2023] Spectrum-guided Multi-granularity Referring Video Object Segmentation. ☆81 · Updated 3 weeks ago
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model ☆123 · Updated 3 months ago
- A paper list of recent works on token compression for ViT and VLM ☆128 · Updated this week
- ☆103 · Updated 3 months ago
- A collection of visual instruction tuning datasets. ☆76 · Updated 7 months ago
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ☆303 · Updated last month
- SVIT: Scaling up Visual Instruction Tuning ☆163 · Updated 4 months ago