☆21Jan 17, 2025Updated last year
Alternatives and similar repositories for OpenTokenizer
Users who are interested in OpenTokenizer are comparing it to the libraries listed below
- ☆24Dec 26, 2024Updated last year
- UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation☆35Nov 24, 2025Updated 3 months ago
- [NeurIPS 2024]OmniTokenizer: one model and one weight for image-video joint tokenization.☆322Jul 9, 2024Updated last year
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models"☆45Apr 3, 2025Updated 10 months ago
- Open-source red teaming framework for MLLMs with 37+ attack methods☆226Jan 16, 2026Updated last month
- ☆54Jun 4, 2024Updated last year
- [ACM MM2023] Code Release of GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos☆12Mar 29, 2024Updated last year
- PyTorch implementation for the paper titled "SimpleAR: Pushing the Frontier of Autoregressive Visual Generation"☆426Jun 20, 2025Updated 8 months ago
- Adversarial Examples Detection Benchmark☆17Dec 6, 2024Updated last year
- [Preprint] GMem: A Modular Approach for Ultra-Efficient Generative Models☆43Mar 11, 2025Updated 11 months ago
- [ACM MM 2024] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack☆14Dec 20, 2024Updated last year
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model☆19Jul 20, 2024Updated last year
- [CVPR 2024] The official implementation of paper "synthesize, diagnose, and optimize: towards fine-grained vision-language understanding"☆52Jun 16, 2025Updated 8 months ago
- [CVPRW 2025] UniToken is an auto-regressive generation model that combines discrete and continuous representations to process visual inpu…☆105Apr 23, 2025Updated 10 months ago
- Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"☆25Dec 16, 2025Updated 2 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…☆79Jun 17, 2024Updated last year
- [ECCV-24] This is the official implementation of the paper "SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation".☆27Oct 13, 2024Updated last year
- ☆134Dec 22, 2023Updated 2 years ago
- ☆24Oct 28, 2024Updated last year
- [NeurIPS 2024] Lumen: a Large multimodal model with versatile vision-centric capabilities☆25Sep 27, 2024Updated last year
- [AAAI2022] Code Release of Attacking Video Recognition Models with Bullet-Screen Comments☆25Mar 30, 2024Updated last year
- A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…☆52Jul 24, 2025Updated 7 months ago
- Official implementation of "Re-Aligning Language to Visual Objects with an Agentic Workflow"☆31Apr 20, 2025Updated 10 months ago
- Twinkle✨: Training workbench to make your model glow.☆45Updated this week
- The official implementation of our work Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanc…☆12Oct 14, 2024Updated last year
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs☆34Aug 5, 2025Updated 6 months ago
- ☆39Dec 4, 2023Updated 2 years ago
- Open-vocabulary Semantic Segmentation☆33Feb 16, 2024Updated 2 years ago
- [AAAI2025 selected as oral] - Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints☆44Jul 2, 2025Updated 7 months ago
- DisTime: Distribution-based Time Representation for Video Large Language Models.☆18Jul 10, 2025Updated 7 months ago
- The repository of VG-Refiner paper☆17Dec 9, 2025Updated 2 months ago
- [CVPR 2025 🔥]A Large Multimodal Model for Pixel-Level Visual Grounding in Videos☆96Apr 14, 2025Updated 10 months ago
- Finetuning & extending DiffusionDet to video & pedestrian multi-object-tracking☆13Apr 12, 2023Updated 2 years ago
- [ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model☆37Nov 27, 2024Updated last year
- ☆141Oct 15, 2025Updated 4 months ago
- OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models☆29Feb 4, 2026Updated 3 weeks ago
- [CVPR 2024] LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation☆13Jun 17, 2024Updated last year
- Modern normalizing flows in Python. Simple to use and easily extensible.☆12Feb 11, 2026Updated 2 weeks ago
- ☆11Jan 18, 2025Updated last year