SxJyJay / UniToken
UniToken is an auto-regressive generation model that combines discrete and continuous representations to process visual inputs, making it easy to seamlessly integrate visual understanding and image generation tasks.
☆35 Updated last month
Alternatives and similar repositories for UniToken:
Users who are interested in UniToken are comparing it to the libraries listed below:
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs ☆30 Updated 2 months ago
- [NeurIPS 2024] Lumen: a Large multimodal model with versatile vision-centric capabilities ☆24 Updated 6 months ago
- ☆18 Updated 2 months ago
- [ACM MM 2024] ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack ☆11 Updated 3 months ago
- This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehens… ☆63 Updated last week
- ☆107 Updated last month
- Video-R1: Towards Super Reasoning Ability in Video Understanding MLLMs ☆105 Updated last month
- ☆23 Updated 5 months ago
- ☆50 Updated last week
- [ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs' ☆89 Updated last week
- ☆90 Updated this week
- [ECCV2024] Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models ☆16 Updated 8 months ago
- Official repository for CoMM Dataset ☆30 Updated 2 months ago
- [ACM MM2023] Code Release of GCMA: Generative Cross-Modal Transferable Adversarial Attacks from Images to Videos ☆12 Updated 11 months ago
- 🔥CVPR 2025 Multimodal Large Language Models Paper List ☆119 Updated 2 weeks ago
- Envolving Temporal Reasoning Capability into LMMs via Temporal Consistent Reward ☆18 Updated last week
- ☆18 Updated 3 months ago
- (CVPR 2025) PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction ☆81 Updated 3 weeks ago
- The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate". ☆96 Updated 4 months ago
- [NeurIPS 2024] Visual Perception by Large Language Model’s Weights ☆41 Updated 5 months ago
- VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models ☆49 Updated 8 months ago
- [CVPR 2025] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? ☆21 Updated last week
- Official PyTorch code of "Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation". ☆22 Updated last month
- [NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models" ☆169 Updated 6 months ago
- ☆40 Updated 2 months ago
- HallE-Control: Controlling Object Hallucination in LMMs ☆30 Updated 11 months ago
- [NeurIPS 2024] The official code of paper "Automated Multi-level Preference for MLLMs" ☆19 Updated 6 months ago
- The official implementation of A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation ☆17 Updated 4 months ago
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation ☆53 Updated last week
- [ICCV 2023] Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting ☆15 Updated last year