TIGER-AI-Lab / PixelWorld
The official code of "PixelWorld: Towards Perceiving Everything as Pixels"
☆14 · Updated 6 months ago
Alternatives and similar repositories for PixelWorld
Users interested in PixelWorld are comparing it to the libraries listed below.
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆82 · Updated 2 weeks ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆48 · Updated last month
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? ☆63 · Updated last month
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆54 · Updated 3 weeks ago
- On Path to Multimodal Generalist: General-Level and General-Bench ☆19 · Updated last month
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr…" ☆76 · Updated 8 months ago
- Official implementation of ECCV 2024 paper: POA ☆24 · Updated last year
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning ☆45 · Updated 3 weeks ago
- ∞-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation ☆15 · Updated 5 months ago
- Official implementation of Muddit [Meissonic II]: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model ☆75 · Updated last week
- ☆37 · Updated 2 months ago
- A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models ☆18 · Updated 7 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 · Updated 9 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆60 · Updated 5 months ago
- Code for Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [COLM 2024] ☆22 · Updated last year
- Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing" ☆47 · Updated last month
- Evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆26 · Updated 7 months ago
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration ☆47 · Updated 7 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model ☆17 · Updated 6 months ago
- ☆23 · Updated 4 months ago
- [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models ☆45 · Updated 2 months ago
- [CVPR'24] Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities ☆99 · Updated last year
- [ACL 2025 Findings] Benchmarking Multihop Multimodal Internet Agents ☆46 · Updated 5 months ago
- Official implementation of RMoE (Layerwise Recurrent Router for Mixture-of-Experts) ☆22 · Updated last year
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆28 · Updated last month
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆69 · Updated 9 months ago
- Code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR 2025] ☆19 · Updated 5 months ago
- Official implementation of "Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology" ☆49 · Updated last month
- HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation ☆63 · Updated 5 months ago
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better ☆36 · Updated last month