NOVAglow646 / Monet
Official code for "Monet: Reasoning in Latent Visual Space Beyond Image and Language"
☆31 Updated last week
Alternatives and similar repositories for Monet
Users that are interested in Monet are comparing it to the libraries listed below
- [MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling." ☆137 Updated 4 months ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆109 Updated 3 months ago
- CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning ☆32 Updated 3 months ago
- ☆63 Updated 4 months ago
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆94 Updated 3 weeks ago
- [ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents ☆47 Updated 9 months ago
- [NeurIPS 2025 Oral] Exploring Diffusion Transformer Designs via Grafting ☆64 Updated 5 months ago
- Official PyTorch implementation of TokenSet. ☆127 Updated 8 months ago
- NeuMeta transforms neural networks by allowing a single model to adapt on the fly to different sizes, generating the right weights when n… ☆43 Updated last year
- PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos ☆46 Updated 5 months ago
- [NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024 ☆66 Updated last month
- ☆39 Updated 6 months ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆122 Updated 4 months ago
- [NeurIPS 2024] A task generation and model evaluation system for multimodal language models. ☆73 Updated last year
- VCode: SVG as Symbolic Visual Representation ☆112 Updated last week
- PICABench: How Far Are We from Physically Realistic Image Editing? ☆30 Updated 3 weeks ago
- [NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆48 Updated 5 months ago
- DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning ☆163 Updated 2 weeks ago
- Code and data for the paper: Learning Action and Reasoning-Centric Image Editing from Videos and Simulation ☆31 Updated 5 months ago
- NEO Series: Native Vision-Language Models from First Principles ☆225 Updated last month
- ☆68 Updated 2 months ago
- Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025) ☆43 Updated last week
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆37 Updated 5 months ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆79 Updated 11 months ago
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision ☆173 Updated last week
- [CVPR 2025] Science-T2I: Addressing Scientific Illusions in Image Synthesis ☆62 Updated 7 months ago
- Pytorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 Updated last year
- Official PyTorch Implementation for Dual-Process Image Generation, ICCV 2025 ☆112 Updated 3 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 Updated last year
- The official repo of VideoAgentTrek ☆34 Updated last month