KAIST-Visual-AI-Group / APC-VLMLinks
Official code for Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
☆29Updated last month
Alternatives and similar repositories for APC-VLM
Users that are interested in APC-VLM are comparing it to the libraries listed below
Sorting:
- ☆36Updated last month
- [CVPR 2025] GPS as a Control Signal for Image Generation☆18Updated 2 months ago
- Official implementation of the WACV 2025 paper "3D Part Segmentation via Geometric Aggregation of 2D Visual Features"☆18Updated 2 months ago
- Code for "VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement"☆47Updated 6 months ago
- the official repo for "D-AR: Diffusion via Autoregressive Models"☆30Updated last week
- Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"☆28Updated 8 months ago
- [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models …☆64Updated this week
- ☆36Updated 2 weeks ago
- FleVRS: Towards Flexible Visual Relationship Segmentation, NeurIPS 2024☆20Updated 5 months ago
- Official Implementation of ISR-DPO:Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO (AAAI'25)☆19Updated 3 months ago
- Program synthesis for 3D spatial reasoning☆32Updated 3 months ago
- ☆22Updated 7 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆74Updated 3 months ago
- [ICLR 2024] Official implementation of the paper "Toss: High-quality text-guided novel view synthesis from a single image"☆22Updated last year
- [NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives"☆40Updated 6 months ago
- Code for paper "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning"☆38Updated last year
- Official implementation for CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding☆45Updated last year
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆45Updated 4 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆51Updated 5 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆27Updated last month
- ☆14Updated 7 months ago
- Official implementation of PARIS3D (Accepted to ECCV 2024).☆24Updated 8 months ago
- [ICCV 2023] Code for "Multi-task View Synthesis with Neural Radiance Fields"☆11Updated last year
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity"☆28Updated 7 months ago
- Unifying Specialized Visual Encoders for Video Language Models☆18Updated last week
- Github repository for "Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas" (ICML 2025)☆30Updated last month
- Official implementation of "Reangle-A-Video: 4D Video Generation as Video-to-Video Translation"☆41Updated 2 months ago
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)"☆40Updated 11 months ago
- Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces☆35Updated this week
- ☆11Updated 2 weeks ago