TRI-ML / prismatic-vlms
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
⭐ 652 · Updated 9 months ago
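For context, the prismatic-vlms README documents a `prismatic.load` entry point for pretrained checkpoints, plus a prompt-builder and `generate` API. The sketch below follows that pattern; treat the exact signatures, the model ID `prism-dinosiglip+7b`, and the image URL as assumptions that may differ across versions of the repo.

```python
# Minimal inference sketch following the pattern shown in the prismatic-vlms
# README. `load`, `get_prompt_builder`, and `generate` signatures are
# assumptions and may differ across versions; gated LM backbones may also
# require passing an `hf_token` to `load`.
import requests
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pretrained VLM by model ID (auto-downloaded from the HF Hub).
vlm = load("prism-dinosiglip+7b")
vlm.to(device, dtype=torch.bfloat16)

# Fetch an example image and build a single-turn prompt.
url = "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="What is going on in this image?")

# Generate a response conditioned on the image and the formatted prompt.
generated_text = vlm.generate(
    image,
    prompt_builder.get_prompt(),
    do_sample=True,
    temperature=0.4,
    max_new_tokens=256,
)
print(generated_text)
```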
Alternatives and similar repositories for prismatic-vlms:
Users interested in prismatic-vlms are comparing it to the libraries listed below.
- Compose multimodal datasets (⭐ 351 · Updated this week)
- Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning (⭐ 341 · Updated 4 months ago)
- Implementation of "PaLM-E: An Embodied Multimodal Language Model" (⭐ 299 · Updated last year)
- (⭐ 332 · Updated 3 months ago)
- Official repo and evaluation implementation of VSI-Bench (⭐ 463 · Updated last month)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (⭐ 343 · Updated 3 weeks ago)
- Evaluating and reproducing real-world robot manipulation policies (e.g., RT-1, RT-1-X, Octo) in simulation under common setups (e.g., Goo… (⭐ 595 · Updated 3 weeks ago)
- (⭐ 610 · Updated last year)
- VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning (⭐ 108 · Updated 7 months ago)
- Heterogeneous Pre-trained Transformer (HPT) as Scalable Policy Learner (⭐ 485 · Updated 4 months ago)
- [AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference (⭐ 272 · Updated 3 months ago)
- OpenEQA: Embodied Question Answering in the Era of Foundation Models (⭐ 272 · Updated 7 months ago)
- Cosmos-Reason1 models understand physical common sense and generate appropriate embodied decisions in natural language through long c… (⭐ 295 · Updated 3 weeks ago)
- Embodied Chain of Thought: a robotic policy that reasons to solve the task (⭐ 225 · Updated 2 weeks ago)
- Embodied Reasoning Question Answer (ERQA) Benchmark (⭐ 139 · Updated last month)
- Recent LLM-based CV and related works. Comments and contributions welcome! (⭐ 862 · Updated last month)
- The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models." (⭐ 243 · Updated 2 months ago)
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… (⭐ 866 · Updated 5 months ago)
- [ICLR 2025] LAPA: Latent Action Pretraining from Videos (⭐ 235 · Updated 3 months ago)
- CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks (⭐ 539 · Updated 2 months ago)
- Implementation of π₀, the robotic foundation model architecture proposed by Physical Intelligence (⭐ 397 · Updated this week)
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … (⭐ 298 · Updated 4 months ago)
- A Framework of Small-scale Large Multimodal Models (⭐ 800 · Updated 3 weeks ago)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (⭐ 359 · Updated 10 months ago)
- Suite of human-collected datasets and a multi-task continuous control benchmark for open-vocabulary visuolinguomotor learning (⭐ 313 · Updated last week)
- PyTorch implementation of the models RT-1-X and RT-2-X from the paper "Open X-Embodiment: Robotic Learning Datasets and RT-X Models" (⭐ 203 · Updated 2 weeks ago)
- [NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents (⭐ 311 · Updated last year)
- Theia: Distilling Diverse Vision Foundation Models for Robot Learning (⭐ 226 · Updated 3 weeks ago)
- [ECCV 2024 Oral] Code for the paper "An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Langua… (⭐ 413 · Updated 3 months ago)
- When do we not need larger vision models? (⭐ 388 · Updated 2 months ago)