facebookresearch / vjepa2
PyTorch code and models for VJEPA2 self-supervised learning from video.
☆1,331Updated this week
Alternatives and similar repositories for vjepa2
Users who are interested in vjepa2 are comparing it to the repositories listed below.
- Cosmos-Reason1 models understand the physical common sense and generate appropriate embodied decisions in natural language through long c…☆510Updated this week
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!☆1,270Updated 3 weeks ago
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning☆1,170Updated last month
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat…☆1,247Updated last week
- MMaDA - Open-Sourced Multimodal Large Diffusion Language Models☆1,109Updated last week
- A suite of image and video neural tokenizers☆1,638Updated 4 months ago
- Official repo and evaluation implementation of VSI-Bench☆512Updated 3 months ago
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents☆1,712Updated 3 weeks ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities☆893Updated 2 months ago
- Code for the Molmo Vision-Language Model☆506Updated 6 months ago
- Cosmos-Transfer1 is a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environment…☆505Updated last week
- This repo contains the code for the paper "Intuitive physics understanding emerges from self-supervised pretraining on natural videos"☆164Updated 4 months ago
- Implementation of π₀, the robotic foundation model architecture proposed by Physical Intelligence☆441Updated last week
- Heterogeneous Pre-trained Transformer (HPT) as Scalable Policy Learner.☆501Updated 6 months ago
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models☆698Updated 2 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]☆569Updated 3 weeks ago
- Dream 7B, a large diffusion language model☆764Updated last week
- Next-Token Prediction is All You Need☆2,149Updated 3 months ago
- The first behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.☆614Updated last week
- Explore the Multimodal “Aha Moment” on 2B Model☆592Updated 3 months ago
- A flexible and efficient codebase for training visually-conditioned language models (VLMs)☆712Updated 11 months ago
- Compose multimodal datasets 🎹☆413Updated last week
- PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI☆1,157Updated this week
- GRUtopia: Dream General Robots in a City at Scale☆845Updated 2 weeks ago
- Continuous Thought Machines, because thought takes time and reasoning is a process.☆1,026Updated 3 weeks ago
- Re-implementation of pi0 vision-language-action (VLA) model from Physical Intelligence☆952Updated 4 months ago
- A fork to add multimodal model training to open-r1☆1,306Updated 4 months ago
- Official repository for "AM-RADIO: Reduce All Domains Into One"☆1,211Updated 2 weeks ago
- A generative and self-guided robotic agent that endlessly proposes and masters new skills.☆1,003Updated last year
- ☆1,246Updated 6 months ago