facebookresearch / perception_models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
☆1,071 Updated last week
Alternatives and similar repositories for perception_models
Users who are interested in perception_models are comparing it to the libraries listed below.
- Official repository for "AM-RADIO: Reduce All Domains Into One" ☆1,149 Updated 3 weeks ago
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆1,065 Updated last week
- A suite of image and video neural tokenizers ☆1,622 Updated 3 months ago
- Code for the Molmo Vision-Language Model ☆413 Updated 5 months ago
- This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects. ☆1,281 Updated 3 weeks ago
- Official repo and evaluation implementation of VSI-Bench ☆481 Updated 2 months ago
- Code for Scaling Language-Free Visual Representation Learning (WebSSL) ☆244 Updated 3 weeks ago
- Famous Vision Language Models and Their Architectures ☆824 Updated 2 months ago
- OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning ☆214 Updated this week
- This repo contains the code for 1D tokenizer and generator ☆868 Updated last month
- Compose multimodal datasets 🎹 ☆371 Updated 3 weeks ago
- When do we not need larger vision models? ☆392 Updated 3 months ago
- Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation ☆1,750 Updated 9 months ago
- Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI ☆1,093 Updated this week
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video] ☆515 Updated this week
- [NeurIPS 2024] Code release for "Segment Anything without Supervision" ☆464 Updated this week
- DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding ☆1,033 Updated 3 weeks ago
- Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs ☆772 Updated 2 weeks ago
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ☆595 Updated last month
- SEED-Voken: A Series of Powerful Visual Tokenizers ☆878 Updated 2 months ago
- Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series ☆943 Updated 3 months ago
- A flexible and efficient codebase for training visually-conditioned language models (VLMs) ☆675 Updated 10 months ago
- LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever. ☆513 Updated last month
- [ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation. ☆1,396 Updated 2 weeks ago
- This repo contains the code for the paper "Intuitive physics understanding emerges from self-supervised pretraining on natural videos" ☆154 Updated 2 months ago
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆529 Updated this week
- Efficient Track Anything ☆541 Updated 4 months ago
- Next-Token Prediction is All You Need ☆2,121 Updated last month
- ☆515 Updated 6 months ago
- Hiera: A fast, powerful, and simple hierarchical vision transformer. ☆982 Updated last year