facebookresearch / perception_models
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!
☆548 Updated this week
Alternatives and similar repositories for perception_models:
Users interested in perception_models are comparing it to the libraries listed below.
- Official repository for "AM-RADIO: Reduce All Domains Into One" ☆1,117 Updated last week
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆299 Updated this week
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆390 Updated this week
- Scaling Vision Pre-Training to 4K Resolution ☆124 Updated 3 weeks ago
- Code for the Molmo Vision-Language Model ☆377 Updated 4 months ago
- This repo contains the code for the paper "Intuitive physics understanding emerges from self-supervised pretraining on natural videos" ☆142 Updated 2 months ago
- Official Implementation for our NeurIPS 2024 paper, "Don't Look Twice: Run-Length Tokenization for Faster Video Transformers". ☆206 Updated 3 weeks ago
- When do we not need larger vision models? ☆388 Updated 2 months ago
- Code for Scaling Language-Free Visual Representation Learning (WebSSL) ☆245 Updated this week
- Official repo and evaluation implementation of VSI-Bench ☆463 Updated last month
- Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model" ☆397 Updated last month
- This repo contains the code for 1D tokenizer and generator ☆838 Updated last month
- Python Library to evaluate VLM models' robustness across diverse benchmarks ☆201 Updated this week
- SEED-Voken: A Series of Powerful Visual Tokenizers ☆868 Updated 2 months ago
- [NeurIPS 2024] Code release for "Segment Anything without Supervision" ☆461 Updated 6 months ago
- Compose multimodal datasets 🎹 ☆351 Updated this week
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆322 Updated last month
- An open source implementation of CLIP (With TULIP Support) ☆132 Updated last month
- [ECCV 2024] Official PyTorch implementation of RoPE-ViT "Rotary Position Embedding for Vision Transformer" ☆315 Updated 4 months ago
- [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆319 Updated 9 months ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video] ☆469 Updated this week
- Cosmos-Transfer1 is a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environment… ☆366 Updated this week
- [ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface" ☆346 Updated 3 months ago
- Code for MetaMorph Multimodal Understanding and Generation via Instruction Tuning ☆123 Updated this week
- [CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for C… ☆240 Updated 3 months ago
- A suite of image and video neural tokenizers ☆1,614 Updated 2 months ago
- Official implementation of paper: SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training ☆265 Updated 2 months ago
- LLM2CLIP makes SOTA pretrained CLIP model more SOTA ever. ☆506 Updated last month
- Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ☆325 Updated 2 weeks ago
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ☆477 Updated this week