[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
☆1,479 · Jun 26, 2025 · Updated 9 months ago
Alternatives and similar repositories for describe-anything
Users who are interested in describe-anything are comparing it to the libraries listed below.
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆2,249 · Updated this week
- Official repository for "AM-RADIO: Reduce All Domains Into One" ☆1,759 · Apr 9, 2026 · Updated last week
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆1,570 · Jun 14, 2025 · Updated 10 months ago
- Official Repo For Pixel-LLM Codebase: Sa2VA (Arxiv-25), SAMTok (CVPR-26), VRT, SaSaSa2VA (1-st solution for LSVOS) ☆1,585 · Feb 27, 2026 · Updated last month
- Open-source unified multimodal model ☆5,831 · Oct 27, 2025 · Updated 5 months ago
- The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained mode… ☆18,968 · Apr 7, 2026 · Updated last week
- Solve Visual Understanding with Reinforced VLMs ☆5,939 · Mar 12, 2026 · Updated last month
- Official implementation of BLIP3o-Series ☆1,648 · Nov 29, 2025 · Updated 4 months ago
- ☆4,638 · Updated this week
- MAGI-1: Autoregressive Video Generation at Scale ☆3,677 · Jun 17, 2025 · Updated 10 months ago
- Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. ☆18,961 · Jan 30, 2026 · Updated 2 months ago
- Official repo and evaluation implementation of VSI-Bench ☆695 · Aug 5, 2025 · Updated 8 months ago
- [NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling ☆4,508 · Sep 26, 2025 · Updated 6 months ago
- Scaling Vision Pre-Training to 4K Resolution ☆223 · Jan 4, 2026 · Updated 3 months ago
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou… ☆3,788 · Mar 12, 2026 · Updated last month
- Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2 ☆3,442 · Nov 11, 2025 · Updated 5 months ago
- This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects. ☆1,414 · Aug 4, 2025 · Updated 8 months ago
- Stable Virtual Camera: Generative View Synthesis with Diffusion Models ☆1,589 · Mar 3, 2026 · Updated last month
- Reference PyTorch implementation and models for DINOv3 ☆10,145 · Mar 30, 2026 · Updated 2 weeks ago
- [ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection" ☆10,010 · Aug 12, 2024 · Updated last year
- [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance ☆9,972 · Sep 22, 2025 · Updated 6 months ago
- A SOTA open-source image editing model, which aims to provide comparable performance against the closed-source models like GPT-4o and Gem… ☆2,182 · Dec 29, 2025 · Updated 3 months ago
- Next-Token Prediction is All You Need ☆2,396 · Jan 12, 2026 · Updated 3 months ago
- Open source repo for Locate 3D Model, 3D-JEPA and Locate 3D Dataset ☆431 · Jun 3, 2025 · Updated 10 months ago
- Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning' ☆2,261 · Oct 29, 2025 · Updated 5 months ago
- [ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation. ☆1,910 · Jan 8, 2026 · Updated 3 months ago
- [CVPR 2025] Prompt Depth Anything ☆1,096 · Jan 29, 2026 · Updated 2 months ago
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆1,140 · Aug 14, 2025 · Updated 8 months ago
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ☆951 · Aug 5, 2025 · Updated 8 months ago
- DINO-X: The World's Top-Performing Vision Model for Open-World Object Detection and Understanding ☆1,362 · Jul 23, 2025 · Updated 8 months ago
- [CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer ☆12,866 · Mar 3, 2026 · Updated last month
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,993 · Nov 7, 2025 · Updated 5 months ago
- Project Page for "LISA: Reasoning Segmentation via Large Language Model" ☆2,625 · Feb 16, 2025 · Updated last year
- ☆1,054 · May 14, 2025 · Updated 11 months ago
- Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and … ☆17,513 · Sep 5, 2024 · Updated last year
- The code for PixelRefer & VideoRefer ☆345 · Nov 16, 2025 · Updated 5 months ago
- [CVPR 2025] Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass ☆1,535 · May 7, 2025 · Updated 11 months ago
- PyTorch code and models for the DINOv2 self-supervised learning method. ☆12,698 · Apr 8, 2026 · Updated last week
- LLM2CLIP significantly improves already state-of-the-art CLIP models. ☆646 · Feb 1, 2026 · Updated 2 months ago