apple / ml-fastvlm
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
☆7,180 Updated 8 months ago
Alternatives and similar repositories for ml-fastvlm
Users who are interested in ml-fastvlm are comparing it to the repositories listed below.
- Run LLMs with MLX ☆3,412 Updated last week
- The simplest, fastest repository for training/finetuning small-sized VLMs. ☆4,589 Updated 3 months ago
- Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, im… ☆3,326 Updated 3 weeks ago
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ☆2,067 Updated this week
- Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. ☆18,066 Updated this week
- Everything about the SmolLM and SmolVLM family of models ☆3,579 Updated 2 weeks ago
- This repository contains the official implementation of the research papers, "MobileCLIP" CVPR 2024 and "MobileCLIP2" TMLR August 2025 ☆1,399 Updated 3 months ago
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆3,903 Updated 7 months ago
- [NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling ☆4,204 Updated 4 months ago
- Text-audio foundation model from Boson AI ☆7,879 Updated last week
- Renderer for the harmony response format to be used with gpt-oss ☆4,159 Updated last month
- Open-source unified multimodal model ☆5,601 Updated 3 months ago
- RF-DETR is a real-time object detection and segmentation model architecture developed by Roboflow, SOTA on COCO and designed for fine-tun… ☆5,301 Updated this week
- Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing. ☆7,153 Updated last month
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents ☆1,890 Updated last week
- New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos ☆8,088 Updated 3 weeks ago
- [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆1,446 Updated 7 months ago
- MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. ☆3,044 Updated 6 months ago
- A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speec… ☆3,595 Updated this week
- Reference PyTorch implementation and models for DINOv3 ☆9,393 Updated 2 months ago
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆2,113 Updated last week
- Witness the aha moment of VLM with less than $3. ☆4,025 Updated 8 months ago
- Multilingual Document Layout Parsing in a Single Vision-Language Model ☆7,090 Updated last month
- The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trai… ☆3,183 Updated 3 weeks ago
- Chat with private and local large language models ☆2,195 Updated 8 months ago
- Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation ☆4,460 Updated 7 months ago
- Real-time webcam demo with SmolVLM and llama.cpp server ☆5,505 Updated 8 months ago
- MAGI-1: Autoregressive Video Generation at Scale ☆3,635 Updated 7 months ago
- Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and se… ☆4,551 Updated last week
- A TTS model capable of generating ultra-realistic dialogue in one pass. ☆19,064 Updated 2 months ago