apple / ml-fastvlm
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
☆7,029 Updated 7 months ago
Alternatives and similar repositories for ml-fastvlm
Users who are interested in ml-fastvlm are comparing it to the libraries listed below.
- Everything about the SmolLM and SmolVLM family of models ☆3,445 Updated 3 weeks ago
- Real-time webcam demo with SmolVLM and llama.cpp server ☆4,838 Updated 7 months ago
- The simplest, fastest repository for training/finetuning small-sized VLMs. ☆4,351 Updated last month
- Open-source unified multimodal model ☆5,444 Updated last month
- Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. ☆16,985 Updated 2 weeks ago
- This repository contains the official implementation of the research papers, "MobileCLIP" CVPR 2024 and "MobileCLIP2" TMLR August 2025 ☆1,332 Updated 2 months ago
- Text-audio foundation model from Boson AI ☆7,715 Updated 2 months ago
- StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language mo… ☆4,136 Updated last month
- Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation ☆4,376 Updated 5 months ago
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ☆1,910 Updated last week
- Examples using MLX Swift ☆2,333 Updated last week
- Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, im… ☆3,062 Updated 2 months ago
- ☆6,046 Updated 3 months ago
- A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speec… ☆3,010 Updated this week
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents ☆1,872 Updated 2 months ago
- Run LLMs with MLX ☆3,003 Updated this week
- RF-DETR is a real-time object detection and segmentation model architecture developed by Roboflow, SOTA on COCO and designed for fine-tun… ☆4,624 Updated last month
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆3,828 Updated 6 months ago
- ☆4,569 Updated 6 months ago
- The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention ☆3,248 Updated 5 months ago
- A unified library for object tracking featuring clean room re-implementations of leading multi-object tracking algorithms ☆2,184 Updated last week
- Toolkit for linearizing PDFs for LLM datasets/training ☆16,165 Updated last week
- gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI ☆19,360 Updated last month
- Contexts Optical Compression ☆21,287 Updated last month
- A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen. ☆3,466 Updated last month
- Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and se… ☆4,445 Updated last week
- MAGI-1: Autoregressive Video Generation at Scale ☆3,576 Updated 5 months ago
- State-of-the-art TTS model under 25MB 😻 ☆9,218 Updated 3 months ago
- Renderer for the harmony response format to be used with gpt-oss ☆4,050 Updated last month
- [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆1,423 Updated 5 months ago