apple / ml-fastvlm
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
☆6,746 · Updated 5 months ago
Alternatives and similar repositories for ml-fastvlm
Users interested in ml-fastvlm are comparing it to the libraries listed below.
- Run LLMs with MLX ☆2,594 · Updated this week
- A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speec… ☆2,736 · Updated 2 weeks ago
- The simplest, fastest repository for training/finetuning small-sized VLMs. ☆4,100 · Updated last month
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ☆1,685 · Updated this week
- Open-source unified multimodal model ☆5,118 · Updated last month
- Real-time webcam demo with SmolVLM and llama.cpp server ☆4,777 · Updated 5 months ago
- Everything about the SmolLM and SmolVLM family of models ☆3,300 · Updated 3 weeks ago
- Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audi… ☆8,978 · Updated last week
- Fast and accurate automatic speech recognition (ASR) for edge devices ☆2,903 · Updated last month
- This repository contains the official implementation of the research papers "MobileCLIP" (CVPR 2024) and "MobileCLIP2" (TMLR, August 2025) ☆1,245 · Updated 3 weeks ago
- RF-DETR is a real-time object detection and segmentation model architecture developed by Roboflow, SOTA on COCO and designed for fine-tun… ☆3,582 · Updated last week
- Examples using MLX Swift ☆2,241 · Updated this week
- [NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling ☆4,026 · Updated 2 weeks ago
- OmniGen2: Exploration to Advanced Multimodal Generation. ☆3,890 · Updated 2 weeks ago
- Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, im… ☆2,488 · Updated 2 weeks ago
- PyTorch code and models for VJEPA2 self-supervised learning from video. ☆2,269 · Updated last month
- MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. ☆2,915 · Updated 3 months ago
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents ☆1,821 · Updated last week
- A unified library for object tracking featuring clean-room re-implementations of leading multi-object tracking algorithms ☆2,146 · Updated last week
- Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing. ☆5,583 · Updated last week
- Qwen3-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud. ☆13,936 · Updated this week
- Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and se… ☆3,927 · Updated this week
- Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework. ☆2,436 · Updated 3 weeks ago
- SoTA open-source TTS ☆13,774 · Updated 2 weeks ago
- Get started with building Fullstack Agents using Gemini 2.5 and LangGraph ☆17,013 · Updated last month
- Kernels & AI inference engine for phone chips ☆3,423 · Updated this week
- Renderer for the harmony response format to be used with gpt-oss ☆3,879 · Updated last month
- Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆3,697 · Updated 4 months ago
- Wan: Open and Advanced Large-Scale Video Generative Models ☆9,774 · Updated 3 weeks ago
- Official inference framework for 1-bit LLMs ☆24,130 · Updated 4 months ago