apple / ml-fastvlm
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025
☆5,717Updated 3 months ago
Alternatives and similar repositories for ml-fastvlm
Users who are interested in ml-fastvlm are comparing it to the repositories listed below.
- The simplest, fastest repository for training/finetuning small-sized VLMs.☆3,966Updated this week
- Everything about the SmolLM and SmolVLM family of models☆3,168Updated 3 weeks ago
- Renderer for the harmony response format to be used with gpt-oss☆3,699Updated 2 weeks ago
- A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speec…☆2,616Updated last week
- Kimi K2 is the large language model series developed by Moonshot AI team☆7,897Updated last week
- [CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents☆1,793Updated 3 months ago
- Run LLMs with MLX☆1,763Updated this week
- Open-source unified multimodal model☆4,925Updated last week
- PyTorch code and models for VJEPA2 self-supervised learning from video.☆2,124Updated this week
- OmniGen2: Exploration to Advanced Multimodal Generation.☆3,771Updated last month
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe…☆3,570Updated 2 months ago
- Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation☆4,204Updated 2 months ago
- Real-time webcam demo with SmolVLM and llama.cpp server☆4,594Updated 3 months ago
- Wan: Open and Advanced Large-Scale Video Generative Models☆4,655Updated last week
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.☆1,610Updated this week
- This repository contains the official implementation of the research papers, "MobileCLIP" CVPR 2024 and "MobileCLIP2" TMLR August 2025☆1,073Updated this week
- New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos☆8,060Updated 2 months ago
- RF-DETR is a real-time object detection model architecture developed by Roboflow, SOTA on COCO and designed for fine-tuning.☆2,872Updated last week
- A unified library for object tracking featuring clean room re-implementations of leading multi-object tracking algorithms☆2,083Updated this week
- Official repository for LTX-Video☆7,827Updated last month
- Text-audio foundation model from Boson AI☆7,071Updated 3 weeks ago
- MAGI-1: Autoregressive Video Generation at Scale☆3,459Updated 2 months ago
- [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning☆1,318Updated 2 months ago
- Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation—where one waits f…☆1,259Updated 4 months ago
- Qwen-Image is a powerful image generation foundation model capable of complex text rendering and precise image editing.☆4,595Updated last week
- Fast and accurate automatic speech recognition (ASR) for edge devices☆2,844Updated 3 months ago
- ☆5,799Updated last week
- SpatialLM: Training Large Language Models for Structured Indoor Modeling☆3,903Updated this week
- The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention☆3,132Updated last month
- Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audi…☆8,825Updated last week