lucasjinreal / Namo-R1
A real-time CPU VLM with 500M parameters. Surpasses Moondream2 and SmolVLM. Train from scratch with ease.
☆183Updated last month
Alternatives and similar repositories for Namo-R1:
Users who are interested in Namo-R1 compare it to the repositories listed below.
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources☆169Updated 2 weeks ago
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding☆182Updated 2 months ago
- The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".☆238Updated last year
- [ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance☆91Updated 9 months ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer☆374Updated this week
- [CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness☆351Updated last month
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"☆203Updated 10 months ago
- Quick exploration into fine tuning florence 2☆308Updated 7 months ago
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊☆267Updated 2 months ago
- [EMNLP 2024] RWKV-CLIP: A Robust Vision-Language Representation Learner☆132Updated 3 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models☆155Updated 3 months ago
- official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"☆144Updated 10 months ago
- R1-onevision, a visual language model capable of deep CoT reasoning.☆500Updated last week
- LLM2CLIP makes SOTA pretrained CLIP models more SOTA than ever.☆506Updated 3 weeks ago
- [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text☆338Updated last month
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.☆229Updated last month
- A Token-level Text Image Foundation Model for Document Understanding☆89Updated 3 weeks ago
- Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed☆88Updated 5 months ago
- Research Code for Multimodal-Cognition Team in Ant Group☆142Updated 9 months ago
- This is the first paper to explore how to effectively use RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-sta…☆510Updated last week
- [CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for C…☆240Updated 3 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"☆178Updated 3 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines☆122Updated 5 months ago
- Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion☆310Updated last month
- Official repo of Griffon series including v1(ECCV 2024), v2, and G☆196Updated 3 weeks ago
- MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training)☆135Updated 2 months ago