Ucas-HaoranWei/Vary-toy

Official code implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)

☆630

Alternatives and similar repositories for Vary-toy

Users interested in Vary-toy are comparing it to the repositories listed below.

Ucas-HaoranWei / Vary-family
☆57 · Jan 23, 2024 · Updated 2 years ago
Ucas-HaoranWei / Vary
[ECCV 2024] Official code implementation of Vary: Scaling Up the Vision Vocabulary of Large Vision Language Models.
☆1,889 · Dec 30, 2024 · Updated last year
Ucas-HaoranWei / Vary-tiny-600k
Vary-tiny codebase built on LAVIS (for training from scratch) and a PDF image-text pair dataset (about 600k pairs, English and Chinese)
☆89 · Sep 21, 2024 · Updated last year
ucaslcl / Fox
Official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"
☆196 · May 31, 2024 · Updated 2 years ago
LingyvKong / OneChart
[ACM'MM 2024 Oral] Official code for "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token"
☆266 · Apr 14, 2025 · Updated last year
PKU-YuanGroup / MoE-LLaVA
【TMM 2025🔥】 Mixture-of-Experts for Large Vision-Language Models
☆2,322 · Jul 15, 2025 · Updated last year
Ucas-HaoranWei / Aircraft-KP
Keypoint dataset for airplanes
☆10 · Dec 28, 2019 · Updated 6 years ago
TinyLLaVA / TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
☆992 · Updated this week
MILVLG / imp
A family of highly capable yet efficient large multimodal models
☆194 · Aug 23, 2024 · Updated last year
Meituan-AutoML / MobileVLM
Strong and Open Vision Language Assistant for Mobile Devices
☆1,364 · Apr 15, 2024 · Updated 2 years ago
Yuliang-Liu / Monkey
Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models (CVPR 2024 Highlight)
☆1,949 · Jun 2, 2026 · Updated last month
DLYuanGod / TinyGPT-V
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
☆1,316 · Feb 5, 2026 · Updated 5 months ago
JIA-Lab-research / MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
☆3,330 · May 4, 2024 · Updated 2 years ago
QwenLM / Qwen-VL
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
☆6,711 · Aug 7, 2024 · Updated last year
OpenGVLab / InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
☆10,097 · Sep 22, 2025 · Updated 10 months ago
zai-org / CogVLM
A state-of-the-art open visual language model | multimodal pretrained model
☆6,745 · May 29, 2024 · Updated 2 years ago
BAAI-DCAI / Bunny
A family of lightweight multimodal models.
☆1,052 · Nov 18, 2024 · Updated last year
InternLM / InternLM-XComposer
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
☆2,921 · May 26, 2025 · Updated last year
OpenBMB / MiniCPM
MiniCPM5-1B: A SOTA 1B on-device LLM, small yet powerful.
☆9,986 · Jun 20, 2026 · Updated last month
JIA-Lab-research / LLaMA-VID
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)
☆861 · Jul 29, 2024 · Updated last year
LinkSoul-AI / Chinese-LLaVA
An open-source, commercially usable multimodal model that supports bilingual (Chinese and English) visual-text dialogue.
☆378 · Sep 23, 2023 · Updated 2 years ago
haotian-liu / LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
☆24,932 · Aug 12, 2024 · Updated last year
FreedomIntelligence / ALLaVA
Harnessing 1.4M GPT4V-synthesized Data for A Lite Vision-Language Model
☆281 · Jun 25, 2024 · Updated 2 years ago
LLaVA-VL / LLaVA-NeXT
☆4,710 · Jun 15, 2026 · Updated last month
thunlp / LLaVA-UHD
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
☆423 · Jul 6, 2026 · Updated 2 weeks ago
X-PLUG / mPLUG-Owl
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
☆2,535 · Apr 2, 2025 · Updated last year
shenyunhang / APE
[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
☆609 · May 8, 2024 · Updated 2 years ago
UCSB-AI / MiniGPT-5
Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
☆866 · May 8, 2025 · Updated last year
LLaVA-VL / LLaVA-Plus-Codebase
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
☆769 · Feb 1, 2024 · Updated 2 years ago
OpenBMB / MiniCPM-V
A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone
☆25,965 · Jun 25, 2026 · Updated 3 weeks ago
NExT-ChatV / NExT-Chat
The code of the paper "NExT-Chat: An LMM for Chat, Detection and Segmentation".
☆253 · Feb 5, 2024 · Updated 2 years ago
X-PLUG / mPLUG-DocOwl
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
☆2,408 · May 30, 2025 · Updated last year
PJLab-ADG / limsim-plus
Academic page for LimSim++
☆11 · Mar 19, 2024 · Updated 2 years ago
salesforce / LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
☆11,255 · Jun 2, 2026 · Updated last month
mbzuai-oryx / groundingLMM
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha…
☆963 · Aug 5, 2025 · Updated 11 months ago
Yuliang-Liu / MultimodalOCR
On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
☆870 · Updated this week
Ucas-HaoranWei / GOT-OCR2.0
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
☆8,155 · Feb 10, 2025 · Updated last year
Ucas-HaoranWei / Slow-Perception
Official code implementation of Slow Perception: Let's Perceive Geometric Figures Step-by-step
☆161 · Jul 28, 2025 · Updated 11 months ago
BradyFU / Awesome-Multimodal-Large-Language-Models
Latest Advances on Multimodal Large Language Models
☆17,954 · Jul 2, 2026 · Updated 2 weeks ago