huizhang0110 / catvision

A large multimodal model that performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the open-source Qwen-VL-7B-Chat.

☆14

Alternatives and similar repositories for catvision

Users who are interested in catvision are comparing it to the repositories listed below.

TencentARC-QQ / QA-CLIP
Chinese CLIP models with SOTA performance.
☆59 Updated 2 years ago
Letian2003 / MM_INF
An efficient multi-modal instruction-following data synthesis tool and the official implementation of Oasis https://arxiv.org/abs/2503.08…
☆32 Updated 4 months ago
Xiaomeng-Yang / STR_benchmark_cleansed
☆14 Updated 2 years ago
onealwj / MVLT
PyTorch implementation of the BMVC 2022 paper "Masked Vision-Language Transformers for Scene Text Recognition"
☆29 Updated 2 years ago
Ucas-HaoranWei / Vary-family
☆57 Updated last year
360CVGroup / 360VL
Our 2nd-gen LMM
☆34 Updated last year
iFLYTEK-CV / EDU-CHEMC
A handwritten chemical structure image dataset named EDU-CHEMC, which consists of a total of 52,987 handwritten molecular structure images …
☆14 Updated 5 months ago
PkuDavidGuan / TIoU-metric-python3
TIoU metric in Python 3. Forked from https://github.com/Yuliang-Liu/TIoU-metric.
☆26 Updated 5 years ago
PkuDavidGuan / CurvedSynthText
☆41 Updated 5 years ago
lucasjinreal / MLLM_Factory
A dead-simple, modularized multi-modal training and fine-tuning framework. Compatible with any LLaVA/Flamingo/QwenVL/MiniGemini etc series …
☆19 Updated last year
namtuanly / WikiTableSet
WikiTableSet: the largest publicly available image-based table recognition dataset in three languages, built from Wikipedia
☆31 Updated 4 months ago
Pay20Y / PIMNet
☆16 Updated 3 years ago
thu-ml / zh-clip
☆72 Updated 2 years ago
JianqiangWan / VLPT-STD
Vision-Language Pre-Training for Boosting Scene Text Detectors (CVPR2022)
☆12 Updated 3 years ago
PCIResearch / TransCore-M
Large Multimodal Model
☆15 Updated last year
BytedanceDouyinContent / SAIL-VL2
The SAIL-VL2 series model developed by the BytedanceDouyinContent Group
☆71 Updated last month
opendatalab / MLLM-DataEngine
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
☆48 Updated last year
nengelmann / Fuyu-8B---Exploration
Exploration of Adept's multimodal Fuyu-8B model. 🤓 🔍
☆27 Updated last year
BIGBALLON / UME-Search
Toward Universal Multimodal Embedding
☆64 Updated 3 months ago
weijiawu / TransDETR
[IJCV 2024] TransDETR: End-to-end Video Text Spotting with Transformer
☆104 Updated last year
BADBADBADBOY / baipiaoOCR
Convert PaddleOCR to torchOCR: ppocr-v3, ppocr-v4, ONNX, OpenVINO
☆33 Updated 2 years ago
bytedance / MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering. A comprehensive evaluation of multimodal large model multilingua…
☆63 Updated 5 months ago
Yuliang-Liu / VimTS
VimTS: A Unified Video and Image Text Spotter
☆78 Updated 11 months ago
deepglint / RealSyn
[ACM MM2025] The official repository for the RealSyn dataset
☆37 Updated 3 months ago
scenarios / WeMM
☆87 Updated last year
whlscut / DocLayLLM
[CVPR 2025] DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
☆22 Updated 7 months ago
TencentARC / BTS
BTS: A Bi-lingual Benchmark for Text Segmentation in the Wild
☆32 Updated last year
linrongc / solution_youtube8m_v3
3rd-place solution in the 3rd YouTube-8M Video Understanding Challenge
☆16 Updated 5 years ago
MonolithFoundation / Bumblebee
A simple MLLM that surpasses QwenVL-Max using only open-source data, with a 14B LLM.
☆38 Updated last year
zhaominyiz / STIRER
STIRER: A Unified Model for Low-Resolution Scene Text Image Recovery and Recognition -- ACMMM 2023
☆14 Updated 10 months ago