Tencent-Hunyuan / HunyuanOCR
☆1,495 Updated 2 weeks ago
Alternatives and similar repositories for HunyuanOCR
Users who are interested in HunyuanOCR compare it to the repositories listed below.
- Visual Causal Flow ☆1,306 Updated this week
- ☆925 Updated 2 weeks ago
- ☆869 Updated 3 months ago
- GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning ☆2,145 Updated this week
- Cook up amazing multimodal AI applications effortlessly with MiniCPM-o ☆242 Updated last month
- MAI-UI: Real-World Centric Foundation GUI Agents ranging from 2B to 235B ☆1,560 Updated 2 weeks ago
- ☆326 Updated 2 months ago
- An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/) ☆1,849 Updated 5 months ago
- [CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation ☆1,409 Updated last month
- Qwen-Image-Layered: Layered Decomposition for Inherent Editability ☆1,508 Updated last month
- ☆456 Updated last month
- ☆681 Updated last month
- Official Python toolkit for the Qwen3-ASR API. Parallel high‑throughput calls, robust long‑audio transcription, multi‑sample‑rate support… ☆747 Updated 3 months ago
- Qwen DianJin: LLMs for the Financial Industry by Alibaba Cloud (Tongyi DianJin: Alibaba Cloud's financial large language model) ☆420 Updated this week
- An End-to-End Infrastructure for Training and Evaluating Various LLM Agents ☆674 Updated last week
- ☆193 Updated last month
- A quick vibe coded app for deepseek OCR ☆1,692 Updated 2 months ago
- [KDD'2026] "VideoRAG: Chat with Your Videos" ☆2,605 Updated 3 weeks ago
- MiniMax-M2, a model built for Max coding & agentic workflows. ☆2,298 Updated 2 months ago
- ☆1,759 Updated 4 months ago
- GLM-ASR-Nano: A robust, open-source speech recognition model with 1.5B parameters ☆700 Updated last month
- The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing" ☆574 Updated last week
- Seed-Coder is a family of lightweight open-source code LLMs comprising base, instruct and reasoning models, developed by ByteDance Seed. ☆732 Updated 7 months ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆1,426 Updated 4 months ago
- ☆1,222 Updated 3 months ago
- Video generation via code ☆1,528 Updated 2 months ago
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆1,533 Updated 7 months ago
- Official Code Repo for UniVA: Universal Video Agents ☆305 Updated 2 months ago
- Youtu-Tip: Tap for Intelligence, Keep on Device. ☆549 Updated this week
- Implementation of "Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length" ☆1,524 Updated this week