hnam-1765 / WriteViT
☆16 Updated 4 months ago
Alternatives and similar repositories for WriteViT
Users who are interested in WriteViT are comparing it to the repositories listed below.
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆28 Updated 2 years ago
- ☆17 Updated 6 months ago
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023 ☆16 Updated last year
- [arXiv: 2505.12307] LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? ☆34 Updated 2 months ago
- Official PyTorch implementation of `[ACMMM 2023] Relational Contrastive Learning for Scene Text Recognition` ☆17 Updated 2 years ago
- [ICME 2023] FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation ☆13 Updated 2 years ago
- ☆36 Updated 2 years ago
- VimTS: A Unified Video and Image Text Spotter ☆79 Updated last year
- [CVPR 2025] DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding ☆23 Updated last month
- ☆95 Updated 10 months ago
- Fully Open Framework for Democratized Multimodal Reinforcement Learning. ☆38 Updated last month
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆66 Updated last year
- The official repo for the technical report "Scalable Mask Annotation for Video Text Spotting" ☆16 Updated 2 years ago
- ☆56 Updated 9 months ago
- [ICCV2025] A Token-level Text Image Foundation Model for Document Understanding ☆129 Updated 5 months ago
- WeGeFT: Weight‑Generative Fine‑Tuning for Multi‑Faceted Efficient Adaptation of Large Models ☆22 Updated 6 months ago
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision ☆86 Updated this week
- Official PyTorch implementation of "No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding" ☆32 Updated last year
- This repository is for the first survey on SAM & SAM2 for Videos. ☆53 Updated 9 months ago
- ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting ☆45 Updated 9 months ago
- [T-PAMI 2025] EMOv2: Pushing 5M Vision Model Frontier ☆54 Updated last year
- The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". … ☆62 Updated last year
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆44 Updated last year
- NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement ☆51 Updated last year
- ☆13 Updated 8 months ago
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning ☆52 Updated 6 months ago
- [PR 2025] The official GitHub page of "MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Ca… ☆75 Updated last month
- ☆46 Updated 11 months ago
- Official implementation of URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding (AAAI 2026… ☆33 Updated 2 months ago
- LiVOS: Light Video Object Segmentation with Gated Linear Matching (CVPR 2025) ☆45 Updated 5 months ago