Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grained visual understanding".
☆113 · Aug 21, 2025 · Updated 9 months ago
Alternatives and similar repositories for Awesome-Thinking-With-Images
Users that are interested in Awesome-Thinking-With-Images are comparing it to the libraries listed below.
- Uses the fastrtc framework to call qwen-2.5-omni-realtime for real-time voice, video, and more ☆14 · Jun 27, 2025 · Updated 11 months ago
- F-16 is a powerful video large language model (LLM) that perceives high-frame-rate videos, which is developed by the Department of Electr… ☆37 · Jul 3, 2025 · Updated 11 months ago
- Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual in… ☆1,485 · Mar 9, 2026 · Updated 3 months ago
- Interleaving Reasoning: Next-Generation Reasoning Systems for AGI ☆280 · Jun 5, 2026 · Updated 2 weeks ago
- ☆12 · Mar 22, 2025 · Updated last year
- WisdoMentor - Series: An LLM for undergraduates | 博导智言 (assisting undergraduate study) ☆13 · May 9, 2024 · Updated 2 years ago
- A Chinese version of Llama3, obtained from Llama3 through further CPT, SFT, and ORPO ☆16 · Apr 24, 2024 · Updated 2 years ago
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning ☆36 · Jun 10, 2025 · Updated last year
- Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models ☆1,150 · Jun 1, 2026 · Updated 2 weeks ago
- The trainer for HF to record losses of different tasks and objectives. ☆54 · Mar 12, 2025 · Updated last year
- Streaming Video Instruction Tuning ☆75 · Feb 25, 2026 · Updated 3 months ago
- ☆43 · Jul 1, 2025 · Updated 11 months ago
- [AAAI 2025] Open-vocabulary Video Instance Segmentation Codebase built upon Detectron2, which is really easy to use. ☆26 · Dec 30, 2024 · Updated last year
- ☆19 · Sep 19, 2024 · Updated last year
- Official pytorch repository for "Knowing Where to Focus: Event-aware Transformer for Video Grounding" (ICCV 2023) ☆55 · Sep 7, 2023 · Updated 2 years ago
- ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021. ☆64 · Nov 18, 2021 · Updated 4 years ago
- Code for Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model ☆13 · Feb 15, 2024 · Updated 2 years ago
- A Practical Zoom-in GUI Grounding and Behavior-Based Evaluation method. ☆25 · Dec 8, 2025 · Updated 6 months ago
- Source code of our TCSVT'22 paper Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval ☆19 · Feb 13, 2022 · Updated 4 years ago
- Official implementation of the paper “Endowing Vision-Language Models with System 2 Thinking for Fine-Grained Visual Recognition,” AAAI 2… ☆42 · Jan 30, 2026 · Updated 4 months ago
- HEtero-Assists Distillation for Heterogeneous Object Detectors ☆10 · Jul 3, 2023 · Updated 2 years ago
- ☆22 · Sep 16, 2025 · Updated 9 months ago
- ☆1,237 · Nov 20, 2025 · Updated 6 months ago
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization ☆104 · Jan 30, 2024 · Updated 2 years ago
- Introduction about AWESOME_ENTROPY+LRM_PAPERS ☆31 · Dec 16, 2025 · Updated 6 months ago
- A curated list of awesome Multimodal studies. ☆340 · Jun 12, 2026 · Updated last week
- ☆27 · Aug 16, 2022 · Updated 3 years ago
- A Python implementation of an agent swarm system that works with local LLM servers. The system allows you to create multiple agents that … ☆13 · Nov 20, 2024 · Updated last year
- R1-Vision: Let's first take a look at the image ☆48 · Feb 16, 2025 · Updated last year
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding ☆214 · Oct 15, 2025 · Updated 8 months ago
- This repository provides valuable reference for researchers in the field of multimodality, please start your exploratory travel in RL-bas… ☆1,424 · May 11, 2026 · Updated last month
- [ICLR 2026] "VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use" ☆193 · Mar 20, 2026 · Updated 2 months ago
- ☆14 · Jan 5, 2022 · Updated 4 years ago
- The code for paper: "DC-Net: Divide-and-Conquer for Salient Object Detection" ☆21 · Aug 30, 2024 · Updated last year
- 📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM). ☆1,024 · Sep 27, 2025 · Updated 8 months ago
- ☆99 · Mar 13, 2026 · Updated 3 months ago
- Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals ☆11 · Jan 8, 2026 · Updated 5 months ago
- ☆33 · Dec 18, 2025 · Updated 6 months ago
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆63 · Dec 5, 2024 · Updated last year