Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grained visual understanding".
☆111Aug 21, 2025Updated 6 months ago
Alternatives and similar repositories for Awesome-Thinking-With-Images
Users that are interested in Awesome-Thinking-With-Images are comparing it to the libraries listed below
- Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual in…☆1,346Feb 3, 2026Updated last month
- WisdoMentor - Series: An LLM for undergraduates | 博导智言 (learning assistance for university students)☆13May 9, 2024Updated last year
- ☆12Mar 22, 2025Updated 11 months ago
- F-16 is a powerful video large language model (LLM) that perceives high-frame-rate videos, which is developed by the Department of Electr…☆34Jul 3, 2025Updated 8 months ago
- Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization☆16Jul 20, 2023Updated 2 years ago
- ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning. In ICCV, 2021.☆63Nov 18, 2021Updated 4 years ago
- [ICCV 2023] Simple Baselines for Interactive Video Retrieval with Questions and Answers☆19Apr 16, 2024Updated last year
- A Chinese version of Llama3, obtained from Llama3 through further CPT, SFT, and ORPO☆17Apr 24, 2024Updated last year
- Source code of our TCSVT'22 paper Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval☆19Feb 13, 2022Updated 4 years ago
- ☆32Aug 7, 2025Updated 7 months ago
- Dreambooth (LoRA) with well-organized code structure. Naive adaptation from 🤗Diffusers.☆17May 18, 2023Updated 2 years ago
- Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models☆1,001Feb 19, 2026Updated 2 weeks ago
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning☆36Jun 10, 2025Updated 9 months ago
- Streaming Video Instruction Tuning☆52Feb 25, 2026Updated last week
- Source code of our MM'22 paper Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning☆21Jun 20, 2024Updated last year
- R1-Vision: Let's first take a look at the image☆48Feb 16, 2025Updated last year
- ☆25Oct 31, 2024Updated last year
- The trainer for HF to record losses of different tasks and objectives.☆54Mar 12, 2025Updated 11 months ago
- [NeurIPS 2025] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing☆91Jul 27, 2025Updated 7 months ago
- Official pytorch repository for "Knowing Where to Focus: Event-aware Transformer for Video Grounding" (ICCV 2023)☆55Sep 7, 2023Updated 2 years ago
- ☆31Dec 18, 2025Updated 2 months ago
- ☆1,145Nov 20, 2025Updated 3 months ago
- Benchmark for federated noisy label learning☆25Aug 31, 2024Updated last year
- ☆27Aug 16, 2022Updated 3 years ago
- ☆27Jul 18, 2025Updated 7 months ago
- This repository provides valuable reference for researchers in the field of multimodality, please start your exploratory travel in RL-bas…☆1,365Feb 26, 2026Updated last week
- A curated list of awesome Multimodal studies.☆316Dec 14, 2025Updated 2 months ago
- [CVPR 2024 Accepted] TaskWeave: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection☆29Sep 26, 2024Updated last year
- Extend OpenRLHF to support LMM RL training for reproduction of DeepSeek-R1 on multimodal tasks.☆842May 14, 2025Updated 9 months ago
- Encodings for neural architecture search☆29Apr 5, 2021Updated 4 years ago
- 📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM).☆986Sep 27, 2025Updated 5 months ago
- A reading list of papers about Visual Grounding.☆32Aug 24, 2022Updated 3 years ago
- Introduction about AWESOME_ENTROPY+LRM_PAPERS☆30Dec 16, 2025Updated 2 months ago
- 🔥 A curated roadmap to the Efficient VLA landscape. We’re keeping this list live—contribute your latest work!☆85Feb 17, 2026Updated 2 weeks ago
- Repository of proposal-free temporal moment localization work☆33Jun 11, 2024Updated last year
- Scalable DBSCAN and OPTICS for clustering high-dimensional datasets using random projections☆13Nov 1, 2024Updated last year
- ✨✨ [ICLR 2026] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models☆43Apr 10, 2025Updated 10 months ago
- [ICLR 2026] "VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use"☆160Feb 7, 2026Updated last month
- ☆118Jul 22, 2025Updated 7 months ago