ligeng0197 / Awesome-Thinking-With-Images
The latest open-source "Thinking with Images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for fine-grained visual understanding.
☆110 · Updated Aug 21, 2025
Alternatives and similar repositories for Awesome-Thinking-With-Images
Users interested in Awesome-Thinking-With-Images are comparing it to the libraries listed below.
- Interleaving Reasoning: Next-Generation Reasoning Systems for AGI (☆251, updated Oct 17, 2025)
- WisdoMentor Series: an LLM for undergraduates | 博导智言, an assistant for university students' learning (☆12, updated May 9, 2024)
- F-16 is a powerful video large language model (LLM) that perceives high-frame-rate videos, developed by the Department of Electr… (☆34, updated Jul 3, 2025)
- Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization (☆16, updated Jul 20, 2023)
- [ICCV 2023] Simple Baselines for Interactive Video Retrieval with Questions and Answers (☆18, updated Apr 16, 2024)
- A Chinese version of Llama3, obtained from Llama3 through further CPT, SFT, and ORPO (☆17, updated Apr 24, 2024)
- Source code of our TCSVT'22 paper "Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval" (☆19, updated Feb 13, 2022)
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning (☆36, updated Jun 10, 2025)
- Source code of our MM'22 paper "Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning" (☆21, updated Jun 20, 2024)
- An HF (Hugging Face) Trainer that records the losses of different tasks and objectives (☆49, updated Mar 12, 2025)
- R1-Vision: Let's first take a look at the image (☆48, updated Feb 16, 2025)
- [NeurIPS 2025] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing (☆90, updated Jul 27, 2025)
- [AAAI 2025] Open-vocabulary Video Instance Segmentation codebase built upon Detectron2; easy to use (☆25, updated Dec 30, 2024)
- Official pytorch repository for "Knowing Where to Focus: Event-aware Transformer for Video Grounding" (ICCV 2023) (☆55, updated Sep 7, 2023)
- ScalingOpt - Optimization Community (☆78, updated Feb 4, 2026)
- ☆31 (updated Dec 18, 2025)
- Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models (☆925, updated Jan 31, 2026)
- Benchmark for federated noisy label learning (☆25, updated Aug 31, 2024)
- ☆99 (updated Aug 8, 2025)
- ☆27 (updated Aug 16, 2022)
- ☆27 (updated Jul 18, 2025)
- This repository provides a valuable reference for researchers in the field of multimodality; please start your exploratory travel in RL-bas… (☆1,350, updated Dec 7, 2025)
- A curated list of awesome Multimodal studies (☆312, updated Dec 14, 2025)
- [ICCV 2019] Pytorch implementation of "Learning a Mixture of Granularity-Specific Experts for Fine-Grained Categorization" (☆26, updated Apr 9, 2020)
- [ICCV 2025] Pytorch implementation of "VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Pr… (☆48, updated Jul 28, 2025)
- [CVPR 2024 Accepted] TaskWeave: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection (☆29, updated Sep 26, 2024)
- Extends OpenRLHF to support LMM RL training for reproducing DeepSeek-R1 on multimodal tasks (☆840, updated May 14, 2025)
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding (☆210, updated Oct 15, 2025)
- Encodings for neural architecture search (☆29, updated Apr 5, 2021)
- 📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs) (☆979, updated Sep 27, 2025)
- A reading list of papers about Visual Grounding (☆32, updated Aug 24, 2022)
- Revisiting Open World Object Detection (☆78, updated Jun 19, 2022)
- Repository of proposal-free temporal moment localization work (☆33, updated Jun 11, 2024)
- VideoNSA: Native Sparse Attention Scales Video Understanding (☆78, updated Nov 16, 2025)
- ✨✨ [ICLR 2026] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models (☆43, updated Apr 10, 2025)
- An introduction to AWESOME_ENTROPY+LRM_PAPERS (☆30, updated Dec 16, 2025)
- Code for the paper "VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use" [ICLR 2026] (☆154, updated Feb 7, 2026)
- ☆118 (updated Jul 22, 2025)
- Code for the ICLR'24 ME-FoMo workshop paper "How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation" (☆38, updated Oct 18, 2024)