Code for paper: Reinforced Vision Perception with Tools
☆71Oct 3, 2025Updated 5 months ago
Alternatives and similar repositories for REVPT
Users that are interested in REVPT are comparing it to the libraries listed below
Sorting:
- Code accompanying the 2022 DLS paper "Misleading Deep-Fake Detection with GAN Fingerprints"☆10May 26, 2022Updated 3 years ago
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection☆25May 31, 2025Updated 9 months ago
- paper-read-notes☆12Sep 26, 2024Updated last year
- Implementation of Pix2Seq in PyTorch☆10Feb 3, 2022Updated 4 years ago
- Professional desktop app for converting text to audiobooks with local TTS☆30Oct 6, 2025Updated 5 months ago
- [CVPR 2024] Adapting Short-Term Transformers for Action Detection in Untrimmed Videos☆12Jun 11, 2024Updated last year
- Open-vocabulary Semantic Segmentation☆33Feb 16, 2024Updated 2 years ago
- This repository contains a PyTorch implementation of the ICSE'26 paper "Scrub It Out! Erasing Sensitive Memorization in Code Language Mod…☆30Sep 18, 2025Updated 5 months ago
- From Word to World: Can Large Language Models be Implicit Text-based World Models?☆50Dec 25, 2025Updated 2 months ago
- #ICCV, #MoE, #Tracking☆33Jul 11, 2025Updated 7 months ago
- Building an Intelligent AWS Cloud Engineer Agent with Strands Agents SDK☆23Dec 16, 2025Updated 2 months ago
- AdaIFL: Adaptive Image Forgery Localization via a Dynamic and Importance-aware Transformer Network☆16Feb 11, 2025Updated last year
- Improving Math reasoning through Direct Preference Optimization with Verifiable Pairs☆19Mar 20, 2025Updated 11 months ago
- ☆18Mar 1, 2024Updated 2 years ago
- CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms☆25Dec 21, 2025Updated 2 months ago
- ☆75Jun 28, 2025Updated 8 months ago
- [ECCV24] VISA: Reasoning Video Object Segmentation via Large Language Model☆19Jul 20, 2024Updated last year
- ☆21Jan 17, 2025Updated last year
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding☆50Jan 14, 2025Updated last year
- This repository contains the code for the paper - "Aligning Text, Images, and 3D Structure Token-by-Token" (CVPR 2026)☆44Jun 11, 2025Updated 8 months ago
- MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision☆28May 26, 2025Updated 9 months ago
- Official Implementation of "Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning"☆25Dec 16, 2025Updated 2 months ago
- An image retrieval system based on MXNET : From training to website☆18Jul 9, 2019Updated 6 years ago
- [ECCV2024] ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference☆97Mar 26, 2025Updated 11 months ago
- [CVPR'24] Code for Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models☆18Jul 22, 2024Updated last year
- Code for "DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets", accepted at Neurips 2023 (Main confer…☆27Mar 29, 2024Updated last year
- [ECCV 2024] SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding☆64Oct 22, 2024Updated last year
- [NeurIPS 2025] Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning☆286Jul 15, 2025Updated 7 months ago
- [SCIS 2024] The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Di…☆62Nov 7, 2024Updated last year
- DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding☆66Jun 10, 2025Updated 8 months ago
- This repo holds the competitions (information, solutions, summaries, memories) that our team has participated in☆25Feb 4, 2024Updated 2 years ago
- [AAAI 2026 Oral] The official code of "UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning"☆66Dec 8, 2025Updated 3 months ago
- ☆28May 22, 2025Updated 9 months ago
- Offical implementation of "Re-Aligning Language to Visual Objects with an Agentic Workflow"☆31Apr 20, 2025Updated 10 months ago
- A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…☆52Jul 24, 2025Updated 7 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning☆41Updated this week
- MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search too…☆408Aug 26, 2025Updated 6 months ago
- (ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.☆31Aug 7, 2025Updated 7 months ago
- Fastest way to scaffold FastHTML applications.☆36Sep 13, 2025Updated 5 months ago