ChenAnno / Real20M_ACMMM2023
Official implementation for "Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval"
☆26 Updated last month
Alternatives and similar repositories for Real20M_ACMMM2023
Users who are interested in Real20M_ACMMM2023 are comparing it to the libraries listed below
- Official implementation for "FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval" ☆19 Updated last month
- Official implementation for "SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback" ☆17 Updated last month
- [NeurIPS2023] Exploring Diverse In-Context Configurations for Image Captioning ☆42 Updated last year
- [CVPR 2025] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant ☆172 Updated 4 months ago
- ☆80 Updated last year
- Latest Advances on (RL based) Multimodal Reasoning and Generation in Multimodal Large Language Models ☆43 Updated last month
- A paper list about Token Merge, Reduce, Resample, Drop for MLLMs. ☆75 Updated last month
- A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability ☆103 Updated last year
- ✨✨[AAAI 2026] This is the official implementation of our paper "QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Vi… ☆73 Updated 7 months ago
- Latest open-source "Thinking with images" (O3/O4-mini) papers, covering training-free, SFT-based, and RL-enhanced methods for "fine-grain… ☆103 Updated 3 months ago
- ☆25 Updated last year
- The official implementation of "MLLMs-Augmented Visual-Language Representation Learning" ☆31 Updated last year
- ☆151 Updated 9 months ago
- R1-like Video-LLM for Temporal Grounding ☆125 Updated 5 months ago
- Official repository for CoMM Dataset ☆48 Updated 11 months ago
- [ACM MM 2025] TimeChat-online: 80% Visual Tokens are Naturally Redundant in Streaming Videos ☆94 Updated 2 months ago
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency ☆58 Updated 5 months ago
- [ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling ☆138 Updated 3 months ago
- [ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models ☆57 Updated 6 months ago
- Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input ☆67 Updated last year
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer ☆49 Updated last year
- Code for paper "LLMs Can Evolve Continually on Modality for X-Modal Reasoning" NeurIPS2024 ☆40 Updated 11 months ago
- [ACL’24 Findings] Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives ☆45 Updated 4 months ago
- ☆11 Updated 8 months ago
- [NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations ☆141 Updated last year
- [CVPR2025] Number it: Temporal Grounding Videos like Flipping Manga ☆131 Updated last month
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos ☆44 Updated last year
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs ☆51 Updated 8 months ago
- [ICCV 2023] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption ☆102 Updated 2 years ago
- (CVPR 2025) PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction ☆134 Updated 9 months ago