Research Code for Multimodal-Cognition Team in Ant Group
☆173Oct 14, 2025Updated 5 months ago
Alternatives and similar repositories for Ant-Multi-Modal-Framework
Users that are interested in Ant-Multi-Modal-Framework are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning☆46Jul 17, 2025Updated 8 months ago
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"☆154Sep 10, 2024Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P…☆25Nov 28, 2023Updated 2 years ago
- A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios☆13Jan 24, 2024Updated 2 years ago
- [AAAI 2024] DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval.☆47Oct 14, 2024Updated last year
- Proton VPN Special Offer - Get 70% off • AdSpecial partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
- [NeurIPS 2024] Official PyTorch implementation of LoTLIP: Improving Language-Image Pre-training for Long Text Understanding☆50Jan 14, 2025Updated last year
- ☆12Jan 10, 2025Updated last year
- [ACL 2023] VSTAR is a multimodal dialogue dataset with scene and topic transition information☆15Oct 27, 2024Updated last year
- ☆17Oct 15, 2023Updated 2 years ago
- ☆21Aug 8, 2024Updated last year
- [ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"☆894Aug 13, 2024Updated last year
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.☆168Jan 30, 2025Updated last year
- 【CVPR'2023 Highlight & TPAMI】Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?☆258Nov 29, 2024Updated last year
- Open-Vocabulary High-Resolution Remote Sensing Image Semantic Segmentation☆29Sep 19, 2025Updated 6 months ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click and start building anything your business needs.
- [AAAI 2024] GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval☆20May 10, 2024Updated last year
- [ICCV 2023] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption☆105Sep 18, 2023Updated 2 years ago
- ☆30Aug 14, 2023Updated 2 years ago
- ☆26Aug 4, 2020Updated 5 years ago
- An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"☆1,026Apr 12, 2024Updated last year
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo…☆29Sep 25, 2024Updated last year
- ☆19Jun 10, 2025Updated 9 months ago
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning☆21Feb 19, 2025Updated last year
- [CVPR2023] All in One: Exploring Unified Video-Language Pre-training☆281Mar 25, 2023Updated 3 years ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click and start building anything your business needs.
- Official implementation of CVPR 2024 paper "vid-TLDR: Training Free Token merging for Light-weight Video Transformer".☆55Oct 21, 2025Updated 5 months ago
- Pytorch Code for "Unified Coarse-to-Fine Alignment for Video-Text Retrieval" (ICCV 2023)☆66Jun 7, 2024Updated last year
- This repo is used for downloading the videos for SVD dataset.☆18Aug 16, 2020Updated 5 years ago
- [ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions☆138May 8, 2025Updated 10 months ago
- ☆25Mar 29, 2025Updated 11 months ago
- Chinese-native image generation while compatible with SD eco-system, 1st-gen, AAAI2025☆13Jun 25, 2024Updated last year
- Official pytorch repository for "TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection" (AAAI 2024 Pape…☆57Feb 22, 2025Updated last year
- ☆60Aug 10, 2023Updated 2 years ago
- Papers about the ultra high resolution tasks.☆13Jul 12, 2024Updated last year
- Wordpress hosting with auto-scaling on Cloudways • AdFully Managed hosting built for WordPress-powered businesses that need reliable, auto-scalable hosting. Cloudways SafeUpdates now available.
- Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval (ICCV 2025 Highlight)☆21Aug 1, 2025Updated 7 months ago
- Chain-of-Frames [CVPR 2026]☆38Jul 2, 2025Updated 8 months ago
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning☆77May 23, 2025Updated 10 months ago
- EVA Series: Visual Representation Fantasies from BAAI☆2,655Aug 1, 2024Updated last year
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions☆2,923May 26, 2025Updated 10 months ago
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale☆124Sep 2, 2024Updated last year
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)☆228Jul 21, 2023Updated 2 years ago