[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43Dec 25, 2024Updated last year
Alternatives and similar repositories for COSA
Users that are interested in COSA are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- Official PyTorch implementation of the paper "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner"☆15Aug 9, 2023Updated 2 years ago
- ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without rely…☆55Sep 4, 2023Updated 2 years ago
- [TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset☆310Dec 25, 2024Updated last year
- This repo holds the official code and data for "Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with H…☆16May 21, 2024Updated last year
- This repo holds the official code and data for "Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentati…☆72Jun 3, 2024Updated last year
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- [NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset☆300Mar 14, 2024Updated 2 years ago
- ☆35Jun 6, 2023Updated 2 years ago
- ☆12Nov 5, 2024Updated last year
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P…☆25Nov 28, 2023Updated 2 years ago
- 【NeurIPS 2024】The official code of paper "Automated Multi-level Preference for MLLMs"☆22Sep 26, 2024Updated last year
- Official Code for "Knowing what it is: Semantic-enhanced Dual Attention Transformer" (TMM2022)☆19Oct 15, 2022Updated 3 years ago
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs☆57Mar 9, 2025Updated last year
- ☆14Aug 13, 2021Updated 4 years ago
- Official code for the ICLR 2025 paper, "Ada-K Routing: Boosting the Efficiency of MoE-based LLMs"☆12Mar 1, 2025Updated last year
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- Controllable mage captioning model with unsupervised modes☆21Apr 14, 2023Updated 3 years ago
- MAVERICS (Manually-vAlidated Vq^2a Examples fRom Image-Caption datasetS) is a suite of test-only benchmarks for visual question answering…☆13Feb 18, 2023Updated 3 years ago
- (CVPR2024)A benchmark for evaluating Multimodal LLMs using multiple-choice questions.☆363Jan 14, 2025Updated last year
- [ICCV 2023] With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning.☆19Jun 7, 2024Updated last year
- [ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset☆91Sep 6, 2023Updated 2 years ago
- [ECCV'22 Poster] Explicit Image Caption Editing☆22Nov 30, 2022Updated 3 years ago
- ☆16Jun 4, 2023Updated 2 years ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …☆131Apr 4, 2025Updated last year
- [EMNLP 2025 Main] Official implementation of VRoPE: Rotary Position Embedding for Video Large Language Models.☆27Nov 18, 2025Updated 5 months ago
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- [ICCVW 2021] Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement☆20Aug 18, 2021Updated 4 years ago
- OpenGL based 3D engine with a viewer API for pointcloud, surface meshes☆14May 10, 2023Updated 2 years ago
- Public repository for DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video Code accompan…☆21Apr 7, 2021Updated 5 years ago
- Video-Language Alignment via Spatio–Temporal Graph Transformer; ArXiv: https://arxiv.org/abs/2407.11677☆14Jul 24, 2024Updated last year
- ☆24May 8, 2024Updated last year
- [ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain☆107Mar 14, 2024Updated 2 years ago
- Implementations of Multi-Task and Meta-Learning baselines for the Metaworld benchmark☆35Aug 20, 2025Updated 8 months ago
- CaMEL: Mean Teacher Learning for Image Captioning. ICPR 2022☆30Dec 1, 2022Updated 3 years ago
- Coursework for Mathematics for Machine Learning (70015) at Imperial College London☆10Nov 12, 2024Updated last year
- GPUs on demand by Runpod - Special Offer Available • AdRun AI, ML, and HPC workloads on powerful cloud GPUs—without limits or wasted spend. Deploy GPUs in under a minute and pay by the second.
- ☆18Jul 8, 2025Updated 9 months ago
- Official code for CVPR 2024 paper, "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models"☆16Apr 22, 2024Updated 2 years ago
- ☆259Dec 10, 2022Updated 3 years ago
- [ICML 2025] Code for "R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts"☆19Mar 10, 2025Updated last year
- [CVPR 2023] Official code for paper: Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detecti…☆30Jun 23, 2023Updated 2 years ago
- [ECCV 2022] A pytorch implementation for TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval☆79Nov 29, 2022Updated 3 years ago
- [NeurIPS 2024] Calibrated Self-Rewarding Vision Language Models☆88Oct 26, 2025Updated 6 months ago