[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
☆43Dec 25, 2024Updated last year
Alternatives and similar repositories for COSA
Users that are interested in COSA are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- Official PyTorch implementation of the paper "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner"☆15Aug 9, 2023Updated 2 years ago
- ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without rely…☆55Sep 4, 2023Updated 2 years ago
- [TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset☆311Dec 25, 2024Updated last year
- [NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset☆302Mar 14, 2024Updated 2 years ago
- ☆12Nov 5, 2024Updated last year
- End-to-end encrypted cloud storage - Proton Drive • AdSpecial offer: 40% Off Yearly / 80% Off First Month. Protect your most important files, photos, and documents from prying eyes.
- This repository contains the dataset, codebase, and benchmarks for our paper: <CNVid-3.5M: Build, Filter, and Pre-train the Large-scale P…☆26Nov 28, 2023Updated 2 years ago
- 【NeurIPS 2024】The official code of paper "Automated Multi-level Preference for MLLMs"☆22Sep 26, 2024Updated last year
- ☆19Dec 22, 2022Updated 3 years ago
- Official Code for "Knowing what it is: Semantic-enhanced Dual Attention Transformer" (TMM2022)☆19Oct 15, 2022Updated 3 years ago
- Code for "APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training"☆42Dec 23, 2025Updated 6 months ago
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs☆57Mar 9, 2025Updated last year
- Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git…☆14Apr 9, 2025Updated last year
- ☆14Aug 13, 2021Updated 4 years ago
- Official code for the ICLR 2025 paper, "Ada-K Routing: Boosting the Efficiency of MoE-based LLMs"☆12Mar 1, 2025Updated last year
- Bare Metal GPUs on DigitalOcean Gradient AI • AdPurpose-built for serious AI teams training foundational models, running large-scale inference, and pushing the boundaries of what's possible.
- Controllable mage captioning model with unsupervised modes☆21Apr 14, 2023Updated 3 years ago
- MAVERICS (Manually-vAlidated Vq^2a Examples fRom Image-Caption datasetS) is a suite of test-only benchmarks for visual question answering…☆13Feb 18, 2023Updated 3 years ago
- Generating Video Caption Using LSTM☆12May 29, 2023Updated 3 years ago
- (CVPR2024)A benchmark for evaluating Multimodal LLMs using multiple-choice questions.☆364Jan 14, 2025Updated last year
- [ICCV 2023] With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning.☆19Jun 7, 2024Updated 2 years ago
- [ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset☆91Sep 6, 2023Updated 2 years ago
- Med-DANet Series (ECCV 2022 & WACV 2024)☆13Jan 2, 2024Updated 2 years ago
- Qt/Qml application using Google speech-to-text API to make voice commands☆11Jan 19, 2020Updated 6 years ago
- ☆40Feb 17, 2026Updated 4 months ago
- Serverless GPU API endpoints on Runpod - Get Bonus Credits • AdSkip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
- ☆26Nov 7, 2023Updated 2 years ago
- [ECCV'22 Poster] Explicit Image Caption Editing☆22Nov 30, 2022Updated 3 years ago
- ☆17Jun 4, 2023Updated 3 years ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …☆132Apr 4, 2025Updated last year
- [EMNLP 2025 Main] Official implementation of VRoPE: Rotary Position Embedding for Video Large Language Models.☆28Nov 18, 2025Updated 7 months ago
- Official code for the paper, "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter".☆17Jun 20, 2023Updated 3 years ago
- [ICCVW 2021] Rethinking Content and Style: Exploring Bias for Unsupervised Disentanglement☆20Aug 18, 2021Updated 4 years ago
- OpenGL based 3D engine with a viewer API for pointcloud, surface meshes☆14May 10, 2023Updated 3 years ago
- Public repository for DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video Code accompan…☆21Apr 7, 2021Updated 5 years ago
- GPU virtual machines on DigitalOcean Gradient AI • AdGet to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
- Video-Language Alignment via Spatio–Temporal Graph Transformer; ArXiv: https://arxiv.org/abs/2407.11677☆15Jul 24, 2024Updated last year
- [IJCAI 2024] CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning☆26Feb 1, 2024Updated 2 years ago
- [ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain☆107Mar 14, 2024Updated 2 years ago
- Implementations of Multi-Task and Meta-Learning baselines for the Metaworld benchmark☆39May 20, 2026Updated last month
- Coursework for Mathematics for Machine Learning (70015) at Imperial College London☆10Nov 12, 2024Updated last year
- ☆18Jul 8, 2025Updated 11 months ago
- ☆260Dec 10, 2022Updated 3 years ago