☆72 · Jul 17, 2024 · Updated last year
Alternatives and similar repositories for QA-ViT
Users that are interested in QA-ViT are comparing it to the libraries listed below.
- ☆24 · Apr 29, 2025 · Updated last year
- ☆13 · May 21, 2024 · Updated 2 years ago
- Official repository for "Boosting Audio Visual Question Answering via Key Semantic-Aware Cues" in ACM MM 2024. ☆16 · Oct 25, 2024 · Updated last year
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions ☆55 · Apr 17, 2024 · Updated 2 years ago
- (ICCV 2025) Enhance CLIP and MLLM's fine-grained visual representations with generative models. ☆78 · Jun 25, 2025 · Updated 10 months ago
- Multimodal Semi-Supervised Learning for Text Recognition (SemiMTR) ☆83 · Sep 12, 2023 · Updated 2 years ago
- ☆15 · May 10, 2026 · Updated last week
- ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs ☆29 · Aug 15, 2025 · Updated 9 months ago
- Official repo for paper "[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs" ☆22 · Apr 23, 2025 · Updated last year
- STVQA and TextVQA OCR results from Amazon Text in Image pipeline ☆12 · Jul 18, 2022 · Updated 3 years ago
- ☆16 · Dec 25, 2021 · Updated 4 years ago
- [AAAI 24] Official Codebase for BridgeQA: Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA ☆27 · Jul 12, 2024 · Updated last year
- ☆14 · May 3, 2022 · Updated 4 years ago
- ☆12 · Jul 19, 2022 · Updated 3 years ago
- ☆32 · Jul 29, 2024 · Updated last year
- TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers ☆21 · Jul 26, 2022 · Updated 3 years ago
- Official Repository for "Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality" (ECCV 2024) ☆16 · Oct 29, 2024 · Updated last year
- Code for paper: "Privately generating tabular data using language models". ☆15 · Jun 13, 2023 · Updated 2 years ago
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024) ☆326 · Jan 20, 2025 · Updated last year
- Some papers about *diverse* image (a few videos) captioning ☆26 · Apr 4, 2023 · Updated 3 years ago
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023 ☆16 · Nov 28, 2024 · Updated last year
- [CVPR 2024] Do you remember? Dense Video Captioning with Cross-Modal Memory Retrieval ☆66 · Jun 19, 2024 · Updated last year
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [ACM MM'24] ☆10 · Jul 22, 2024 · Updated last year
- Code for Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models ☆91 · Jun 28, 2024 · Updated last year
- Question-Aware Gaussian Experts for Audio-Visual Question Answering -- Official Pytorch Implementation (CVPR'25, Highlight) ☆29 · Jun 6, 2025 · Updated 11 months ago
- ☆29 · Jul 25, 2025 · Updated 9 months ago
- Unified Audio-Visual Perception for Multi-Task Video Localization ☆31 · Apr 19, 2024 · Updated 2 years ago
- ☆38 · Jul 24, 2023 · Updated 2 years ago
- 🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospe… ☆57 · Jan 22, 2026 · Updated 4 months ago
- ☆28 · Jul 18, 2025 · Updated 10 months ago
- ☆15 · Jan 9, 2026 · Updated 4 months ago
- ☆17 · Sep 23, 2024 · Updated last year
- Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types ☆32 · Jul 16, 2025 · Updated 10 months ago
- The repo for "On-the-fly Modulation for Balanced Multimodal Learning", T-PAMI 2024 ☆19 · Sep 29, 2024 · Updated last year
- An implementation of the Holistic Pursuit for the Multi-Layer Sparse Coding model. Contains a comparison to the projection pursuit algori… ☆19 · Dec 19, 2018 · Updated 7 years ago
- ☆97 · Sep 19, 2024 · Updated last year
- [CVPR 2025] COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training ☆41 · Mar 27, 2025 · Updated last year
- Can 3D Vision-Language Models Truly Understand Natural Language? ☆20 · Mar 28, 2024 · Updated 2 years ago
- ☆11 · May 24, 2024 · Updated last year