wangyuchi369 / LaDiC
[NAACL 2024] LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-text Generation?
☆38 Updated 10 months ago
Alternatives and similar repositories for LaDiC:
Users that are interested in LaDiC are comparing it to the libraries listed below
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆68 Updated 6 months ago
- Official implementation of MIA-DPO ☆56 Updated 3 months ago
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation ☆81 Updated 3 weeks ago
- (ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph. ☆21 Updated 3 months ago
- ☆35 Updated 9 months ago
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆155 Updated 2 months ago
- A Massive Multi-Discipline Lecture Understanding Benchmark ☆16 Updated this week
- HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation ☆57 Updated 2 months ago
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation ☆53 Updated last week
- ☆82 Updated last month
- Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models ☆71 Updated 11 months ago
- Official repository for the paper "Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation" ☆16 Updated last week
- VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023) ☆44 Updated last year
- Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning ☆151 Updated 2 weeks ago
- [ICML 2024] On Discrete Prompt Optimization for Diffusion Models - Google ☆53 Updated 8 months ago
- CLIP-MoE: Mixture of Experts for CLIP ☆32 Updated 6 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆111 Updated last month
- Official repo for StableLLAVA ☆95 Updated last year
- [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment ☆56 Updated 7 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆99 Updated last week
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression ☆55 Updated 2 months ago
- Code and Data for "GenAI Arena: An Open Evaluation Platform for Generative Models" [NeurIPS 2024] ☆19 Updated 7 months ago
- [EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding ☆50 Updated last year
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? ☆55 Updated last month
- ✨✨The Curse of Multi-Modalities (CMM): Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio ☆46 Updated 6 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆43 Updated 3 months ago
- LLMBind: A Unified Modality-Task Integration Framework ☆18 Updated 10 months ago
- Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023) ☆56 Updated last year
- Code for ICLR 2025 Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM ☆57 Updated 2 weeks ago
- ☆24 Updated 2 months ago