mfarre / Video-LLaVA-7B-hf-CinePile
Video-LLaVA fine-tune for CinePile evaluation
☆33 Updated last month
Related projects:
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆115 Updated 2 weeks ago
- Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆107 Updated last month
- ☆74 Updated 8 months ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆100 Updated 2 months ago
- Repository for the paper: "TiC-CLIP: Continual Training of CLIP Models". ☆90 Updated 3 months ago
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆27 Updated 5 months ago
- This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?" ☆115 Updated 3 months ago
- ☆58 Updated 10 months ago
- ☆65 Updated this week
- Object Recognition as Next Token Prediction (CVPR 2024) ☆153 Updated 2 months ago
- ☆45 Updated 2 months ago
- ☆131 Updated 3 weeks ago
- Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024) ☆142 Updated 2 months ago
- EILEV: Efficient In-Context Learning in Vision-Language Models for Egocentric Videos ☆108 Updated 3 months ago
- Code and Data for Paper: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data ☆30 Updated 6 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆36 Updated last month
- This is a public repository for Image Clustering Conditioned on Text Criteria (IC|TC) ☆74 Updated 6 months ago
- ☆64 Updated 11 months ago
- Matryoshka Multimodal Models ☆67 Updated 3 weeks ago
- Official implementation of the Law of Vision Representation in MLLMs ☆93 Updated last week
- Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch ☆97 Updated 11 months ago
- (WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, B… ☆77 Updated last week
- Data-Efficient Multimodal Fusion on a Single GPU ☆45 Updated 4 months ago
- [CVPR 2024] Official PyTorch implementation of "ECLIPSE: Revisiting the Text-to-Image Prior for Efficient Image Generation" ☆61 Updated 4 months ago
- ☆47 Updated 11 months ago
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆93 Updated last month
- Official Implementation of weights2weights ☆98 Updated last week
- Efficient Multi-modal Models via Stage-wise Visual Context Compression ☆34 Updated last month
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆132 Updated 3 months ago
- ☆84 Updated 8 months ago