TencentARC / ViT-Lens
[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
☆170 · Updated 2 weeks ago
Alternatives and similar repositories for ViT-Lens:
Users who are interested in ViT-Lens are comparing it to the libraries listed below
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆121 · Updated 6 months ago
- Official repository of the paper "Subobject-level Image Tokenization" ☆65 · Updated 9 months ago
- PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models" ☆192 · Updated last month
- [ICML 2024] Official implementation of the paper "Rejuvenating image-GPT as Strong Visual Representation Learners" ☆97 · Updated 9 months ago
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression ☆51 · Updated this week
- Explore the Limits of Omni-modal Pretraining at Scale ☆96 · Updated 5 months ago
- [CVPR'24] Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities ☆98 · Updated 11 months ago
- [CVPR 24] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), plus links for downloading… ☆211 · Updated 4 months ago
- Official repo for StableLLAVA ☆94 · Updated last year
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models! ☆125 · Updated last year
- ☆94 · Updated 9 months ago
- PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding; accepted to CVPR 2024. ☆206 · Updated last week
- [ICLR 2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆66 · Updated 3 weeks ago
- [ECCV 2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆90 · Updated 7 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆105 · Updated last month
- ☆72 · Updated 9 months ago
- [NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding ☆62 · Updated last month
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆229 · Updated 3 weeks ago
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" ☆80 · Updated 11 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆101 · Updated 2 weeks ago
- [ICLR 2025] Diffusion Feedback Helps CLIP See Better ☆258 · Updated 3 weeks ago
- [COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ☆134 · Updated 5 months ago
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ☆95 · Updated 7 months ago
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47 · Updated 6 months ago
- Open-source implementation of "Vision Transformers Need Registers" ☆163 · Updated 3 weeks ago
- [CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding ☆274 · Updated 7 months ago
- ☆160 · Updated 4 months ago
- Official implementation of the CrossMAE paper: Rethinking Patch Dependence for Masked Autoencoders ☆100 · Updated 2 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆250 · Updated last year
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆116 · Updated 7 months ago