SHI-Labs / VCoder
VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023 / CVPR 2024
☆261 · Updated 7 months ago
Related projects
Alternatives and complementary repositories for VCoder
- Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" ☆142 · Updated last week
- Multimodal Models in Real World ☆403 · Updated 3 weeks ago
- ☆146 · Updated last month
- LLaVA-Interactive-Demo ☆352 · Updated 3 months ago
- Data release for the ImageInWords (IIW) paper ☆200 · Updated this week
- Long Context Transfer from Language to Vision ☆334 · Updated 3 weeks ago
- ☆278 · Updated 2 weeks ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆212 · Updated 3 months ago
- Official repository for the paper PLLaVA ☆593 · Updated 3 months ago
- ☆166 · Updated 4 months ago
- Official implementation of SEED-LLaMA (ICLR 2024) ☆579 · Updated 2 months ago
- A family of highly capable yet efficient large multimodal models ☆166 · Updated 2 months ago
- LLM2CLIP makes the SOTA pretrained CLIP model even more SOTA ☆242 · Updated this week
- [CVPR 24] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloadin… ☆202 · Updated last month
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ☆395 · Updated 7 months ago
- Official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆130 · Updated 3 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆297 · Updated 4 months ago
- [ICCV 2023] Segment Every Reference Object in Spatial and Temporal Spaces ☆235 · Updated 10 months ago
- Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding" ☆217 · Updated 3 months ago
- [TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks ☆224 · Updated 10 months ago
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆173 · Updated 2 months ago
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆134 · Updated 3 weeks ago
- ☆165 · Updated 4 months ago
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions ☆126 · Updated 9 months ago
- [NeurIPS'23] "MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing" ☆311 · Updated 5 months ago
- Official repo for StableLLAVA ☆91 · Updated 10 months ago
- Official code for "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model" ☆312 · Updated this week
- Official code for the paper "Mantis: Multi-Image Instruction Tuning" (TMLR 2024) ☆184 · Updated this week
- ControlLLM: Augment Language Models with Tools by Searching on Graphs ☆186 · Updated 4 months ago
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆134 · Updated 5 months ago