zjr2000 / Awesome-Multimodal-Chatbot
Awesome Multimodal Assistant is a curated list of multimodal chatbots/conversational assistants that utilize various modes of interaction, such as text, speech, images, and videos, to provide a seamless and versatile user experience.
☆74 · Updated last year
Alternatives and similar repositories for Awesome-Multimodal-Chatbot:
Users that are interested in Awesome-Multimodal-Chatbot are comparing it to the libraries listed below
- VideoLLM: Modeling Video Sequence with Large Language Models ☆154 · Updated last year
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆249 · Updated last year
- ☆66 · Updated last year
- Recent advancements propelled by large language models (LLMs), encompassing an array of domains including Vision, Audio, Agent, Robotics,… ☆117 · Updated last month
- Official repo for StableLLAVA ☆94 · Updated last year
- (CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding ☆272 · Updated 6 months ago
- Code for the paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos" ☆92 · Updated 5 months ago
- [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs ☆138 · Updated 6 months ago
- ☆92 · Updated 8 months ago
- (CVPR 2024) A benchmark for evaluating Multimodal LLMs using multiple-choice questions. ☆326 · Updated 2 weeks ago
- Explore VLM-Eval, a framework for evaluating Video Large Language Models, enhancing your video analysis with cutting-edge AI technology. ☆31 · Updated last year
- ☆156 · Updated 3 months ago
- Implementation of PALI3 from the paper "PALI-3 Vision Language Models: Smaller, Faster, Stronger" ☆143 · Updated this week
- A new multi-shot video understanding benchmark, Shot2Story, with comprehensive video summaries and detailed shot-level captions. ☆107 · Updated 4 months ago
- The official repository of "Video assistant towards large language model makes everything easy" ☆217 · Updated last month
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆127 · Updated last month
- The official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆163 · Updated last month
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆72 · Updated 3 months ago
- [TMLR] Public code repo for the paper "A Single Transformer for Scalable Vision-Language Modeling" ☆128 · Updated 2 months ago
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆138 · Updated 7 months ago
- [NeurIPS 2024] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆27 · Updated last month
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ☆137 · Updated 4 months ago
- Long Context Transfer from Language to Vision ☆359 · Updated 2 months ago
- Official code for the paper "Mantis: Multi-Image Instruction Tuning" (TMLR 2024) ☆196 · Updated this week
- [COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ☆134 · Updated 5 months ago
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024) ☆282 · Updated last week
- [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of … ☆473 · Updated 5 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆308 · Updated 6 months ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆223 · Updated 5 months ago
- A Survey on Benchmarks of Multimodal Large Language Models ☆83 · Updated 3 weeks ago