zhongyy / VIoTGPT
Code for the AAAI 2025 paper "VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things"
☆11 Updated 2 weeks ago
Alternatives and similar repositories for VIoTGPT:
Users who are interested in VIoTGPT are comparing it to the repositories listed below
- An LMM that is a strict superset of its embedded LLM ☆37 Updated 2 months ago
- Video dataset dedicated to portrait-mode video recognition. ☆43 Updated last month
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 Updated 7 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 Updated 5 months ago
- Precision Search through Multi-Style Inputs ☆62 Updated 6 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆63 Updated 4 months ago
- ☆19 Updated last year
- The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”. ☆38 Updated 4 months ago
- ☆37 Updated 11 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆40 Updated 7 months ago
- ☆19 Updated last year
- [PR 2024] A large Cross-Modal Video Retrieval Dataset with Reading Comprehension ☆24 Updated last year
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 Updated 4 months ago
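- [WIP@Oct 13] 质衡 benchmark (Q-Bench in Chinese), including Chinese versions of the low-level visual question answering and low-level visual description datasets, as well as image quality assessment under Chinese prompts. We will release Q-Bench in more languages in the futu… ☆20 Updated last year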
- ☆32 Updated 8 months ago
- Official implementation of TagAlign ☆34 Updated last month
- [IJCV 2025] Code for DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection ☆48 Updated last month
- [CVPR 2024] DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model ☆17 Updated 9 months ago
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆14 Updated 3 months ago
- ☆9 Updated 3 weeks ago
- ☆40 Updated 6 months ago
- Official Implementation of HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing ☆21 Updated last month
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a vision encoder. Unifying image understanding and generation. ☆35 Updated 7 months ago
- ☆47 Updated last month
- ☆15 Updated 3 months ago
- An interactive demo based on Segment-Anything for stroke-based painting which enables human-like painting. ☆34 Updated last year
- A multimodal large-scale model, which performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the p… ☆14 Updated 11 months ago
- Official implementation of High Fidelity Scene Text Synthesis. ☆46 Updated last month
- Official repo: SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing ☆51 Updated 9 months ago
- [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆128 Updated 4 months ago