showlab / assistgui
☆23 Updated 6 months ago
Related projects
Alternatives and complementary repositories for assistgui
- 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents. ☆178 Updated 2 weeks ago
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr… ☆64 Updated 4 months ago
- [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ☆123 Updated 2 months ago
- The model, data and code for the visual GUI Agent SeeClick ☆216 Updated 2 months ago
- ☆131 Updated 10 months ago
- This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆128 Updated 2 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆78 Updated 3 months ago
- PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models ☆242 Updated 10 months ago
- ☆121 Updated last week
- Official code for Paper "Mantis: Multi-Image Instruction Tuning" ☆179 Updated last week
- Long Context Transfer from Language to Vision ☆328 Updated 2 weeks ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆227 Updated last month
- [CVPR'2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ☆224 Updated 4 months ago
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆127 Updated 3 months ago
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆132 Updated last month
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆82 Updated 4 months ago
- (2024CVPR) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding ☆241 Updated 3 months ago
- ☆120 Updated last month
- ☆287 Updated 9 months ago
- VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models ☆27 Updated 3 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆48 Updated last month
- Official implementation of WebVLN: Vision-and-Language Navigation on Websites ☆23 Updated 10 months ago
- ☆126 Updated last week
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆137 Updated this week
- [NeurIPS2024] VideoGUI: A Benchmark for GUI Automation from Instructional Videos ☆21 Updated 3 weeks ago
- Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?" ☆163 Updated 2 months ago
- ☆65 Updated last year
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ☆125 Updated 2 months ago
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆83 Updated 3 weeks ago
- LVBench: An Extreme Long Video Understanding Benchmark ☆59 Updated 2 months ago