leixy20 / Scaffold
Scaffold Prompting to Promote LMMs

Related projects:
- Code for "Interactive Task Planning with Language Models"
- [NeurIPS 2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
- Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
- Official implementation of CoVLM: Composing Visual Entities and Relationships in Large Language Models via Communicative Decoding
- Language Repository for Long Video Understanding
- [CVPR'24 Highlight] Official code and data for the paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan…"
- Official repo of "VideoGUI: A Benchmark for GUI Automation from Instructional Videos"
- Official implementation of the paper "Interfacing Foundation Models' Embeddings"
- [COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
- Multimodal Video Understanding Framework (MVU)
- Evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents"
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models"
- [ICLR 2023] SQA3D for embodied scene understanding and reasoning
- A collection of research papers on World Models
- Code and datasets for "What’s “up” with vision-language models? Investigating their struggle with spatial reasoning"
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …
- Code for MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
- Towards Large Multimodal Models as Visual Foundation Agents
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
- [CVPR 2024] Official implementation of MP5