leixy20 / Scaffold
Scaffold Prompting to Promote LMMs

Related projects:
- Code for "Interactive Task Planning with Language Models"
- [NeurIPS 2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators"
- Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
- Official implementation of CoVLM: Composing Visual Entities and Relationships in Large Language Models via Communicative Decoding
- Language Repository for Long Video Understanding
- [CVPR'24 Highlight] Official code and data for the paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan…"
- Official repo of "VideoGUI: A Benchmark for GUI Automation from Instructional Videos"
- Official implementation of the paper "Interfacing Foundation Models' Embeddings"
- [COLM 2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
- Multimodal Video Understanding Framework (MVU)
- Evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or…
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents"
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models"
- [ICLR 2023] SQA3D for embodied scene understanding and reasoning
- A collection of research papers on World Models
- Code and datasets for "What’s “up” with vision-language models? Investigating their struggle with spatial reasoning"
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, …
- Code for MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
- Towards Large Multimodal Models as Visual Foundation Agents
- Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
- [CVPR 2024] Official implementation of MP5