rese1f / STEVE
STEVE in Minecraft is for See and Think: Embodied Agent in Virtual Environment
☆30 · Updated 10 months ago
Related projects
Alternatives and complementary repositories for STEVE
- [NIPS24W] This repo is the official implementation of "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated… ☆73 · Updated 4 months ago
- [CVPR2024] This is the official implementation of MP5 ☆84 · Updated 4 months ago
- Official implementation of "Self-Improving Video Generation" ☆52 · Updated last week
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ☆47 · Updated this week
- ☆40 · Updated 11 months ago
- ☆61 · Updated last month
- VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆82 · Updated 4 months ago
- Official repository of S-Agents: Self-organizing Agents in Open-ended Environment ☆17 · Updated 8 months ago
- Code and Data for Paper: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation ☆73 · Updated last year
- ☆29 · Updated last week
- Official repo for StableLLAVA ☆91 · Updated 10 months ago
- GROOT: Learning to Follow Instructions by Watching Gameplay Videos ☆56 · Updated 11 months ago
- [CVPR'24 Highlight] The official code and data for paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan… ☆48 · Updated 2 weeks ago
- Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆47 · Updated 4 months ago
- Video Generation, Physical Commonsense, Semantic Adherence, VideoCon-Physics ☆55 · Updated last month
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆27 · Updated 5 months ago
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆38 · Updated 7 months ago
- [NeurIPS2023] Official implementation of the paper "Large Language Models are Visual Reasoning Coordinators" ☆102 · Updated last year
- [ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, … ☆89 · Updated last week
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression ☆39 · Updated 3 months ago
- ☆75 · Updated this week
- Official implementation of MIA-DPO ☆40 · Updated 2 weeks ago
- Official implementation of WebVLN: Vision-and-Language Navigation on Websites ☆23 · Updated 10 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" ☆36 · Updated last month
- [NeurIPS-2024] The official implementation of "Instruction-Guided Visual Masking" ☆29 · Updated this week
- Empowering Unified MLLM with Multi-granular Visual Generation ☆106 · Updated last month
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆107 · Updated 4 months ago
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction". ☆44 · Updated 3 weeks ago
- T2VScore: Towards A Better Metric for Text-to-Video Generation ☆78 · Updated 7 months ago