qiulu66 / EgoPlan-Bench2
☆19 Updated last week
Alternatives and similar repositories for EgoPlan-Bench2:
Users who are interested in EgoPlan-Bench2 are comparing it to the repositories listed below:
- Latent Motion Token as the Bridging Language for Robot Manipulation ☆52 Updated last week
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆79 Updated this week
- [ECCV 2024] ControlCap: Controllable Region-level Captioning ☆56 Updated last month
- FQGAN: Factorized Visual Tokenization and Generation ☆36 Updated 2 weeks ago
- Liquid: Language Models are Scalable Multi-modal Generators ☆30 Updated this week
- Egocentric Video Understanding Dataset (EVUD) ☆24 Updated 5 months ago
- Can 3D Vision-Language Models Truly Understand Natural Language? ☆21 Updated 8 months ago
- 👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024) ☆44 Updated last month
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks ☆59 Updated 2 months ago
- ☆58 Updated last year
- VisualGPTScore for visio-linguistic reasoning ☆26 Updated last year
- [ECCV 2024] OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models ☆34 Updated 2 months ago
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model ☆26 Updated 3 weeks ago
- ☆33 Updated 8 months ago
- Diffusion Powers Video Tokenizer for Comprehension and Generation ☆34 Updated last week
- Open implementation of "RandAR" ☆41 Updated last week
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" ☆42 Updated 2 months ago
- Code Release of F-LMM: Grounding Frozen Large Multimodal Models ☆57 Updated 4 months ago
- (NeurIPS 2024 Spotlight) TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment ☆25 Updated 2 months ago
- (ICCV 2023) Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation ☆45 Updated 5 months ago
- ☆33 Updated last month
- Code for paper "Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning" ☆23 Updated last year
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences ☆32 Updated 2 months ago
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models ☆25 Updated last month
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation ☆85 Updated last month
- [CVPR2022 Oral] 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds ☆52 Updated last year
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models ☆37 Updated last year
- [ECCV2022] A PyTorch implementation of the paper "Spatial and Visual Perspective-Taking via View Rotation and Relation Reasoning for Embo… ☆13 Updated last year
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" ☆30 Updated 6 months ago
- Implementation of paper 'Helping Hands: An Object-Aware Ego-Centric Video Recognition Model' ☆31 Updated last year