BeingBeyond / Being-VL-0.5
Being-VL-0.5: Unified Multimodal Understanding via Byte-Pair Visual Encoding
☆23 · Updated last month
Alternatives and similar repositories for Being-VL-0.5
Users interested in Being-VL-0.5 are comparing it to the libraries listed below.
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆86 · Updated 6 months ago
- ☆71 · Updated 9 months ago
- Egocentric Video Understanding Dataset (EVUD) ☆31 · Updated last year
- ☆80 · Updated last month
- ElasticTok: Adaptive Tokenization for Image and Video ☆75 · Updated 10 months ago
- [ICLR'25] Reconstructive Visual Instruction Tuning ☆106 · Updated 4 months ago
- [ECCV2024, Oral, Best Paper Finalist] This is the official implementation of the paper "LEGO: Learning EGOcentric Action Frame Generation… ☆37 · Updated 6 months ago
- ☆77 · Updated last year
- [ICLR2025] Official code implementation of Video-UTR: Unhackable Temporal Rewarding for Scalable Video MLLMs ☆58 · Updated 6 months ago
- ☆218 · Updated 3 weeks ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ☆86 · Updated 11 months ago
- Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning ☆207 · Updated 4 months ago
- [ICCV2025 Oral] Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos ☆122 · Updated 3 months ago
- [NeurIPS2024] Official code for (IMA) Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs ☆21 · Updated 10 months ago
- The official repository for our paper, "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning". ☆137 · Updated last month
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆125 · Updated 3 months ago
- [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds ☆94 · Updated last year
- Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World Models" (NeurIPS 2024), https://arxiv.org/abs/2405.15223 ☆141 · Updated 3 months ago
- Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces ☆80 · Updated 2 months ago
- ☆48 · Updated 2 weeks ago
- ☆153 · Updated 10 months ago
- Long-RL: Scaling RL to Long Sequences ☆597 · Updated 2 weeks ago
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆382 · Updated 4 months ago
- [NeurIPS '24 D&B] Official Dataloader and Evaluation Scripts for LongVideoBench. ☆107 · Updated last year
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision ☆101 · Updated 3 weeks ago
- [EMNLP 2024] A Video Chat Agent with Temporal Prior ☆32 · Updated 6 months ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning ☆110 · Updated 2 weeks ago
- ☆88 · Updated 2 months ago
- ☆38 · Updated 6 months ago
- [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models … ☆67 · Updated 3 months ago