AgiBot-World / VideoDataset
A GPU-accelerated library that enables random frame access and efficient video decoding for data loading.
☆59 Updated this week
Alternatives and similar repositories for VideoDataset
Users who are interested in VideoDataset are comparing it to the libraries listed below.
- Codebase for the Recognize Anything Model (RAM) ☆88 Updated 2 years ago
- A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from scratch with ease. ☆248 Updated 9 months ago
- MiMo-Embodied ☆349 Updated 2 months ago
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆865 Updated 3 months ago
- Cosmos-Reason2 models understand the physical common sense and generate appropriate embodied decisions in natural language through long c… ☆186 Updated last week
- ☆72 Updated 2 months ago
- Cook up amazing multimodal AI applications effortlessly with MiniCPM-o ☆290 Updated this week
- Scaling Spatial Intelligence with Multimodal Foundation Models ☆160 Updated 3 weeks ago
- Use Segment Anything 2, grounded with Florence-2, to auto-label data for use in training vision models. ☆134 Updated last year
- minisora-DiT, a DiT reproduction based on XTuner from the open source community MiniSora ☆40 Updated last year
- The official repository of the dots.vlm1 instruct models proposed by rednote-hilab. ☆284 Updated 4 months ago
- Florence-2 ☆72 Updated 11 months ago
- AutoTrackAnything is a universal, flexible and interactive tool for insane automatic object tracking over thousands of frames. It is deve… ☆92 Updated last year
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs ☆96 Updated last week
- [ICCV2025] Referring any person or objects given a natural language description. Code base for RexSeek and HumanRef Benchmark ☆177 Updated 3 months ago
- Zero-copy multimodal vector DB with CUDA and CLIP/SigLIP ☆64 Updated 9 months ago
- Megvii FILE Library - Working with Files in Python same as the standard library ☆168 Updated 3 weeks ago
- Scaling Vision Pre-Training to 4K Resolution ☆221 Updated last month
- Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning ☆139 Updated 7 months ago
- ☆85 Updated last month
- 🏄 [ICLR 2025] OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer ☆87 Updated 6 months ago
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer" ☆79 Updated 3 months ago
- mllm-npu: training multimodal large language models on Ascend NPUs ☆95 Updated last year
- [ICCV2023] EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding ☆78 Updated 2 years ago
- codewithgpu.com python client package ☆20 Updated 2 years ago
- [ICCV 2025] Detect Anything 3D in the Wild ☆246 Updated last month
- RayGen: Multi-Modal Dataset Reinforcement for MobileCLIP and MobileCLIP2 ☆37 Updated 5 months ago
- A lightweight and highly efficient training framework for accelerating diffusion tasks. ☆51 Updated last year
- ComfyUI YOLO-World Integration ☆48 Updated last year
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens ☆201 Updated 2 years ago