huggingface / fineVideo
☆84 · Updated 11 months ago

Alternatives and similar repositories for fineVideo
Users interested in fineVideo are comparing it to the libraries listed below.
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆178 · Updated last year
- 🦾 EvalGIM (pronounced as "EvalGym") is an evaluation library for generative image models. It enables easy-to-use, reproducible automatic… ☆82 · Updated 8 months ago
- Implementation of the proposed MaskBit from Bytedance AI ☆82 · Updated 9 months ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆77 · Updated 8 months ago
- ☆77 · Updated 3 months ago
- Official PyTorch implementation of TokenSet. ☆121 · Updated 5 months ago
- Data release for the ImageInWords (IIW) paper. ☆218 · Updated 9 months ago
- Video-LLaVA fine-tune for CinePile evaluation ☆51 · Updated last year
- [ICML 2025] This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?" ☆138 · Updated last year
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. TMLR 2025. ☆93 · Updated 3 months ago
- Implementation of a multimodal diffusion transformer in Pytorch ☆103 · Updated last year
- M4 experiment logbook ☆58 · Updated 2 years ago
- Official PyTorch Implementation for Paper "No More Adam: Learning Rate Scaling at Initialization is All You Need" ☆52 · Updated 7 months ago
- Python Library to evaluate VLM models' robustness across diverse benchmarks ☆210 · Updated 2 weeks ago
- LL3M: Large Language and Multi-Modal Model in Jax ☆74 · Updated last year
- Recaption large (Web)Datasets with vllm and save the artifacts. ☆52 · Updated 9 months ago
- ☆80 · Updated 11 months ago
- This is the repo for the paper "PANGEA: A FULLY OPEN MULTILINGUAL MULTIMODAL LLM FOR 39 LANGUAGES" ☆110 · Updated 2 months ago
- EILeV: Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties ☆130 · Updated 9 months ago
- Educational repository for applying the main video data curation techniques presented in the Stable Video Diffusion paper. ☆80 · Updated last year
- ☆78 · Updated 5 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆210 · Updated 7 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆83 · Updated last month
- Supercharged BLIP-2 that can handle videos ☆120 · Updated last year
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆136 · Updated last year
- A big_vision inspired repo that implements a generic Auto-Encoder class capable of representation learning and generative modeling. ☆34 · Updated last year
- Quick Long Video Understanding ☆62 · Updated 2 months ago
- Official Implementation of LaViDa: A Large Diffusion Language Model for Multimodal Understanding ☆136 · Updated last month
- Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024) ☆178 · Updated last year
- An open source implementation of CLIP (With TULIP Support) ☆162 · Updated 3 months ago