huggingface / fineVideo
☆71 · Updated 6 months ago
Alternatives and similar repositories for fineVideo:
Users that are interested in fineVideo are comparing it to the libraries listed below
- Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation" ☆168 · Updated 9 months ago
- 🦾 EvalGIM (pronounced as "EvalGym") is an evaluation library for generative image models. It enables easy-to-use, reproducible automatic… ☆68 · Updated 3 months ago
- Recaption large (Web)Datasets with vllm and save the artifacts. ☆48 · Updated 3 months ago
- Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis. ☆120 · Updated last month
- Python library to evaluate VLM models' robustness across diverse benchmarks ☆195 · Updated this week
- Focused on fast experimentation and simplicity ☆69 · Updated 2 months ago
- ☆63 · Updated 5 months ago
- Video-LlaVA fine-tune for CinePile evaluation ☆50 · Updated 7 months ago
- M4 experiment logbook ☆57 · Updated last year
- Multimodal language model benchmark, featuring challenging examples ☆160 · Updated 3 months ago
- ☆73 · Updated 5 months ago
- Implementation of the proposed MaskBit from Bytedance AI ☆75 · Updated 4 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆57 · Updated 3 weeks ago
- LL3M: Large Language and Multi-Modal Model in Jax ☆70 · Updated 10 months ago
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆234 · Updated 2 months ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆68 · Updated 3 months ago
- Official code for paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024] ☆208 · Updated this week
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆150 · Updated 3 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆199 · Updated 2 months ago
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆130 · Updated 4 months ago
- A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions. ☆124 · Updated last month
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆143 · Updated 9 months ago
- Matryoshka Multimodal Models ☆98 · Updated last month
- A family of highly capable yet efficient large multimodal models ☆178 · Updated 6 months ago
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch ☆160 · Updated 2 months ago
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya ☆107 · Updated 3 weeks ago
- Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets. ☆156 · Updated 11 months ago
- Implementation of PALI3 from the paper "PALI-3 Vision Language Models: Smaller, Faster, Stronger" ☆145 · Updated last month