huggingface / fineVideo

☆77

Alternatives and similar repositories for fineVideo

Users who are interested in fineVideo are comparing it to the repositories listed below.

mfarre / Video-LLaVA-7B-hf-CinePile
Video-LLaVA fine-tune for CinePile evaluation
☆51 Updated 11 months ago
FreedomIntelligence / LongLLaVA
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
☆206 Updated 6 months ago
Gengzigang / TokenSet
Official PyTorch implementation of TokenSet.
☆121 Updated 3 months ago
facebookresearch / unibench
Python Library to evaluate VLM models' robustness across diverse benchmarks
☆208 Updated last week
sayakpaul / simple-image-recaptioning
Recaption large (Web)Datasets with vllm and save the artifacts.
☆52 Updated 7 months ago
kyegomez / Sora
Implementation of the premier Text to Video model from OpenAI
☆57 Updated 8 months ago
facebookresearch / EvalGIM
🦾 EvalGIM (pronounced as "EvalGym") is an evaluation library for generative image models. It enables easy-to-use, reproducible automatic…
☆82 Updated 6 months ago
mbzuai-oryx / PALO
(WACV 2025 - Oral) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, H…
☆84 Updated 5 months ago
mshukor / UnIVAL
[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.
☆228 Updated last year
EvolvingLMMs-Lab / Aero-1
☆74 Updated 2 months ago
AskYoutubeAI / AskVideos-VideoCLIP
☆78 Updated 9 months ago
bytedance / Shot2Story
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
☆144 Updated 5 months ago
lucidrains / maskbit-pytorch
Implementation of the proposed MaskBit from Bytedance AI
☆82 Updated 8 months ago
chenllliang / DnD-Transformer
[ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr…
☆76 Updated 7 months ago
neulab / Pangea
This is the repo for the paper "PANGEA: A FULLY OPEN MULTILINGUAL MULTIMODAL LLM FOR 39 LANGUAGES"
☆108 Updated 3 weeks ago
google / imageinwords
Data release for the ImageInWords (IIW) paper.
☆217 Updated 8 months ago
lucidrains / titok-pytorch
Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation"
☆175 Updated last year
hyc2026 / StoryTeller
☆75 Updated 4 months ago
lucidrains / multimodal-dit-pytorch
Implementation of a multimodal diffusion transformer in Pytorch
☆102 Updated last year
SHI-Labs / CuMo
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
☆151 Updated last year
jiasenlu / LL3M
LL3M: Large Language and Multi-Modal Model in Jax
☆72 Updated last year
RWKV / RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best…
☆49 Updated 4 months ago
lucasjinreal / ImageTokenizer
imagetokenizer is a Python package that helps you encode visuals and generate visual token IDs from a codebook; supports both image and video…
☆34 Updated last year
aimagelab / LLaVA-MORE
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
☆140 Updated 2 months ago
tulip-berkeley / open_clip
An open source implementation of CLIP (With TULIP Support)
☆160 Updated 2 months ago
MILVLG / imp
A family of highly capable yet efficient large multimodal models
☆185 Updated 10 months ago
UCSC-VLAA / Recap-DataComp-1B
[ICML 2025] This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?"
☆136 Updated last year
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆159 Updated 3 months ago
kongds / E5-V
E5-V: Universal Embeddings with Multimodal Large Language Models
☆257 Updated 6 months ago
MBZUAI-LLM / web2code
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
☆85 Updated 8 months ago