roboflow / maestro
streamline the fine-tuning process for multimodal models: PaliGemma, Florence-2, and Qwen2-VL
⭐ 1,390 · Updated this week
Related projects
Alternatives and complementary repositories for maestro
- 👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code + Examples + Tutorials] ⭐ 577 · Updated 8 months ago
- Must-have resource for anyone who wants to experiment with and build on the OpenAI vision API 🔥 ⭐ 1,647 · Updated 8 months ago
- Set-of-Mark Prompting for GPT-4V and LMMs ⭐ 1,185 · Updated 3 months ago
- Recipes for shrinking, optimizing, customizing cutting edge vision models. ⭐ 890 · Updated 2 months ago
- 4M: Massively Multimodal Masked Modeling ⭐ 1,607 · Updated last month
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ⭐ 3,003 · Updated last month
- A fast, easy-to-use, production-ready inference server for computer vision supporting deployment of many popular model architectures and … ⭐ 1,370 · Updated this week
- Mixture-of-Experts for Large Vision-Language Models ⭐ 1,989 · Updated 6 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ⭐ 705 · Updated 9 months ago
- Images to inference with no labeling (use foundation models to train supervised models); see the sketch after this list. ⭐ 1,989 · Updated 2 weeks ago
- EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything ⭐ 2,160 · Updated 5 months ago
- This repository is a curated collection of the most exciting and influential CVPR 2024 papers. 🔥 [Paper + Code + Demo] ⭐ 664 · Updated 4 months ago
- VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and… ⭐ 1,999 · Updated 3 weeks ago
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones ⭐ 1,251 · Updated 7 months ago
- ICLR 2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Expert… ⭐ 1,255 · Updated this week
- Janus-Series: Unified Multimodal Understanding and Generation Models ⭐ 1,084 · Updated last week
- PyTorch code and models for V-JEPA self-supervised learning from video. ⭐ 2,673 · Updated 3 months ago
- Accelerate your Hugging Face Transformers 7.6-9x. Native to Hugging Face and PyTorch. ⭐ 687 · Updated 2 months ago
- LLaVA-Interactive-Demo ⭐ 352 · Updated 3 months ago
- Official implementation of the CVPR 2024 highlight paper: Matching Anything by Segmenting Anything ⭐ 1,004 · Updated 2 weeks ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ⭐ 1,397 · Updated this week
- A novel implementation of fusing ViT with Mamba into a fast, agile, and high-performance multimodal model. Powered by Zeta, the simplest… ⭐ 438 · Updated last week
- PyTorch implementation of "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs" ⭐ 526 · Updated 10 months ago
- Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2 ⭐ 1,141 · Updated 2 weeks ago
- The code used to train and run inference with the ColPali architecture. ⭐ 1,132 · Updated this week
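One pattern worth calling out from the list above is autolabeling ("images to inference with no labeling"): a large foundation model labels raw images, and a small supervised model is then trained on the result. Below is a minimal sketch following the pattern in the autodistill docs; the ontology contents, folder paths, and epoch count are assumptions to adapt.

```python
# Sketch of foundation-model autolabeling with autodistill (assumed setup:
# pip install autodistill autodistill-grounded-sam autodistill-yolov8).
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM
from autodistill_yolov8 import YOLOv8

# Map natural-language prompts (keys) to the class names (values) the
# generated dataset should carry.
ontology = CaptionOntology({"shipping container": "container"})

# Auto-label a folder of raw images with the foundation model...
base_model = GroundedSAM(ontology=ontology)
base_model.label(input_folder="./images", output_folder="./dataset")

# ...then distill the labels into a small, fast supervised model.
target_model = YOLOv8("yolov8n.pt")
target_model.train("./dataset/data.yaml", epochs=50)
```

The resulting YOLOv8 weights run orders of magnitude faster than the foundation model that produced the labels, which is the whole point of the distillation step.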