sovit-123 / SAM_Molmo_Whisper
An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.
☆12 · Updated 3 weeks ago
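The description implies a three-stage pipeline: Whisper turns speech into a text query, Molmo grounds that query as point coordinates on the image, and SAM turns the points into a segmentation mask. Below is a minimal sketch of that flow, assuming Hugging Face checkpoints for Whisper and Molmo and the official `segment_anything` predictor; the model IDs, file paths, prompt wording, and point-tag parsing are illustrative assumptions, not this repository's actual code (the repo may, for instance, use SAM 2 instead).

```python
import re
import numpy as np
from PIL import Image
from transformers import pipeline, AutoModelForCausalLM, AutoProcessor, GenerationConfig
from segment_anything import sam_model_registry, SamPredictor

# 1. Voice -> text with Whisper (checkpoint choice is an assumption).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
query = asr("command.wav")["text"]  # e.g. "the dog on the left"

# 2. Text + image -> point(s) with Molmo, which answers pointing prompts
# with <point x="..." y="..."> tags, coordinates as percentages.
processor = AutoProcessor.from_pretrained("allenai/Molmo-7B-D-0924", trust_remote_code=True)
molmo = AutoModelForCausalLM.from_pretrained("allenai/Molmo-7B-D-0924", trust_remote_code=True)
image = Image.open("scene.jpg").convert("RGB")
inputs = processor.process(images=[image], text=f"Point to {query}.")
inputs = {k: v.to(molmo.device).unsqueeze(0) for k, v in inputs.items()}
out = molmo.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=128, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    out[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)

# 3. Point(s) -> mask with SAM, using the points as foreground prompts.
# Note: Molmo uses a different tag layout (x1=, y1=, ...) for multi-point
# answers; this regex only covers the single-point case.
w, h = image.size
points = np.array(
    [(float(x) / 100 * w, float(y) / 100 * h)
     for x, y in re.findall(r'x="([\d.]+)" y="([\d.]+)"', answer)]
)
assert len(points) > 0, "Molmo returned no point tags for this query"
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(
    point_coords=points, point_labels=np.ones(len(points), dtype=int)
)
```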
Alternatives and similar repositories for SAM_Molmo_Whisper:
Users interested in SAM_Molmo_Whisper are comparing it to the repositories listed below:
- Use Florence 2 to auto-label data for use in training fine-tuned object detection models. ☆60 · Updated 4 months ago
- EdgeSAM model for use with Autodistill. ☆26 · Updated 6 months ago
- Use Grounding DINO, Segment Anything, and GPT-4V to label images with segmentation masks for use in training smaller, fine-tuned models. ☆65 · Updated last year
- Real-time, YOLO-like object detection using the Florence-2-base-ft model with a user-friendly GUI. ☆15 · Updated last week
- OcSort-Pip: Packaged version of the OcSort repository ☆14 · Updated 2 years ago
- OLA-VLM: Elevating Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆44 · Updated 3 weeks ago
- A simple demo for using Grounding DINO and Segment Anything v2 models together ☆19 · Updated 5 months ago
- Notebook and scripts that showcase running quantized diffusion models on consumer GPUs ☆37 · Updated 2 months ago
- Python scripts performing optical flow estimation using the NeuFlowV2 model in ONNX. ☆38 · Updated 3 months ago
- Simple CogVLM client script ☆14 · Updated last year
- Lightweight models for real-time semantic segmentation in PyTorch (including SQNet, LinkNet, SegNet, UNet, ENet, ERFNet, EDANet, ESPNet, ESP… ☆11 · Updated last year
- A list of language models with permissive licenses such as MIT or Apache 2.0 ☆24 · Updated 2 months ago
- ☆29 · Updated last month
- Use Grounding DINO, Segment Anything, and CLIP to label objects in images. ☆23 · Updated last year
- This repository demonstrates various examples using YOLO ☆13 · Updated 11 months ago
- Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data ☆21 · Updated 5 months ago
- Vehicle speed estimation using YOLOv8 ☆30 · Updated 9 months ago
- ☆23 · Updated 2 months ago
- Chat with Qwen2-VL. Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud. ☆9 · Updated 3 months ago
- ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing ☆66 · Updated 7 months ago
- ☆30 · Updated last year
- Official code repository for the paper: "ExPLoRA: Parameter-Efficient Extended Pre-training to Adapt Vision Transformers under Domain Shifts" ☆28 · Updated 3 months ago
- Coding an LLM and its building blocks from scratch. ☆15 · Updated 3 weeks ago
- Testing and evaluating the capabilities of Vision-Language models (PaliGemma) in performing computer vision tasks such as object detectio… ☆79 · Updated 7 months ago
- Code repository for the blog "How to Productionize Large Language Models (LLMs)" ☆11 · Updated 9 months ago
- Fine-tuning the multimodal LLM "Idefics 9B" on the Pokemon Go dataset available on Hugging Face. ☆19 · Updated 11 months ago
- Code and pretrained models for the paper: "MatMamba: A Matryoshka State Space Model" ☆54 · Updated last month
- GPT-4V(ision) module for use with Autodistill. ☆26 · Updated 5 months ago
- Chat with Phi 3.5/3 Vision LLMs. Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets which includ… ☆32 · Updated last week
- Evaluate the performance of computer vision models and prompts for zero-shot models (Grounding DINO, CLIP, BLIP, DINOv2, ImageBind, model… ☆34 · Updated last year