sieve-community / fast-asd
an optimized, production-ready implementation of active speaker detection
☆59 Updated 9 months ago
Alternatives and similar repositories for fast-asd:
Users who are interested in fast-asd are comparing it to the repositories listed below.
- EdgeSAM model for use with Autodistill. ☆26 Updated 8 months ago
- Demo python script app to interact with llama.cpp server using whisper API, microphone and webcam devices. ☆46 Updated last year
- Unofficial implementation and experiments related to Set-of-Mark (SoM) ☆84 Updated last year
- Use Florence 2 to auto-label data for use in training fine-tuned object detection models. ☆62 Updated 6 months ago
- ☆14 Updated last year
- Provide Gradio custom components to make the diarization-based audio labeling process easier and faster. ☆60 Updated last week
- Efficient approach to speaker diarization using voice characteristics extraction ☆90 Updated 10 months ago
- ☆36 Updated last year
- VoiceRestore: Flow-Matching Transformers for Universal Speech Restoration ☆149 Updated last month
- ClickDiffusion: Harnessing LLMs for Interactive Precise Image Editing ☆67 Updated 9 months ago
- A project that optimizes Whisper for low latency inference using NVIDIA TensorRT ☆73 Updated 4 months ago
- ☆253 Updated 11 months ago
- Incredibly descriptive audiovisual summaries for videos ☆40 Updated 7 months ago
- repo for active speaker detection for media videos. ☆25 Updated last year
- Video+code lecture on building nanoGPT from scratch ☆65 Updated 8 months ago
- A real-time video caption to conversation bot that captures frames, generates captions, and creates conversational responses using a Large… ☆123 Updated last year
- Accurately locating each head's position in the crowd scenes is a crucial task in the field of crowd analysis. However, traditional densi… ☆21 Updated 11 months ago
- ☆69 Updated 5 months ago
- Use the Moondream 2 model to detect faces and their gaze directions in videos. ☆39 Updated last month
- ☆11 Updated 3 years ago
- Implementation of VisionLLaMA from the paper: "VisionLLaMA: A Unified LLaMA Interface for Vision Tasks" in PyTorch and Zeta ☆16 Updated 3 months ago
- Use Grounding DINO, Segment Anything, and GPT-4V to label images with segmentation masks for use in training smaller, fine-tuned models. ☆66 Updated last year
- ☆62 Updated 7 months ago
- The open source implementation of "NeVA: NeMo Vision and Language Assistant" ☆18 Updated last year
- A high-throughput and memory-efficient inference and serving engine for Whisper, https://mesolitica.com/blog/vllm-whisper ☆24 Updated 7 months ago
- VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 V… ☆103 Updated 5 months ago
- Create topological graph for image segments. ☆21 Updated 5 months ago
- Use Segment Anything 2, grounded with Florence-2, to auto-label data for use in training vision models. ☆116 Updated 7 months ago
- Our idea is to combine the power of computer vision model and LLMs. We use YOLO, CLIP and DINOv2 to extract high-level features from imag… ☆110 Updated last year