LLM Inference on consumer devices
☆130 · Mar 17, 2025 · Updated 11 months ago
Alternatives and similar repositories for UMbreLLa
Users that are interested in UMbreLLa are comparing it to the libraries listed below
- ☆66 · Nov 4, 2024 · Updated last year
- FlexLLama - Lightweight self-hosted tool for running multiple llama.cpp server instances with OpenAI v1 API compatibility and multi-GP… ☆50 · Feb 17, 2026 · Updated last week
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Feb 9, 2026 · Updated 2 weeks ago
- Adding a multi-text multi-speaker script (diffe) that is based on a script from asiff00 on issue 61 for Sesame: A Conversational Speech G… ☆26 · Mar 28, 2025 · Updated 11 months ago
- ☆24 · Jan 22, 2025 · Updated last year
- Simple python library for generating your own perfetto traces for your application. Can be used for both app instrumentation and custom … ☆25 · Jun 22, 2025 · Updated 8 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆147 · Dec 23, 2025 · Updated 2 months ago
- A local-first LLM development studio. Build, test, and customize inference workflows with your own models: no cloud, totally local. ☆17 · May 21, 2025 · Updated 9 months ago
- ☆17 · Dec 16, 2024 · Updated last year
- scalable and robust tree-based speculative decoding algorithm ☆370 · Jan 28, 2025 · Updated last year
- AI management tool ☆121 · Nov 9, 2024 · Updated last year
- A forward proxy to turn network traffic into personal memory for AI agents ☆36 · Updated this week
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 · Jun 4, 2025 · Updated 8 months ago
- Lightweight Python Wrapper for OpenVINO, enabling LLM inference on NPUs ☆27 · Dec 17, 2024 · Updated last year
- A TTS model capable of generating ultra-realistic dialogue in one pass. ☆31 · May 1, 2025 · Updated 10 months ago
- A cross platform App that gives you the best UX to run models locally or remotely on your own hardware ☆73 · Dec 22, 2025 · Updated 2 months ago
- ☆20 · Nov 26, 2025 · Updated 3 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆285 · Oct 19, 2025 · Updated 4 months ago
- Super simple python connectors for llama.cpp, including vision models (Gemma 3, Qwen2-VL). Compile llama.cpp and run! ☆29 · Dec 11, 2025 · Updated 2 months ago
- Your personal ArXiv Feed ☆23 · Dec 18, 2024 · Updated last year
- Cleanai (https://github.com/willmil11/cleanai) except I'm making it in C now. Fast and clean from the start this time :) ☆17 · Feb 5, 2026 · Updated 3 weeks ago
- A local RAG pipeline that passed a Japanese corporate exam ☆24 · May 7, 2025 · Updated 9 months ago
- a simple API to use CUPTI ☆11 · Aug 19, 2025 · Updated 6 months ago
- Yet another frontend for LLM, written using .NET and WinUI 3 ☆10 · Sep 14, 2025 · Updated 5 months ago
- ☆15 · Apr 9, 2025 · Updated 10 months ago
- FlexAudioPrint is a Python-based app for transcribing audio to text using OpenAI's Whisper model. It offers a Gradio web interface and a … ☆10 · Jan 29, 2026 · Updated last month
- Create text chunks which end at natural stopping points without using a tokenizer ☆26 · Nov 26, 2025 · Updated 3 months ago
- Fully autonomous AI development agent with stateful memory, live web access, pseudo self-improvement and more! ☆66 · Feb 9, 2026 · Updated 2 weeks ago
- ☆27 · Jun 11, 2025 · Updated 8 months ago
- Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS) ☆54 · Mar 14, 2025 · Updated 11 months ago
- AI debugger and AI coder integrated. Use AI to code and drive the runtime debugger ☆83 · Nov 25, 2025 · Updated 3 months ago
- Glyphs, acting as collaboratively defined symbols linking related concepts, add a layer of multidimensional semantic richness to user-AI … ☆56 · Feb 10, 2025 · Updated last year
- Scripts and tools for optimizing quantizations in llama.cpp with GGUF imatrices. ☆18 · Jan 10, 2025 · Updated last year
- An interface that features almost zero external dependencies beyond the Ollama API itself, making it lightweight and portable to easily i… ☆12 · Mar 25, 2025 · Updated 11 months ago
- This is the official project for our paper: Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers ☆31 · Jan 13, 2024 · Updated 2 years ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆32 · Nov 16, 2024 · Updated last year
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆327 · Nov 26, 2025 · Updated 3 months ago
- Open source tool for transcription and subtitling, alternative to happyscribe. ☆33 · Feb 12, 2025 · Updated last year
- A lightweight LLaMA.cpp HTTP server Docker image based on Alpine Linux. ☆29 · Oct 3, 2025 · Updated 4 months ago