b4rtaz / distributed-llama
Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.
★2,774 · Updated last month
Alternatives and similar repositories for distributed-llama
Users interested in distributed-llama are comparing it to the libraries listed below.
- Open-source LLM load balancer and serving platform for self-hosting LLMs at scale ★1,414 · Updated this week
- WebAssembly binding for llama.cpp - Enabling in-browser LLM inference ★965 · Updated 2 weeks ago
- A fast inference library for running LLMs locally on modern consumer-class GPUs ★4,399 · Updated 3 weeks ago
- Large-scale LLM inference engine ★1,610 · Updated last month
- Local AI API Platform ★2,763 · Updated 6 months ago
- llama.cpp fork with additional SOTA quants and improved performance ★1,407 · Updated this week
- Reliable model swapping for any local OpenAI/Anthropic-compatible server - llama.cpp, vllm, etc. ★2,123 · Updated this week
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ★758 · Updated last week
- Distributed LLM and StableDiffusion inference for mobile, desktop and server. ★2,905 · Updated last year
- VS Code extension for LLM-assisted code/text completion ★1,116 · Updated this week
- MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. ★1,976 · Updated this week
- NVIDIA Linux open GPU with P2P support ★1,305 · Updated 6 months ago
- Llama 2 Everywhere (L2E) ★1,525 · Updated 4 months ago
- Lightweight inference library for ONNX files, written in C++. It can run Stable Diffusion XL 1.0 on a RPI Zero 2 (or in 298MB of RAM) but… ★2,015 · Updated last month
- TinyChatEngine: On-Device LLM Inference Library ★935 · Updated last year
- Text-To-Speech, RAG, and LLMs. All local! ★1,848 · Updated last year
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM… ★610 · Updated 10 months ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ★2,170 · Updated last year
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. ★2,905 · Updated 2 years ago
- SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines. ★1,792 · Updated 6 months ago
- Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? ★1,862 · Updated last year
- Blazingly fast LLM inference. ★6,318 · Updated this week
- Local realtime voice AI ★2,411 · Updated last month
- Distributed Training Over-The-Internet ★973 · Updated 2 months ago
- The official API server for Exllama. OAI-compatible, lightweight, and fast (see the example below this list). ★1,103 · Updated 2 weeks ago
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality ★4,502 · Updated last year
- Big & Small LLMs working together ★1,234 · Updated this week
- Python bindings for the Transformer models implemented in C/C++ using the GGML library. ★1,877 · Updated last year
- Optimizing inference proxy for LLMs ★3,252 · Updated last week
- Simple Go utility to download HuggingFace Models and Datasets ★798 · Updated this week
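Several entries above advertise the same OpenAI-compatible HTTP protocol (the Exllama API server, llama-swap, and the llama.cpp-style servers among them), so one local client can target any of them. Below is a minimal sketch using only the Python standard library; the host, port, and model name are assumptions to replace with whatever your server actually exposes.

```python
import json
import urllib.request

# Assumed local endpoint: OpenAI-compatible servers conventionally serve
# chat completions at /v1/chat/completions. Host and port are placeholders.
BASE_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "local-model",  # hypothetical name; servers list theirs at /v1/models
    "messages": [
        {"role": "user", "content": "Summarize tensor parallelism in one sentence."}
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    BASE_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Send the request and parse the JSON response body.
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-compatible servers return the reply under choices[0].message.content.
print(body["choices"][0]["message"]["content"])
```

With a model-swapping proxy such as llama-swap in front, the requested model name is also what drives the swap: the proxy starts or switches to the backend configured for that name.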