b4rtaz / distributed-llama
Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.
☆2,028 · Updated this week
Alternatives and similar repositories for distributed-llama:
Users interested in distributed-llama are comparing it to the libraries listed below:
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,125 · Updated this week
- Stateful load balancer custom-tailored for llama.cpp ☆742 · Updated 2 weeks ago
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆573 · Updated this week
- Blazingly fast LLM inference ☆5,437 · Updated this week
- Local AI API platform ☆2,622 · Updated last week
- Large-scale LLM inference engine ☆1,384 · Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ☆2,955 · Updated this week
- AirLLM: 70B inference with a single 4GB GPU ☆5,758 · Updated 4 months ago
- Tools for merging pretrained large language models ☆5,571 · Updated this week
- Local realtime voice AI ☆2,279 · Updated last month
- Go ahead and axolotl questions ☆9,137 · Updated this week
- Distributed LLM and Stable Diffusion inference for mobile, desktop and server ☆2,838 · Updated 6 months ago
- Training LLMs with QLoRA + FSDP ☆1,470 · Updated 5 months ago
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs), allowing users to chat with LLM … ☆551 · Updated 2 months ago
- Distributed training over the internet ☆901 · Updated 4 months ago
- On-device AI across mobile, embedded and edge for PyTorch ☆2,747 · Updated this week
- Llama and other large language models on iOS and macOS, offline, using the GGML library ☆1,734 · Updated last month
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality ☆3,817 · Updated 8 months ago
- LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve spee… ☆2,889 · Updated 5 months ago
- A more memory-efficient rewrite of the HF Transformers implementation of Llama for use with quantized weights ☆2,863 · Updated last year
- NVIDIA Linux open GPU with P2P support ☆1,094 · Updated 4 months ago
- ⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ☆2,169 · Updated 6 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,237 · Updated this week
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024) ☆1,288 · Updated this week
- Calculate tokens/s & GPU memory requirements for any LLM; supports llama.cpp/GGML/bnb/QLoRA quantization (see the memory-estimate sketch after this list) ☆1,285 · Updated 4 months ago
- An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm
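
The GPU-memory calculator in the list above performs back-of-envelope arithmetic of roughly the following shape. This is a minimal sketch of that arithmetic, not the tool's actual implementation: the function name, the parameter defaults, and the 10% overhead factor are all illustrative assumptions.

```python
# Rough VRAM estimate for LLM inference: quantized weights plus KV cache.
# All numbers here are illustrative assumptions, not any tool's exact formula.

def estimate_vram_gb(
    n_params_b: float,       # parameter count in billions (e.g. 7 for a 7B model)
    bits_per_weight: float,  # 16 = fp16, 8 = int8, ~4.5 = Q4_K_M, 4 = NF4 (QLoRA)
    n_layers: int,           # transformer layer count
    hidden_dim: int,         # model (embedding) dimension
    ctx_len: int,            # context length in tokens
    kv_bits: float = 16.0,   # KV-cache precision
    overhead: float = 1.10,  # ~10% for activations/buffers (assumption)
) -> float:
    # Weights: one value per parameter at the quantized bit width.
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: K and V (hence the factor 2), one hidden_dim vector
    # per layer per token of context.
    kv_gb = 2 * n_layers * hidden_dim * ctx_len * kv_bits / 8 / 1e9
    return (weights_gb + kv_gb) * overhead

# Example: a Llama-2-7B-shaped model (32 layers, hidden 4096),
# 4.5-bit weights, 4k context.
print(f"{estimate_vram_gb(7, 4.5, 32, 4096, 4096):.1f} GB")  # ≈ 6.7 GB
```

Note that this sketch assumes standard multi-head attention; models using grouped-query attention keep a smaller KV cache (n_kv_heads × head_dim per layer rather than the full hidden_dim), which real calculators account for.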