b4rtaz / distributed-llama
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
★1,690 · Updated this week
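The header claim is concrete enough to sketch: tensor parallelism shards each weight matrix across nodes, so every device stores and multiplies only its slice — which is what divides RAM usage and lets the matmuls run concurrently. Below is a minimal NumPy sketch of a column-wise shard. The names (`split_columns`, `tensor_parallel_matmul`) are hypothetical; this illustrates the general idea only and is not distributed-llama's actual C++ implementation.

```python
import numpy as np

def split_columns(W: np.ndarray, n_workers: int) -> list[np.ndarray]:
    """Shard a weight matrix column-wise so each worker holds 1/n of it."""
    return np.split(W, n_workers, axis=1)

def tensor_parallel_matmul(x: np.ndarray, shards: list[np.ndarray]) -> np.ndarray:
    """Each worker multiplies its own shard; a gather step concatenates the partials."""
    partials = [x @ W_i for W_i in shards]    # in a real cluster, one matmul per device
    return np.concatenate(partials, axis=-1)  # the gather over the network

# Toy check: the sharded result matches the single-device matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))
W = rng.standard_normal((512, 2048))
shards = split_columns(W, n_workers=4)  # each node stores 512x512 instead of 512x2048
assert np.allclose(x @ W, tensor_parallel_matmul(x, shards))
```

The payoff is the RAM split (each of the four workers above stores a quarter of `W`); the cost is gathering the partial outputs over the network at every layer.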
Alternatives and similar repositories for distributed-llama:
Users interested in distributed-llama are comparing it to the libraries listed below.
- A fast inference library for running LLMs locally on modern consumer-class GPUs ★3,944 · Updated this week
- Large-scale LLM inference engine ★1,288 · Updated this week
- Stateful load balancer custom-tailored for llama.cpp ★704 · Updated 3 weeks ago
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ★511 · Updated this week
- Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs ★2,355 · Updated this week
- VS Code extension for LLM-assisted code/text completion ★509 · Updated this week
- An OAI-compatible exllamav2 API that's both lightweight and fast ★778 · Updated this week
- Chat language model that can use tools and interpret the results ★1,513 · Updated this week
- Blazingly fast LLM inference. ★4,977 · Updated this week
- Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization ★1,220 · Updated 2 months ago
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… ★2,160 · Updated 4 months ago
- Distributed Training Over-The-Internet ★878 · Updated 2 months ago
- Distributed LLM and StableDiffusion inference for mobile, desktop and server. ★2,763 · Updated 3 months ago
- Llama-3 agents that can browse the web by following instructions and talking to you ★1,387 · Updated 2 months ago
- Tools for merging pretrained large language models. ★5,247 · Updated this week
- The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM … ★527 · Updated 2 months ago
- NVIDIA Linux open GPU with P2P support ★1,017 · Updated last month
- AlwaysReddy is an LLM voice assistant that is always just a hotkey away. ★713 · Updated 2 weeks ago
- An application for running LLMs locally on your device, with your documents, facilitating detailed citations in generated responses. ★537 · Updated 3 months ago
- INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model ★1,468 · Updated 3 weeks ago
- A RAG LLM co-pilot for browsing the web, powered by local LLMs ★1,475 · Updated 2 weeks ago
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. ★2,822 · Updated last year
- Python bindings for the Transformer models implemented in C/C++ using the GGML library. ★1,838 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ★747 · Updated this week
- SCUDA is a GPU-over-IP bridge allowing GPUs on remote machines to be attached to CPU-only machines. ★1,637 · Updated this week
- This repo contains the source code for RULER: What's the Real Context Size of Your Long-Context Language Models? ★918 · Updated 2 weeks ago
- ★801 · Updated 5 months ago
- Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens ★406 · Updated 8 months ago
- The easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge ★1,245 · Updated this week
- Local AI API Platform ★2,451 · Updated this week