Lizonghang / prima.cpp
prima.cpp: Speeding up 70B-scale LLM inference on low-resource everyday home clusters
☆260 · Updated this week
Alternatives and similar repositories for prima.cpp:
Users interested in prima.cpp are comparing it to the repositories listed below.
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices ☆176 · Updated 5 months ago
- CPU inference for the DeepSeek family of large language models in C++ ☆288 · Updated this week
- llama3.cuda is a pure C/CUDA implementation for the Llama 3 model. ☆331 · Updated 10 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆292 · Updated this week
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆282 · Updated 3 months ago
- Efficient LLM Inference over Long Sequences ☆368 · Updated this week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆626 · Updated 3 weeks ago
- prime is a framework for efficient, globally distributed training of AI models over the internet. ☆701 · Updated this week
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆573 · Updated this week
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆486 · Updated 3 months ago
- Advanced quantization algorithm for LLMs/VLMs ☆431 · Updated this week
- ☆85 · Updated last month
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆887 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆352 · Updated 7 months ago
- Big & small LLMs working together ☆708 · Updated this week
- Autonomously train research-agent LLMs on custom data using reinforcement learning and self-verification ☆595 · Updated 3 weeks ago
- ☆207 · Updated 2 months ago
- Gemma 2 optimized for your local machine ☆367 · Updated 8 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆794 · Updated 7 months ago
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention, which r… ☆971 · Updated this week
- Awesome Mobile LLMs ☆166 · Updated 3 weeks ago
- Implementing DeepSeek R1's GRPO algorithm from scratch ☆445 · Updated this week
- A simple tool that lets you explore different possible paths that an LLM might sample ☆161 · Updated last week
- LLM inference on consumer devices ☆105 · Updated last month
- Serverless LLM serving for everyone ☆458 · Updated this week
- Run LLMs with MLX ☆421 · Updated this week
- ☆140 · Updated 2 months ago
- Low-bit LLM inference on CPU with lookup table ☆720 · Updated 3 months ago
- Fast parallel LLM inference for MLX ☆181 · Updated 9 months ago
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,746 · Updated 2 weeks ago