LambdaLabsML / llama

Inference code for LLaMA models

☆42

Alternatives and similar repositories for llama

Users who are interested in llama compare it to the libraries listed below.

EmbeddedLLM / vllm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
☆87 · Updated last week
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆303 · Updated 2 years ago
mlc-ai / llm-perf-bench
☆120 · Updated last year
SJTU-IPADS / Bamboo
Bamboo-7B Large Language Model
☆93 · Updated last year
IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277 · Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆225 · Updated 8 months ago
yandex-research / swarm
Official code for "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient"
☆141 · Updated last year
GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆110 · Updated last year
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266 · Updated 9 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆198 · Updated last month
dust-tt / llama-ssp
Experiments on speculative sampling with Llama models
☆128 · Updated 2 years ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 · Updated last year
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆376 · Updated last year
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆76 · Updated last week
qwopqwop200 / gptqlora
GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ
☆103 · Updated 2 years ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆55 · Updated last year
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆354 · Updated 6 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆123 · Updated 8 months ago
abacusai / gh200-llm
Docker image for NVIDIA GH200 machines - optimized for vllm serving and hf trainer finetuning
☆47 · Updated 5 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆153 · Updated last year
deepspeedai / DeepSpeed-Kernels
☆74 · Updated 4 months ago
chu-tianxiang / llama-cpp-torch
llama.cpp to PyTorch Converter
☆34 · Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆134 · Updated 6 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 · Updated 11 months ago
Cornell-RelaxML / qtip
☆145 · Updated last month
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆698 · Updated 11 months ago
apple / ml-recurrent-drafter
☆215 · Updated 6 months ago
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆175 · Updated 4 months ago
jaymody / speculative-sampling
Simple implementation of Speculative Sampling in NumPy for GPT-2.
☆95 · Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆61 · Updated 9 months ago