likejazz / llama3.cuda
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
☆305 · Updated 5 months ago
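As context for what a pure C/CUDA port involves, here is a minimal sketch of one building block every Llama implementation needs: an RMSNorm kernel, the normalization used throughout Llama-family models. It is illustrative only, under assumed names and sizes, and is not code taken from llama3.cuda:

```cuda
// Minimal RMSNorm sketch: one block normalizes one vector of length `dim`.
// Hypothetical example; llama3.cuda's actual kernels may differ.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__global__ void rmsnorm_kernel(float *out, const float *x,
                               const float *weight, int dim) {
    extern __shared__ float partial[];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the sum of squares.
    float sum = 0.0f;
    for (int i = tid; i < dim; i += blockDim.x)
        sum += x[i] * x[i];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Inverse RMS with a small epsilon, as in Llama's RMSNorm.
    float inv_rms = rsqrtf(partial[0] / dim + 1e-5f);

    // Scale each element by the learned weight and the inverse RMS.
    for (int i = tid; i < dim; i += blockDim.x)
        out[i] = weight[i] * (x[i] * inv_rms);
}

int main(void) {
    const int dim = 4096;  // a typical Llama hidden size
    float *x, *w, *out;
    cudaMallocManaged(&x, dim * sizeof(float));
    cudaMallocManaged(&w, dim * sizeof(float));
    cudaMallocManaged(&out, dim * sizeof(float));
    for (int i = 0; i < dim; i++) { x[i] = 1.0f; w[i] = 1.0f; }

    const int threads = 256;
    rmsnorm_kernel<<<1, threads, threads * sizeof(float)>>>(out, x, w, dim);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // ~1.0 for an all-ones input

    cudaFree(x); cudaFree(w); cudaFree(out);
    return 0;
}
```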
Related projects
Alternatives and complementary repositories for llama3.cuda
- Official implementation of Half-Quadratic Quantization (HQQ) ☆698 · Updated last week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆494 · Updated this week
- Step-by-step explanation/tutorial of llama2.c ☆210 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆171 · Updated 3 months ago
- llama3.np is a pure NumPy implementation of the Llama 3 model. ☆973 · Updated 5 months ago
- Easy and Efficient Quantization for Transformers ☆178 · Updated 3 months ago
- Python bindings for ggml ☆132 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 weeks ago
- ☆501 · Updated last week
- Fast parallel LLM inference for MLX ☆146 · Updated 4 months ago
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆126 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated 3 weeks ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆194 · Updated 6 months ago
- Inference of Mamba models in pure C ☆177 · Updated 8 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆661 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- Scalable and robust tree-based speculative decoding algorithm ☆313 · Updated 2 months ago
- An Open Source Toolkit For LLM Distillation ☆350 · Updated last month
- llama.cpp fork with additional SOTA quants and improved performance ☆86 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆346 · Updated 8 months ago
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free ☆219 · Updated last week
- Comparison of Language Model Inference Engines ☆189 · Updated 2 months ago
- 1.58-bit LLaMa model ☆79 · Updated 7 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (see the sketch after this list) ☆611 · Updated 2 months ago
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang ☆118 · Updated this week
- This is our own implementation of 'Layer Selective Rank Reduction' ☆231 · Updated 5 months ago
- A bagel, with everything. ☆312 · Updated 6 months ago
- A collection of all available inference solutions for LLMs ☆72 · Updated last month
- A fast batching API to serve LLM models ☆172 · Updated 6 months ago
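The FP16xINT4 kernel listed above rests on one core idea: weights are stored as packed 4-bit integers and dequantized on the fly inside the matrix multiply, so memory traffic shrinks roughly 4x while activations stay in FP16. Below is a simplified, hypothetical GEMV sketch of that idea; production kernels add tiling, tensor-core use, and asynchronous copies, all omitted here:

```cuda
// Hypothetical FP16xINT4 GEMV: each thread computes one output row,
// unpacking eight 4-bit weights per 32-bit word and dequantizing with a
// per-row scale. Illustrative only, not any particular library's kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void gemv_w4a16(half *y, const half *x, const unsigned int *wq,
                           const half *scales, int in_dim, int out_dim) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= out_dim) return;

    float acc = 0.0f;
    float scale = __half2float(scales[row]);  // one scale per output row
    int packed_per_row = in_dim / 8;          // 8 nibbles per 32-bit word

    for (int p = 0; p < packed_per_row; p++) {
        unsigned int q = wq[row * packed_per_row + p];
        for (int k = 0; k < 8; k++) {
            int nib = (q >> (4 * k)) & 0xF;        // unpack one 4-bit weight
            float w = (float)(nib - 8) * scale;    // symmetric dequantization
            acc += w * __half2float(x[p * 8 + k]); // FP16 activation
        }
    }
    y[row] = __float2half(acc);
}

int main(void) {
    const int in_dim = 8, out_dim = 2;
    half *x, *y, *scales;
    unsigned int *wq;
    cudaMallocManaged(&x, in_dim * sizeof(half));
    cudaMallocManaged(&y, out_dim * sizeof(half));
    cudaMallocManaged(&scales, out_dim * sizeof(half));
    cudaMallocManaged(&wq, out_dim * (in_dim / 8) * sizeof(unsigned int));

    for (int i = 0; i < in_dim; i++) x[i] = __float2half(1.0f);
    for (int r = 0; r < out_dim; r++) {
        scales[r] = __float2half(0.5f);
        wq[r] = 0x99999999u;  // every nibble 9 -> weight (9-8)*0.5 = 0.5
    }

    gemv_w4a16<<<1, 32>>>(y, x, wq, scales, in_dim, out_dim);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", __half2float(y[0]));  // expect 8 * 1.0 * 0.5 = 4.0
    return 0;
}
```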