likejazz / llama3.cuda
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
☆305 · Updated 5 months ago
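As context for what a pure C/CUDA port involves, here is a minimal sketch of one building block every Llama implementation needs: an RMSNorm kernel, the normalization used throughout Llama-family models. It is illustrative only, under assumed names and sizes, and is not code taken from llama3.cuda:

```cuda
// Minimal RMSNorm sketch: one block normalizes one vector of length `dim`.
// Hypothetical example; llama3.cuda's actual kernels may differ.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

__global__ void rmsnorm_kernel(float *out, const float *x,
                               const float *weight, int dim) {
    extern __shared__ float partial[];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the sum of squares.
    float sum = 0.0f;
    for (int i = tid; i < dim; i += blockDim.x)
        sum += x[i] * x[i];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Inverse RMS with a small epsilon, as in Llama's RMSNorm.
    float inv_rms = rsqrtf(partial[0] / dim + 1e-5f);

    // Scale each element by the learned weight and the inverse RMS.
    for (int i = tid; i < dim; i += blockDim.x)
        out[i] = weight[i] * (x[i] * inv_rms);
}

int main(void) {
    const int dim = 4096;  // a typical Llama hidden size
    float *x, *w, *out;
    cudaMallocManaged(&x, dim * sizeof(float));
    cudaMallocManaged(&w, dim * sizeof(float));
    cudaMallocManaged(&out, dim * sizeof(float));
    for (int i = 0; i < dim; i++) { x[i] = 1.0f; w[i] = 1.0f; }

    const int threads = 256;
    rmsnorm_kernel<<<1, threads, threads * sizeof(float)>>>(out, x, w, dim);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // ~1.0 for an all-ones input

    cudaFree(x); cudaFree(w); cudaFree(out);
    return 0;
}
```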
Related projects
Alternatives and complementary repositories for llama3.cuda
- Official implementation of Half-Quadratic Quantization (HQQ) ☆698 · Updated last week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆494 · Updated this week
- Step-by-step explanation/tutorial of llama2.c ☆210 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆171 · Updated 3 months ago
- llama3.np is a pure NumPy implementation of the Llama 3 model. ☆973 · Updated 5 months ago
- Easy and Efficient Quantization for Transformers ☆178 · Updated 3 months ago
- Python bindings for ggml ☆132 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 3 weeks ago
- ☆501 · Updated last week
- Fast parallel LLM inference for MLX ☆146 · Updated 4 months ago
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆126 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated 3 weeks ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆194 · Updated 6 months ago
- Inference of Mamba models in pure C ☆177 · Updated 8 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆661 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- Scalable and robust tree-based speculative decoding algorithm ☆313 · Updated 2 months ago
- An Open Source Toolkit For LLM Distillation ☆350 · Updated last month
- llama.cpp fork with additional SOTA quants and improved performance ☆86 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆346 · Updated 8 months ago
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free ☆219 · Updated last week
- Comparison of Language Model Inference Engines ☆189 · Updated 2 months ago
- 1.58-bit LLaMa model ☆79 · Updated 7 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (see the sketch after this list) ☆611 · Updated 2 months ago
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang ☆118 · Updated this week
- This is our own implementation of 'Layer Selective Rank Reduction' ☆231 · Updated 5 months ago
- A bagel, with everything. ☆312 · Updated 6 months ago
- A collection of all available inference solutions for LLMs ☆72 · Updated last month
- A fast batching API to serve LLM models ☆172 · Updated 6 months ago
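The FP16xINT4 kernel listed above rests on one core idea: weights are stored as packed 4-bit integers and dequantized on the fly inside the matrix multiply, so memory traffic shrinks roughly 4x while activations stay in FP16. Below is a simplified, hypothetical GEMV sketch of that idea; production kernels add tiling, tensor-core use, and asynchronous copies, all omitted here:

```cuda
// Hypothetical FP16xINT4 GEMV: each thread computes one output row,
// unpacking eight 4-bit weights per 32-bit word and dequantizing with a
// per-row scale. Illustrative only, not any particular library's kernel.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void gemv_w4a16(half *y, const half *x, const unsigned int *wq,
                           const half *scales, int in_dim, int out_dim) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= out_dim) return;

    float acc = 0.0f;
    float scale = __half2float(scales[row]);  // one scale per output row
    int packed_per_row = in_dim / 8;          // 8 nibbles per 32-bit word

    for (int p = 0; p < packed_per_row; p++) {
        unsigned int q = wq[row * packed_per_row + p];
        for (int k = 0; k < 8; k++) {
            int nib = (q >> (4 * k)) & 0xF;        // unpack one 4-bit weight
            float w = (float)(nib - 8) * scale;    // symmetric dequantization
            acc += w * __half2float(x[p * 8 + k]); // FP16 activation
        }
    }
    y[row] = __float2half(acc);
}

int main(void) {
    const int in_dim = 8, out_dim = 2;
    half *x, *y, *scales;
    unsigned int *wq;
    cudaMallocManaged(&x, in_dim * sizeof(half));
    cudaMallocManaged(&y, out_dim * sizeof(half));
    cudaMallocManaged(&scales, out_dim * sizeof(half));
    cudaMallocManaged(&wq, out_dim * (in_dim / 8) * sizeof(unsigned int));

    for (int i = 0; i < in_dim; i++) x[i] = __float2half(1.0f);
    for (int r = 0; r < out_dim; r++) {
        scales[r] = __float2half(0.5f);
        wq[r] = 0x99999999u;  // every nibble 9 -> weight (9-8)*0.5 = 0.5
    }

    gemv_w4a16<<<1, 32>>>(y, x, wq, scales, in_dim, out_dim);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", __half2float(y[0]));  // expect 8 * 1.0 * 0.5 = 4.0
    return 0;
}
```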