leloykun / llama2.cpp
Inference Llama 2 in one file of pure C++
☆79 · Updated last year
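As a rough illustration of what "one file of pure C++" inference involves (this is not code from the repository, just a generic sketch in the style of llama2.c and its C++ ports): RMSNorm is the normalization Llama 2 applies before every attention and feed-forward block, and a dependency-free engine implements kernels like it by hand.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Generic sketch, not from this repo: y = x / rms(x) * weight,
// the RMSNorm used before each Llama 2 attention/FFN block.
void rmsnorm(std::vector<float>& y, const std::vector<float>& x,
             const std::vector<float>& weight, float eps = 1e-5f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;           // sum of squares
    const float inv_rms = 1.0f / std::sqrt(ss / x.size() + eps);
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * weight[i];   // scale by learned weight
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f}, w(4, 1.0f), y;
    rmsnorm(y, x, w);
    std::printf("%.4f %.4f %.4f %.4f\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```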
Related projects
Alternatives and complementary repositories for llama2.cpp
- LLM training in simple, raw C/CUDA ☆86 · Updated 6 months ago
- A collection of all available inference solutions for LLMs ☆72 · Updated last month
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆89 · Updated this week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" (see the ternary-quantization sketch after this list) ☆154 · Updated 3 weeks ago
- GPT2 implementation in C++ using Ort ☆24 · Updated 3 years ago
- Training and fine-tuning an LLM in Python and PyTorch ☆41 · Updated last year
- Inference of Mamba models in pure C ☆177 · Updated 8 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging ☆29 · Updated 6 months ago
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model ☆307 · Updated 5 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆89 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to onnx/onnx-runtime ☆148 · Updated last month
- Inference Vision Transformer (ViT) in plain C/C++ with ggml ☆229 · Updated 7 months ago
- Inference Llama 2 in C++ ☆45 · Updated 6 months ago
- Python bindings for ggml ☆132 · Updated 2 months ago
- Train your own small BitNet model ☆55 · Updated 3 weeks ago
- A pipeline for LLM knowledge distillation ☆77 · Updated 3 months ago
- RWKV in nanoGPT style ☆177 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" ☆261 · Updated last year
- Experiments on speculative sampling with Llama models (see the accept/reject sketch after this list) ☆117 · Updated last year
- tinygrad port of the RWKV large language model ☆43 · Updated 4 months ago
- instinct.cpp provides ready-to-use alternatives to the OpenAI Assistant API and built-in utilities for developing AI agent applications (RAG,… ☆37 · Updated 4 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated last month
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated 2 weeks ago
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆73 · Updated 3 weeks ago
- QuIP quantization ☆46 · Updated 7 months ago
- Data preparation code for the Amber 7B LLM ☆82 · Updated 6 months ago
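For the 1-bit LLMs entry above: the BitNet b1.58 paper quantizes weights with an "absmean" rule, scaling by the mean absolute value and rounding into {-1, 0, +1}. Below is a minimal sketch of that rule; the function name `absmean_quantize` and the layout are mine, not taken from any listed repository.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the absmean ternary quantizer from "The Era of 1-bit LLMs"
// (BitNet b1.58): W~ = clip(round(W / (gamma + eps)), -1, 1), where
// gamma is the mean absolute weight. Returns gamma so W ~= gamma * Wq.
float absmean_quantize(const std::vector<float>& w, std::vector<int8_t>& wq) {
    const float eps = 1e-6f;
    float gamma = 0.0f;
    for (float x : w) gamma += std::fabs(x);
    gamma /= static_cast<float>(w.size());  // mean absolute value

    wq.resize(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        int q = static_cast<int>(std::lround(w[i] / (gamma + eps)));
        wq[i] = static_cast<int8_t>(std::clamp(q, -1, 1));  // {-1, 0, +1}
    }
    return gamma;
}

int main() {
    std::vector<float> w = {0.9f, -0.2f, 0.05f, -1.3f};
    std::vector<int8_t> wq;
    const float gamma = absmean_quantize(w, wq);
    std::printf("gamma = %.4f, wq = [%d, %d, %d, %d]\n",
                gamma, wq[0], wq[1], wq[2], wq[3]);
    return 0;
}
```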
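For the speculative-sampling entry: the core of the method (Leviathan et al., 2023; Chen et al., 2023) is a draft-then-verify accept/reject step that provably preserves the target model's distribution. The sketch below shows just that step, with stub probability vectors standing in for real model outputs; `verify_token` is a hypothetical name, not an API from any listed project.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Accept a drafted token with probability min(1, p/q), where p is the
// target model's probability and q is the draft model's. On rejection,
// resample from the residual distribution max(0, p - q), renormalized.
int verify_token(int drafted, const std::vector<float>& p,
                 const std::vector<float>& q, std::mt19937& rng) {
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    if (unif(rng) < std::min(1.0f, p[drafted] / q[drafted]))
        return drafted;  // accepted: keep the cheap draft token

    std::vector<float> residual(p.size());
    for (size_t i = 0; i < p.size(); ++i)
        residual[i] = std::max(0.0f, p[i] - q[i]);
    // discrete_distribution normalizes the residual weights internally.
    std::discrete_distribution<int> resample(residual.begin(), residual.end());
    return resample(rng);  // rejected: draw from the corrected distribution
}

int main() {
    std::mt19937 rng(42);
    std::vector<float> p = {0.6f, 0.3f, 0.1f};  // target model probs
    std::vector<float> q = {0.2f, 0.5f, 0.3f};  // draft model probs
    const int token = verify_token(/*drafted=*/1, p, q, rng);
    std::printf("emitted token %d\n", token);
    return 0;
}
```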