okuvshynov / llama_duo
asynchronous/distributed speculative evaluation for llama3
☆38 · Updated last year
Alternatives and similar repositories for llama_duo
Users who are interested in llama_duo are comparing it to the libraries listed below.
- Inference of Mamba models in pure C ☆191 · Updated last year
- The Finite Field Assembly Programming Language ☆36 · Updated 4 months ago
- A minimalistic C++ Jinja templating engine for LLM chat templates ☆187 · Updated 2 weeks ago
- High-Performance SGEMM on CUDA devices ☆105 · Updated 8 months ago
- Samples of good AI generated CUDA kernels ☆91 · Updated 4 months ago
- Custom PTX Instruction Benchmark ☆128 · Updated 7 months ago
- A faithful clone of Karpathy's llama2.c (one file inference, zero dependency) but fully functional with LLaMA 3 8B base and instruct mode… ☆138 · Updated last year
- LLM training in simple, raw C/CUDA ☆105 · Updated last year
- Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024 ☆182 · Updated last year
- GGML implementation of BERT model with Python bindings and quantization. ☆55 · Updated last year
- GGUF implementation in C as a library and a tools CLI program ☆291 · Updated last month
- Thin wrapper around GGML to make life easier ☆39 · Updated 3 months ago
- tiny code to access tenstorrent blackhole ☆59 · Updated 4 months ago
- Inference RWKV v7 in pure C. ☆40 · Updated last week
- Python bindings for ggml ☆146 · Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆373 · Updated this week
- General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). … ☆52 · Updated 7 months ago
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆48 · Updated 6 months ago
- Super fast FP32 matrix multiplication on RDNA3 ☆74 · Updated 6 months ago
- iterate quickly with llama.cpp hot reloading. use the llama.cpp bindings with bun.sh ☆50 · Updated last year
- RDNA3 emulator ☆54 · Updated 5 months ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆151 · Updated 9 months ago
- Experiments with BitNet inference on CPU ☆54 · Updated last year
- Prepare for DeepSeek R1 inference: Benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code. ☆73 · Updated 8 months ago
- Simple high-throughput inference library ☆142 · Updated 4 months ago
- tinygrad port of the RWKV large language model. ☆45 · Updated 7 months ago
- ☆218 · Updated 8 months ago
- Learning about CUDA by writing PTX code. ☆138 · Updated last year
- GPT2 implementation in C++ using Ort ☆26 · Updated 4 years ago
- Attention in SRAM on Tenstorrent Grayskull ☆38 · Updated last year