okuvshynov / llama_duo
asynchronous/distributed speculative evaluation for llama3
☆37 · Updated 3 months ago
Related projects
Alternatives and complementary repositories for llama_duo
- Inference of Mamba models in pure C ☆179 · Updated 8 months ago
- GGML implementation of BERT model with Python bindings and quantization. ☆51 · Updated 9 months ago
- Testing LLM reasoning abilities with family relationship quizzes. ☆43 · Updated this week
- llama.cpp fork with additional SOTA quants and improved performance ☆94 · Updated this week
- iterate quickly with llama.cpp hot reloading. use the llama.cpp bindings with bun.sh ☆46 · Updated last year
- LLM training in simple, raw C/CUDA ☆87 · Updated 6 months ago
- Stable Diffusion in pure C/C++ ☆60 · Updated last year
- Experiments with BitNet inference on CPU ☆50 · Updated 7 months ago
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆173 · Updated last week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024 ☆173 · Updated 7 months ago
- Python bindings for ggml ☆132 · Updated 2 months ago
- Explore training for quantized models ☆10 · Updated 2 weeks ago
- Course Project for COMP4471 on RWKV ☆16 · Updated 9 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆35 · Updated 6 months ago
- Download full or partial git-lfs repos without temporarily using 2x disk space ☆30 · Updated last year
- General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). … ☆41 · Updated last month
- Port of Microsoft's BioGPT in C/C++ using ggml ☆88 · Updated 9 months ago
- An experimental CPU backend for Triton ☆56 · Updated last week
- ☆101 · Updated last month
- minimal C implementation of speculative decoding based on llama2.c ☆17 · Updated 4 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆147 · Updated this week
- ☆71 · Updated this week
- ☆44 · Updated 4 months ago
- ☆49 · Updated 2 weeks ago
- Train your own small bitnet model ☆56 · Updated last month
- IREE's PyTorch Frontend, based on Torch Dynamo. ☆56 · Updated this week
- extensible collectives library in triton ☆72 · Updated 2 months ago
- Web browser version of StarCoder.cpp ☆43 · Updated last year