andrewkchan / yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
☆211, updated this week
Alternatives and similar repositories for yalm:
Users interested in yalm are comparing it to the repositories listed below.
- Materials for learning SGLang (☆166, updated last week)
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs (☆219, updated last week)
- Fastest kernels written from scratch (☆118, updated last month)
- Fast low-bit matmul kernels in Triton (☆187, updated last week)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving (☆481, updated 2 months ago)
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. (☆496, updated this week)
- ☆170, updated this week
- Repository for the QUIK project, enabling the use of 4-bit kernels for generative inference - EMNLP 2024 (☆175, updated 9 months ago)
- Flash Attention in ~100 lines of CUDA (forward pass only) (☆681, updated 2 weeks ago)
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆272, updated last month)
- Efficient LLM Inference over Long Sequences (☆344, updated 2 weeks ago)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (☆290, updated 6 months ago)
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. (☆680, updated 4 months ago)
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). (☆230, updated 2 months ago)
- ☆185, updated last month
- Applied AI experiments and examples for PyTorch (☆211, updated this week)
- CUDA/Metal-accelerated language model inference (☆489, updated last month)
- Cataloging released Triton kernels. (☆155, updated last week)
- A low-latency & high-throughput serving engine for LLMs (☆296, updated 4 months ago)
- A throughput-oriented high-performance serving framework for LLMs (☆692, updated 3 months ago)
- A minimal cache manager for PagedAttention, built on top of llama3. (☆59, updated 4 months ago)
- Fast Inference of MoE Models with CPU-GPU Orchestration (☆179, updated 2 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆257, updated 3 months ago)
- A scalable and robust tree-based speculative decoding algorithm (☆329, updated 5 months ago)
- Ring-attention experiments (☆116, updated 3 months ago)
- Easy and Efficient Quantization for Transformers (☆191, updated last month)
- ☆178, updated 6 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS (☆244, updated 2 weeks ago)
- Boosting 4-bit inference kernels with 2:4 Sparsity (☆64, updated 4 months ago)
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. (☆93, updated 6 months ago)