BlinkDL / RWKV-CUDALinks

The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )

☆223

Alternatives and similar repositories for RWKV-CUDA

Users that are interested in RWKV-CUDA are comparing it to the libraries listed below

Sorting:

hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆194Updated 2 years ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158Updated 2 years ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217Updated last year
haochengxi / Train_Transformers_with_INT4
☆156Updated 2 years ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆311Updated 2 years ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆108Updated 2 years ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆226Updated 2 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆268Updated 3 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112Updated last year
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆133Updated 2 years ago
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆330Updated 8 months ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆283Updated last year
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆180Updated 7 months ago
tpoisonooo / llama.onnx
LLaMa/RWKV onnx models, quantization and testcase
☆367Updated 2 years ago
shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆172Updated 9 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆120Updated last year
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆317Updated 7 months ago
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132Updated 2 years ago
InternLM / turbomind
☆97Updated 7 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144Updated 2 months ago
casper-hansen / AutoAWQ_kernels
☆78Updated 11 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆323Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41Updated 8 months ago
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆220Updated 2 years ago
OpenPPL / ppl.llm.serving
☆129Updated 10 months ago
mit-han-lab / tinychat-tutorial
☆77Updated 11 months ago
daquexian / faster-rwkv
☆124Updated last year
microsoft / AttentionEngine
☆102Updated 5 months ago
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆479Updated last year
megvii-research / Sparsebit
A model compression and acceleration toolbox based on pytorch.
☆331Updated last year