BlinkDL / RWKV-CUDA
The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM)
☆222 · Updated 4 months ago
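The heart of such a port is a kernel for RWKV's WKV time-mixing recurrence, where each channel keeps a running exponentially decayed numerator/denominator pair. Below is a minimal, self-contained sketch of a v4-style numerically stable scan; the kernel name `wkv_forward`, the `[B, T, C]` layout, and the convention that `w` already holds the negated decay are illustrative assumptions, not the repository's actual interface.

```cuda
// Hypothetical sketch of a v4-style WKV scan; not the repo's actual API.
#include <cstdio>
#include <cuda_runtime.h>

#define MIN_VALUE (-1e38f)

// One thread owns one (batch, channel) lane and scans over time T.
// k, v, y are laid out [B, T, C]; w (pre-negated decay) and u (bonus) are [C].
__global__ void wkv_forward(int B, int T, int C,
                            const float* __restrict__ w,
                            const float* __restrict__ u,
                            const float* __restrict__ k,
                            const float* __restrict__ v,
                            float* __restrict__ y) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * C) return;
    int b = idx / C, c = idx % C;
    int offset = b * T * C + c;

    float ww = w[c];  // assumed to hold -decay, i.e. already negated
    float uu = u[c];
    // p/q are the running numerator/denominator, stored scaled by exp(-o)
    // so the exponentials never overflow.
    float p = 0.f, q = 0.f, o = MIN_VALUE;

    for (int t = 0; t < T; t++) {
        int i = offset + t * C;
        // Output for token t applies the "bonus" u to the current key.
        float no = fmaxf(o, uu + k[i]);
        float A  = __expf(o - no);
        float Bv = __expf(uu + k[i] - no);
        y[i] = (A * p + Bv * v[i]) / (A * q + Bv);
        // Then decay the state and absorb token t into it.
        no = fmaxf(ww + o, k[i]);
        A  = __expf(ww + o - no);
        Bv = __expf(k[i] - no);
        p = A * p + Bv * v[i];
        q = A * q + Bv;
        o = no;
    }
}

int main() {
    const int B = 1, T = 4, C = 2, N = B * T * C;
    float hw[C] = {-0.5f, -0.9f}, hu[C] = {0.1f, 0.2f};  // toy parameters
    float hk[N], hv[N], hy[N];
    for (int i = 0; i < N; i++) { hk[i] = 0.01f * i; hv[i] = 1.f + i; }

    float *dw, *du, *dk, *dv, *dy;
    cudaMalloc(&dw, C * sizeof(float)); cudaMalloc(&du, C * sizeof(float));
    cudaMalloc(&dk, N * sizeof(float)); cudaMalloc(&dv, N * sizeof(float));
    cudaMalloc(&dy, N * sizeof(float));
    cudaMemcpy(dw, hw, C * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(du, hu, C * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dk, hk, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dv, hv, N * sizeof(float), cudaMemcpyHostToDevice);

    wkv_forward<<<1, B * C>>>(B, T, C, dw, du, dk, dv, dy);
    cudaMemcpy(hy, dy, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("y[%d] = %f\n", i, hy[i]);
    return 0;
}
```

Because each thread handles one (batch, channel) lane, parallelism scales with B*C while the sequential dependence over T stays inside the thread, which is why the max-shifted exponentials are needed for stability over long sequences.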
Alternatives and similar repositories for RWKV-CUDA:
Users interested in RWKV-CUDA are comparing it to the repositories listed below.
- Reorder-based post-training quantization for large language models · ☆187 · Updated last year
- GPTQ inference Triton kernel · ☆300 · Updated last year
- The official implementation of the EMNLP 2023 paper LLM-FP4 · ☆197 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) · ☆248 · Updated 5 months ago
- This repository contains integer operators on GPUs for PyTorch · ☆202 · Updated last year
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" · ☆284 · Updated last month
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆340 · Updated 8 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline · ☆106 · Updated 9 months ago
- 📚 FFPA (Split-D): Yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x faster than SDPA EA 🎉