daquexian / faster-rwkv
☆123 · Updated 10 months ago
Related projects
Alternatives and complementary repositories for faster-rwkv
- Inference of RWKV5 or RWKV6 with the Qualcomm AI Engine Direct SDK ☆36 · Updated this week
- Simplify large (>2 GB) ONNX models ☆42 · Updated 8 months ago
- Stable Diffusion using MNN ☆62 · Updated last year
- LLM deployment project based on ONNX. ☆26 · Updated last month
- ☆82 · Updated last year
- A converter for llama2.c legacy models to ncnn models. ☆82 · Updated 10 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆26 · Updated 2 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated last month
- ☆140 · Updated 6 months ago
- A Toolkit to Help Optimize Large Onnx Model ☆147 · Updated 5 months ago
- ☆28 · Updated 3 months ago
- ☆56 · Updated this week
- An easy-to-use package for implementing SmoothQuant for LLMs ☆82 · Updated 5 months ago
- ☆123 · Updated this week
- A simplified flash-attention implementation using CUTLASS, intended as a teaching example ☆31 · Updated 2 months ago
- NVIDIA TensorRT Hackathon 2023 semifinal topic: building and optimizing Tongyi Qianwen Qwen-7B with TensorRT-LLM ☆40 · Updated last year
- FP8 flash attention on the Ada architecture using the CUTLASS library ☆51 · Updated 2 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline. ☆87 · Updated 3 months ago
- A converter and basic tester for rwkv onnx ☆41 · Updated 9 months ago
- Qwen2 and Llama3 C++ implementation ☆34 · Updated 5 months ago
- A Toolkit to Help Optimize Onnx Model ☆75 · Updated this week
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆24 · Updated this week
- Inference RWKV with multiple supported backends. ☆26 · Updated 2 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆20 · Updated 7 months ago
- ☆79 · Updated 2 months ago
- A quantization algorithm for LLMs ☆101 · Updated 4 months ago
- Transformer-related optimization, including BERT and GPT ☆60 · Updated last year
- OneFlow->ONNX ☆42 · Updated last year
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime. ☆148 · Updated last month