ARM-software / kleidiai
This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai
☆26 · Updated this week
Alternatives and similar repositories for kleidiai:
Users interested in kleidiai are comparing it to the libraries listed below.
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆40 · Updated last week
- ☆157 · Updated last week
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 6 months ago
- An experimental CPU backend for Triton ☆101 · Updated this week
- ☆23 · Updated last month
- ☆63 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline ☆102 · Updated 8 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆62 · Updated 3 weeks ago
- ☆116 · Updated 11 months ago
- DeepSeek-V3/R1 inference performance simulator ☆89 · Updated this week
- ☆73 · Updated 4 months ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores ☆58 · Updated 6 months ago
- ☆25 · Updated this week
- ☆45 · Updated this week
- Llama INT4 CUDA inference with AWQ ☆53 · Updated 2 months ago
- AI Tensor Engine for ROCm ☆142 · Updated this week
- ☆26 · Updated last week
- Fast low-bit matmul kernels in Triton ☆272 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆71 · Updated 6 months ago
- ☆13 · Updated 3 weeks ago
- Microsoft Collective Communication Library ☆60 · Updated 4 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆235 · Updated last month
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆90 · Updated last month
- [EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models ☆56 · Updated 6 months ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆111 · Updated 4 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆201 · Updated 4 months ago
- LLM training in simple, raw C/CUDA ☆92 · Updated 10 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆32 · Updated this week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆111 · Updated 3 months ago