gpu-mode / popcorn-cli

☆26

Alternatives and similar repositories for popcorn-cli

Users who are interested in popcorn-cli compare it to the repositories listed below.

triton-lang / kernels
☆81 Updated 7 months ago
gpu-mode / reference-kernels
Reference Kernels for the Leaderboard
☆60 Updated last week
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆127 Updated 3 weeks ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆90 Updated 3 weeks ago
flashinfer-ai / cutlass-viz
☆60 Updated 2 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆109 Updated 11 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆143 Updated last week
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆117 Updated this week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 5 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆87 Updated 6 months ago
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆50 Updated last week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆69 Updated last month
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆82 Updated last month
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆49 Updated 2 weeks ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆77 Updated last week
yifuwang / symm-mem-recipes
☆90 Updated 6 months ago
rchardx / cuda-gemm
☆23 Updated 2 months ago
ColfaxResearch / cutlass-kernels
☆212 Updated 11 months ago
microsoft / AttentionEngine
☆71 Updated last month
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆59 Updated 2 weeks ago
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆170 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆132 Updated 2 months ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆63 Updated 9 months ago
HPMLL / NVIDIA-Hopper-Benchmark
☆44 Updated 3 weeks ago
microsoft / SparTA
☆148 Updated 11 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆76 Updated 5 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆48 Updated 3 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆167 Updated this week
ColfaxResearch / cfx-article-src
☆117 Updated last month
cchan / tccl
extensible collectives library in triton
☆86 Updated 2 months ago