open-lm-engine / flash-model-architectures

A bunch of kernels that might make stuff slower 😉

☆62

Alternatives and similar repositories for flash-model-architectures

Users who are interested in flash-model-architectures compare it to the libraries listed below.

cchan / tccl
extensible collectives library in triton
☆89 Updated 6 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆157 Updated 6 months ago
gpu-mode / ring-attention
ring-attention experiments
☆154 Updated last year
triton-lang / kernels
☆92 Updated 11 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264 Updated this week
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆381 Updated 3 weeks ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
Deep-Learning-Profiling-Tools / triton-viz
☆240 Updated this week
stanford-futuredata / stk
☆112 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299 Updated 2 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
andylolu2 / simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
☆39 Updated last year
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆38 Updated this week
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246 Updated 3 weeks ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124 Updated 4 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆116 Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆119 Updated 3 weeks ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆84 Updated last month
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆124 Updated last year
gau-nernst / quantized-training
Explore training for quantized models
☆25 Updated 3 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83 Updated last year
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆389 Updated this week
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆100 Updated 3 months ago
Dao-AILab / grouped-latent-attention
☆130 Updated 4 months ago
Jokeren / triton-samples
☆28 Updated 9 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated 2 years ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆233 Updated 5 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215 Updated this week
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆92 Updated 3 weeks ago
alexzhang13 / Triton-Puzzles-Solutions
Personal solutions to the Triton Puzzles
☆20 Updated last year