openmlsys / openmlsys-en
《Machine Learning Systems: Design and Implementation》 (English Version)
☆37 · Updated last year
Alternatives and similar repositories for openmlsys-en
Users interested in openmlsys-en are comparing it to the libraries listed below.
- Systems for GenAI ☆157 · Updated last week
- A review of automated kernel generation in the era of LLMs ☆91 · Updated 2 weeks ago
- fmchisel: Efficient Compression and Training Algorithms for Foundation Models ☆83 · Updated 3 months ago
- A minimal cache manager for PagedAttention, built on top of llama3. ☆135 · Updated last year
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆248 · Updated last year
- PyTorch library for cost-effective, fast, and easy serving of MoE models. ☆279 · Updated this week
- Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of pap… ☆283 · Updated 11 months ago
- Materials for learning SGLang ☆738 · Updated last month
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆313 · Updated 7 months ago
- Papers and their code for AI systems ☆347 · Updated last month
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 · Updated last year
- Cataloging released Triton kernels. ☆292 · Updated 4 months ago
- A large-scale simulation framework for LLM inference ☆530 · Updated 6 months ago
- A curated list of awesome projects and papers for distributed training or inference ☆265 · Updated last year
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆260 · Updated last year
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆190 · Updated last week
- torchcomms: a modern PyTorch communications API ☆327 · Updated this week
- ☆96 · Updated 10 months ago
- JAX backend for SGL ☆234 · Updated this week
- ☆47 · Updated last year
- ☆166 · Updated 2 months ago
- Allow torch tensor memory to be released and resumed later ☆216 · Updated 3 weeks ago
- Modular and structured prompt caching for low-latency LLM inference ☆110 · Updated last year
- A low-latency & high-throughput serving engine for LLMs ☆470 · Updated last month
- ☆628 · Updated 3 weeks ago
- LLMem: GPU Memory Estimation for Fine-Tuning Pre-Trained LLMs ☆28 · Updated 8 months ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆250 · Updated this week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆87 · Updated last week
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆116 · Updated 2 months ago
- ☆222 · Updated last year