stanford-cs149 / asst4-trainium

☆28

Alternatives and similar repositories for asst4-trainium

Users who are interested in asst4-trainium are comparing it to the libraries listed below.

Deep-Learning-Profiling-Tools / triton-viz
☆225 Updated last week
cchan / tccl
extensible collectives library in triton
☆87 Updated 3 months ago
gpu-mode / popcorn-cli
☆33 Updated 2 weeks ago
stanford-cs149 / cs149gpt
☆72 Updated last year
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆135 Updated last year
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆187 Updated this week
pytorch-labs / tritonparse
TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code…
☆131 Updated this week
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆67 Updated this week
awslabs / slapo
A schedule language for large model training
☆149 Updated last year
bertmaher / simplegemm
☆110 Updated 4 months ago
aws-neuron / nki-samples
☆38 Updated last week
ScalingIntelligence / KernelBench
KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems
☆475 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆136 Updated 3 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆245 Updated 6 months ago
Jokeren / triton-samples
☆28 Updated 6 months ago
spcl / daceml
A Data-Centric Compiler for Machine Learning
☆84 Updated last year
openxla / shardy
MLIR-based partitioning system
☆103 Updated this week
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆206 Updated this week
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆57 Updated last month
ademeure / cuda-side-boost
☆20 Updated 2 months ago
triton-lang / kernels
☆83 Updated 8 months ago
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆62 Updated last month
albanD / subclass_zoo
☆169 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆143 Updated 9 months ago
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆97 Updated 5 months ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆196 Updated 2 months ago
tspeterkim / paged-attention-minimal
a minimal cache manager for PagedAttention, on top of llama3.
☆93 Updated 10 months ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆135 Updated last month
simveit / effective_transpose
Effective transpose on Hopper GPU
☆23 Updated 2 months ago
NVIDIA / compute-eval
Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…
☆53 Updated 3 weeks ago