stanford-cs149 / asst4-trainium
☆28 Updated 7 months ago
Alternatives and similar repositories for asst4-trainium
Users who are interested in asst4-trainium are comparing it to the libraries listed below.
- ☆225 Updated last week
- extensible collectives library in triton ☆87 Updated 3 months ago
- ☆33 Updated 2 weeks ago
- ☆72 Updated last year
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆135 Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆187 Updated this week
- TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code… ☆131 Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆67 Updated this week
- A schedule language for large model training ☆149 Updated last year
- ☆110 Updated 4 months ago
- ☆38 Updated last week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆475 Updated this week
- Collection of kernels written in Triton language ☆136 Updated 3 months ago
- Cataloging released Triton kernels. ☆245 Updated 6 months ago
- ☆28 Updated 6 months ago
- A Data-Centric Compiler for Machine Learning ☆84 Updated last year
- MLIR-based partitioning system ☆103 Updated this week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆206 Updated this week
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆57 Updated last month
- ☆20 Updated 2 months ago
- ☆83 Updated 8 months ago
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators ☆62 Updated last month
- ☆169 Updated last year
- ring-attention experiments ☆143 Updated 9 months ago
- High-Performance SGEMM on CUDA devices ☆97 Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆196 Updated 2 months ago
- a minimal cache manager for PagedAttention, on top of llama3. ☆93 Updated 10 months ago
- An experimental CPU backend for Triton ☆135 Updated last month
- Effective transpose on Hopper GPU ☆23 Updated 2 months ago
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆53 Updated 3 weeks ago