daemyung / practice-triton
Triangles in action! Triton
☆16 · Updated last year
Alternatives and similar repositories for practice-triton
Users interested in practice-triton are comparing it to the libraries listed below.
- A performance library for machine learning applications. ☆184 · Updated 2 years ago
- Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)* ☆87 · Updated last year
- A hackable, simple, and research-friendly GRPO training framework with high-speed weight synchronization in a multi-node environment. ☆31 · Updated 2 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- ☆27 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆60 · Updated last year
- Flexibly track outputs and grad-outputs of torch.nn.Module. ☆13 · Updated 2 years ago
- Easy and Efficient Quantization for Transformers ☆202 · Updated 4 months ago
- Pytorch/XLA SPMD Test code in Google TPU ☆23 · Updated last year
- Automatic differentiation for Triton Kernels ☆13 · Updated 2 months ago
- some common Huggingface transformers in maximal update parametrization (µP) ☆86 · Updated 3 years ago
- ☆83 · Updated last year
- OSLO: Open Source for Large-scale Optimization ☆174 · Updated 2 years ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆92 · Updated 3 months ago
- Official implementation for Training LLMs with MXFP4 ☆101 · Updated 6 months ago
- ring-attention experiments ☆155 · Updated last year
- ☆69 · Updated last year
- ☆121 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆94 · Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆223 · Updated last year
- 🔮 LLM GPU Calculator ☆21 · Updated 2 years ago
- JORA: JAX Tensor-Parallel LoRA Library (ACL 2024) ☆36 · Updated last year
- ☆46 · Updated last year
- Transformers components but in Triton ☆34 · Updated 5 months ago
- Implementation of the Llama architecture with RLHF + Q-learning ☆167 · Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆85 · Updated last year
- Mixed precision training from scratch with Tensors and CUDA ☆28 · Updated last year
- Experiment of using Tangent to autodiff triton ☆80 · Updated last year
- Load compute kernels from the Hub ☆308 · Updated last week
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · Updated last year