aws-neuron / nki-llama
Project showing how to develop NKI kernels for Llama 3.2 1B inference
☆19Updated 4 months ago
Alternatives and similar repositories for nki-llama
Users who are interested in nki-llama are comparing it to the libraries listed below.
- ☆47Updated this week
- ☆60Updated last week
- ☆15Updated this week
- A schedule language for large model training☆151Updated last month
- ☆121Updated 9 months ago
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆223Updated last week
- ☆83Updated 2 years ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems☆581Updated 2 weeks ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.☆33Updated last month
- ☆242Updated this week
- GitHub mirror of the triton-lang/triton repo.☆78Updated this week
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.☆66Updated 6 months ago
- ☆238Updated last year
- ☆22Updated last week
- Perplexity GPU Kernels☆476Updated 2 weeks ago
- ☆144Updated 4 months ago
- ☆37Updated 2 months ago
- DietCode Code Release☆65Updated 3 years ago
- ☆30Updated last year
- Cataloging released Triton kernels.☆261Updated 3 weeks ago
- MLIR-based partitioning system☆135Updated this week
- Distributed MoE in a Single Kernel [NeurIPS '25]☆49Updated this week
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity☆221Updated 2 years ago
- ☆108Updated last year
- Compiler for Dynamic Neural Networks☆46Updated last year
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆89Updated 2 years ago
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".☆48Updated last week
- ☆23Updated last month
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆328Updated this week
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆23Updated 4 months ago