aws-neuron / nki-llama
Project showing how to develop NKI kernels for Llama 3.2 1B inference
☆20 · Updated 8 months ago
Alternatives and similar repositories for nki-llama
Users interested in nki-llama are comparing it to the repositories listed below.
- ☆60 · Updated this week
- ☆17 · Updated this week
- A schedule language for large model training · ☆152 · Updated 5 months ago
- ☆64 · Updated last month
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning · ☆25 · Updated 8 months ago
- ☆159 · Updated last year
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … · ☆262 · Updated this week
- GitHub mirror of the triton-lang/triton repo. · ☆128 · Updated this week
- ☆84 · Updated 3 years ago
- ☆18 · Updated last year
- Perplexity GPU Kernels · ☆560 · Updated 3 months ago
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". · ☆103 · Updated 4 months ago
- Building the Virtuous Cycle for AI-driven LLM Systems · ☆151 · Updated last week
- Autocomp: AI-Driven Code Optimizer for Tensor Accelerators · ☆70 · Updated last week
- The ASPLOS 2025 / EuroSys 2025 Contest Track · ☆39 · Updated 6 months ago
- Artifact from "Hardware Compute Partitioning on NVIDIA GPUs". THIS IS A FORK OF BAKITA'S REPO. I AM NOT ONE OF THE AUTHORS OF THE PAPER. · ☆55 · Updated 2 months ago
- LLM serving cluster simulator · ☆135 · Updated last year
- ☆104 · Updated last year
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… · ☆96 · Updated last month
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming · ☆168 · Updated this week
- ☆48 · Updated last year
- Synthesizer for optimal collective communication algorithms · ☆124 · Updated last year
- Microsoft Collective Communication Library · ☆66 · Updated last year
- ☆175 · Updated 9 months ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression. · ☆36 · Updated 5 months ago
- FlashInfer Bench @ MLSys 2026: Building AI agents to write high performance GPU kernels · ☆84 · Updated 2 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs) · ☆792 · Updated 3 weeks ago
- ☆36 · Updated 12 years ago
- Boost hardware utilization for ML training workloads via Inter-model Horizontal Fusion · ☆32 · Updated last year
- torchcomms: a modern PyTorch communications API · ☆327 · Updated this week