aws-neuron / nki-llama
Project showing how to develop NKI kernels for Llama 3.2 1B inference
☆20 Updated 7 months ago
Alternatives and similar repositories for nki-llama
Users who are interested in nki-llama are comparing it to the libraries listed below.
- ☆58 Updated 2 weeks ago
- ☆15 Updated last week
- ☆63 Updated 3 weeks ago
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning ☆25 Updated 8 months ago
- ☆84 Updated 3 years ago
- A schedule language for large model training ☆152 Updated 4 months ago
- GitHub mirror of the triton-lang/triton repo. ☆119 Updated this week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆83 Updated 3 months ago
- ☆154 Updated last year
- Synthesizer for optimal collective communication algorithms ☆123 Updated last year
- LLM serving cluster simulator ☆132 Updated last year
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆247 Updated this week
- ☆47 Updated 3 years ago
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23) ☆92 Updated 2 years ago
- Perplexity GPU Kernels ☆552 Updated 2 months ago
- Official implementation for the paper Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapp… ☆14 Updated 2 months ago
- ☆23 Updated 4 months ago
- TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches ☆79 Updated 2 years ago
- ☆256 Updated last year
- ATLAHS: An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage ☆64 Updated this week
- Bamboo is a system for running large pipeline-parallel DNNs affordably, reliably, and efficiently using spot instances. ☆55 Updated 3 years ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. ☆69 Updated 9 months ago
- Microsoft Collective Communication Library ☆66 Updated last year
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression. ☆36 Updated 4 months ago
- ☆63 Updated 6 months ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆155 Updated this week
- Compiler for Dynamic Neural Networks ☆45 Updated 2 years ago
- Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and i… ☆571 Updated this week
- The ASPLOS 2025 / EuroSys 2025 Contest Track ☆38 Updated 5 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆148 Updated last week