ptillet / triton-llvm-releases
☆22 · Updated 2 years ago
Alternatives and similar repositories for triton-llvm-releases
Users interested in triton-llvm-releases are comparing it to the libraries listed below.
- Benchmark tests supporting the TiledCUDA library. ☆18 · Updated last year
- FlexAttention w/ FlashAttention3 support ☆27 · Updated last year
- Inference framework for MoE layers based on TensorRT with Python bindings ☆41 · Updated 4 years ago
- ☆50 · Updated last year
- CUDA 12.2 HMM demos ☆20 · Updated last year
- ☆16 · Updated last year
- GPTQ inference TVM kernel ☆41 · Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆27 · Updated this week
- Framework that reduces autotune overhead to zero for well-known deployments. ☆91 · Updated 3 months ago
- TORCH_LOGS parser for PT2 ☆70 · Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆104 · Updated 5 months ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 · Updated 3 years ago
- Prototype routines for GPU quantization written using PyTorch. ☆21 · Updated 4 months ago
- ☆22 · Updated 2 years ago
- ☆71 · Updated 9 months ago
- torch.compile artifacts for common deep learning models; can be used as a learning resource for torch.compile. ☆18 · Updated 2 years ago
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆112 · Updated last year
- No-GIL Python environment featuring NVIDIA deep learning libraries. ☆69 · Updated 8 months ago
- PyTorch implementation of the Flash Spectral Transform Unit. ☆21 · Updated last year
- Customized matrix multiplication kernels ☆57 · Updated 3 years ago
- Ahead-of-Time (AOT) Triton math library ☆84 · Updated 2 weeks ago
- A tracing JIT for PyTorch ☆17 · Updated 3 years ago
- IntLLaMA: a fast and light quantization solution for LLaMA ☆18 · Updated 2 years ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆123 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · Updated last year
- ☆23 · Updated 7 months ago
- ☆99 · Updated last year
- Extensible collectives library in Triton ☆91 · Updated 8 months ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆123 · Updated last year
- Make Triton easier ☆49 · Updated last year