arakhmati/torchtrail
torchtrail: trace the graph of torch functions and modules for visualization, reports, etc.
☆25 · Updated 11 months ago
Alternatives and similar repositories for torchtrail
Users interested in torchtrail are comparing it to the libraries listed below.
- Extensible collectives library in Triton ☆86 · Updated last month
- LLM training in simple, raw C/CUDA ☆95 · Updated last year
- High-performance SGEMM on CUDA devices ☆91 · Updated 3 months ago
- Explore training for quantized models ☆18 · Updated 4 months ago
- A place to store reusable transformer components of my own creation or found on the interwebs ☆55 · Updated this week
- Experimental GPU language with meta-programming ☆22 · Updated 8 months ago
- Make Triton easier ☆47 · Updated 11 months ago
- Experiment of using Tangent to autodiff Triton ☆78 · Updated last year
- Repository for sparse finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆41 · Updated last year
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆173 · Updated last week
- ☆205 · Updated 3 weeks ago
- TTNN compiler for PyTorch 2: enables running PyTorch models on Tenstorrent hardware via the torch.compile path ☆38 · Updated this week
- ☆203 · Updated 10 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity ☆73 · Updated 8 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆109 · Updated 10 months ago
- Fast low-bit matmul kernels in Triton ☆301 · Updated this week
- ☆21 · Updated 2 months ago
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- Applied AI experiments and examples for PyTorch ☆267 · Updated this week
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆106 · Updated 7 months ago
- Attention in SRAM on Tenstorrent Grayskull ☆35 · Updated 10 months ago
- ☆79 · Updated 6 months ago
- Cray-LM unified training and inference stack ☆22 · Updated 3 months ago
- ☆158 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 ☆45 · Updated 10 months ago
- train with kittens! ☆57 · Updated 6 months ago
- ☆16 · Updated 7 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI ☆132 · Updated last year
- Tenstorrent's MLIR-based compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s… ☆48 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆187 · Updated 11 months ago