cli99 / flops-profiler
pytorch-profiler
☆51 · Updated 2 years ago
Alternatives and similar repositories for flops-profiler
Users interested in flops-profiler are comparing it to the libraries listed below.
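flops-profiler and the profilers listed below estimate per-operator FLOPs analytically from tensor shapes. As a rough illustration of the usual counting convention (the function names here are hypothetical, not part of any listed library):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # A (m×k) @ B (k×n): each of the m·n outputs takes k multiplies and
    # k-1 adds; the common convention rounds this to 2·m·k·n FLOPs.
    return 2 * m * k * n

def linear_layer_flops(batch: int, in_features: int, out_features: int) -> int:
    # A dense layer is a matmul plus one bias add per output element.
    return matmul_flops(batch, in_features, out_features) + batch * out_features
```

Real profilers apply formulas like these per module and sum over a forward pass; hooks on `nn.Module` supply the shapes at runtime.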
- ☆158 · Updated last year
- ☆107 · Updated 11 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆107 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆199 · Updated this week
- This repository contains integer operators on GPUs for PyTorch. ☆211 · Updated last year
- ☆150 · Updated last year
- High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline. ☆113 · Updated last year
- Training neural networks in TensorFlow 2.0 with 5x less memory. ☆132 · Updated 3 years ago
- ☆79 · Updated 6 months ago
- Dynamic Tensor Rematerialization prototype (modified PyTorch) and simulator. Paper: https://arxiv.org/abs/2006.09616 ☆132 · Updated 2 years ago
- A Python library that transfers PyTorch tensors between CPU and NVMe. ☆117 · Updated 8 months ago
- ☆228 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆200 · Updated this week
- Fast Hadamard transform in CUDA, with a PyTorch interface. ☆213 · Updated last year
- ☆144 · Updated 6 months ago
- A collection of memory-efficient attention operators implemented in the Triton language. ☆275 · Updated last year
- ☆42 · Updated 2 years ago
- Odysseus: Playground of LLM Sequence Parallelism. ☆72 · Updated last year
- System for automated integration of deep learning backends. ☆47 · Updated 2 years ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training. ☆212 · Updated 11 months ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning. ☆137 · Updated 2 years ago
- Benchmark code for the "Online normalizer calculation for softmax" paper. ☆95 · Updated 7 years ago
- ☆154 · Updated 2 years ago
- ☆85 · Updated 9 months ago
- ☆75 · Updated 2 months ago
- ☆102 · Updated 7 months ago
- LLaMA INT4 CUDA inference with AWQ. ☆54 · Updated 6 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆260 · Updated 3 weeks ago
- ☆75 · Updated 4 years ago
- High Performance Grouped GEMM in PyTorch. ☆30 · Updated 3 years ago
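One entry above benchmarks the "Online normalizer calculation for softmax" paper (Milakov & Gimelshein, 2018), the single-pass trick that FlashAttention-style kernels build on. A minimal pure-Python sketch of the idea (the CUDA benchmark itself is far more involved):

```python
import math

def online_softmax(xs):
    # Single pass over the input: keep a running max m and a running
    # sum d of exp(x - m), rescaling d whenever a larger max appears.
    # This fuses the usual "find max, then sum exp" two-pass scheme.
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

The rescaling step `d * exp(m - m_new)` is what lets the running sum stay consistent with the latest max, so the result matches the standard numerically stable softmax.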