lambdal / deeplearning-benchmark
Benchmark Suite for Deep Learning
☆250 · Updated this week
Related projects
Alternatives and complementary repositories for deeplearning-benchmark
- Pipeline Parallelism for PyTorch ☆725 · Updated 3 months ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆322 · Updated last month
- A GPU performance profiling tool for PyTorch models ☆495 · Updated 3 years ago
- A library to analyze PyTorch traces. ☆306 · Updated this week
- TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. ☆876 · Updated this week
- Benchmarks of well-known CNN models, implemented in PyTorch, run on various GPUs. ☆227 · Updated 4 months ago
- TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup… ☆331 · Updated last week
- Container plugin for Slurm Workload Manager ☆295 · Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆212 · Updated 3 months ago
- Microsoft Automatic Mixed Precision Library ☆525 · Updated last month
- Applied AI experiments and examples for PyTorch ☆168 · Updated 3 weeks ago
- Implementation of a Transformer, but completely in Triton ☆248 · Updated 2 years ago
- A CPU+GPU profiling library that provides access to timeline traces and hardware performance counters. ☆734 · Updated this week
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ☆1,010 · Updated 7 months ago
- Provides Python access to the NVML library for GPU diagnostics ☆220 · Updated 3 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆420 · Updated this week
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆176 · Updated this week
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆146 · Updated this week
- Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory, and energy consumption ☆92 · Updated last year
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16–32 tokens. ☆627 · Updated 2 months ago
- Zero Bubble Pipeline Parallelism ☆283 · Updated last week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆211 · Updated 3 weeks ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆1,982 · Updated this week
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ☆744 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆483 · Updated 3 weeks ago
- Slicing a PyTorch Tensor Into Parallel Shards ☆296 · Updated 3 years ago
- cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it ☆455 · Updated last month
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆451 · Updated 2 weeks ago
- JAX-Toolbox ☆249 · Updated this week
- Torch Distributed Experimental ☆116 · Updated 3 months ago