marsupialtail / sparsednnLinks

Fast sparse deep learning on CPUs

☆56

Alternatives and similar repositories for sparsednn

Users that are interested in sparsednn are comparing it to the libraries listed below

Sorting:

spcl / substation
Research and development for optimizing transformers
☆131Updated 4 years ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158Updated 2 years ago
masahi / torchscript-to-tvm
☆68Updated 2 years ago
YashasSamaga / ConvolutionBuildingBlocks
GEMM and Winograd based convolutions using CUTLASS
☆28Updated 5 years ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆117Updated last year
parasj / checkmate
Training neural networks in TensorFlow 2.0 with 5x less memory
☆136Updated 3 years ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆102Updated 7 years ago
google-research / sputnik
A library of GPU kernels for sparse matrix operations.
☆275Updated 4 years ago
cmu-catalyst / collage
System for automated integration of deep learning backends.
☆47Updated 3 years ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆138Updated 2 years ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 9 months ago
YulhwaKim / cutlass_tilesparse
CUDA templates for tile-sparse matrix multiplication based on CUTLASS.
☆50Updated 7 years ago
iree-org / iree-nvgpu
☆50Updated last year
DeMoriarty / custom_matmul_kernels
Customized matrix multiplication kernels
☆57Updated 3 years ago
awslabs / raf
☆145Updated 9 months ago
mit-han-lab / inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
☆200Updated 3 years ago
graphcore / tutorials
Training material for IPU users: tutorials, feature examples, simple applications
☆87Updated 2 years ago
IntelLabs / FP8-Emulation-Toolkit
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
☆111Updated 10 months ago
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆65Updated 3 years ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆67Updated last year
mingfeima / pytorch_profiler_parser
parser script to process pytorch autograd profiler result, convert json file to excel.
☆15Updated 6 years ago
intel / torch-ccl
oneCCL Bindings for Pytorch*
☆102Updated 2 months ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆74Updated 3 years ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆266Updated 3 months ago
facebookexperimental / triton
Github mirror of trition-lang/triton repo.
☆92Updated this week
Emma926 / paradnn
ParaDnn: A systematic performance analysis methodology for deep learning.
☆39Updated 5 years ago
deepspeedai / DeepSpeed-Kernels
☆71Updated 7 months ago
facebookresearch / FAMBench
Benchmarks to capture important workloads.
☆31Updated 9 months ago
mlcommons / training_results_v1.0
This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.
☆37Updated last year
triton-lang / kernels
☆93Updated 11 months ago