mit-han-lab / pruning-sparsity-publications
☆23 · Updated last year

Alternatives and similar repositories for pruning-sparsity-publications, as compared by interested users:
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores ☆51 · Updated last year
- LLaMA INT4 CUDA inference with AWQ ☆54 · Updated 4 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆110 · Updated 8 months ago
- GPU operators for sparse tensor operations ☆32 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆70 · Updated 11 months ago
- A curated list of high-quality papers on resource-efficient LLMs 🌱 ☆122 · Updated 2 months ago
- GPTQ inference TVM kernel ☆40 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆88 · Updated last week
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆92 · Updated last week
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆47 · Updated 2 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆107 · Updated last month
- Penn CIS 5650 (GPU Programming and Architecture) Final Project ☆31 · Updated last year
- Summary of system papers/frameworks/code/tools on training or serving large models ☆57 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆25 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs ☆163 · Updated 10 months ago
- Implement Flash Attention using CuTe. ☆85 · Updated 5 months ago
- pytorch-profiler ☆51 · Updated 2 years ago
- LLM theoretical performance analysis tool supporting parameter, FLOPs, memory, and latency analysis ☆92 · Updated last week
- The official PyTorch implementation of the NeurIPS 2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer L… ☆47 · Updated 2 years ago
- [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The truth is rarely pure and never simple. ☆24 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM. ☆93 · Updated last week
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 · Updated last year
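Several of the repositories above center on N:M structured sparsity, the pattern targeted by the vectorized N:M format and SpInfer entries. As a quick illustration of the concept (not code from any listed repo — the function name `prune_n_m` is ours), here is a minimal NumPy sketch of magnitude-based 2:4 pruning: in each group of four consecutive weights, only the two largest-magnitude values are kept, which is the fixed pattern that NVIDIA sparse tensor cores accelerate.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude values in each group of m
    consecutive elements (2:4 by default). Assumes weights.size % m == 0."""
    w = weights.reshape(-1, m).copy()
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.arange(1, 9, dtype=np.float32).reshape(2, 4)
# Each row keeps its two largest values: [[0,0,3,4], [0,0,7,8]]
print(prune_n_m(w))
```

Real deployments additionally store the surviving values in a compressed layout plus per-group metadata indices, so the hardware can skip the zeros; the sketch above only produces the mask pattern.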