yqhu / profiler-workshop
Example code for profiler workshop
⭐34 · Updated 3 years ago
Alternatives and similar repositories for profiler-workshop:
Users interested in profiler-workshop are comparing it to the libraries listed below.
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ⭐190 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity. ⭐71 · Updated 6 months ago
- Applied AI experiments and examples for PyTorch. ⭐251 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts. ⭐209 · Updated 4 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers. ⭐207 · Updated 7 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ⭐301 · Updated 9 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ⭐77 · Updated 5 months ago
- Cataloging released Triton kernels. ⭐212 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ⭐107 · Updated this week
- ⭐102 · Updated 7 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. ⭐337 · Updated 7 months ago
- A minimal implementation of vllm. ⭐37 · Updated 8 months ago
- Fast low-bit matmul kernels in Triton. ⭐275 · Updated this week
- Collection of kernels written in the Triton language. ⭐117 · Updated last month
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ⭐202 · Updated last year
- Latency and Memory Analysis of Transformer Models for Training and Inference. ⭐402 · Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX. ⭐222 · Updated 8 months ago
- ⭐73 · Updated 4 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models". ⭐278 · Updated 3 weeks ago
- A collection of memory-efficient attention operators implemented in the Triton language. ⭐257 · Updated 9 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ⭐92 · Updated last year
- [NeurIPS'23] Speculative Decoding with Big Little Decoder. ⭐90 · Updated last year
- Fast Hadamard transform in CUDA, with a PyTorch interface. ⭐154 · Updated 10 months ago
- A minimal cache manager for PagedAttention, on top of llama3. ⭐79 · Updated 7 months ago
- This repository contains integer operators on GPUs for PyTorch. ⭐198 · Updated last year
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. ⭐284 · Updated 2 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration. ⭐202 · Updated 4 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention. ⭐333 · Updated last week
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐234 · Updated this week
- ⭐81 · Updated 3 years ago