nicksypark / rope-triton
☆10Updated 5 months ago
Related projects:
- Triangle in practice! Triton☆14Updated 7 months ago
- Study Group of Deep Learning Compiler☆149Updated last year
- Collection of kernels written in Triton language☆48Updated 2 weeks ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.☆98Updated 9 months ago
- ☆186Updated 2 years ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference☆106Updated 6 months ago
- ☆32Updated last year
- A block oriented training approach for inference time optimization.☆26Updated last month
- An experimental CPU backend for Triton☆36Updated last week
- ☆30Updated 9 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface☆87Updated 3 months ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder☆84Updated 7 months ago
- ☆21Updated last year
- ☆138Updated 2 months ago
- ☆14Updated 7 months ago
- ☆151Updated last year
- Memory Optimizations for Deep Learning (ICML 2023)☆58Updated 6 months ago
- A performance library for machine learning applications.☆178Updated 11 months ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model…☆50Updated 6 months ago
- ☆113Updated last year
- Applied AI experiments and examples for PyTorch☆123Updated last month
- Official implementation of the ICLR 2024 paper AffineQuant☆16Updated 5 months ago
- ☆50Updated 3 months ago
- ☆101Updated last year
- ☆124Updated last week
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆30Updated 4 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs☆72Updated last month
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.☆81Updated 2 months ago
- Study parallel programming - CUDA, OpenMP, MPI, Pthread☆54Updated 2 years ago
- ☆27Updated 3 weeks ago