lutnn / blink-mm
☆10Updated last year
Related projects
Alternatives and complementary repositories for blink-mm
- ☆18Updated last month
- Quantized Attention on GPU☆30Updated 2 weeks ago
- ☆17Updated 4 years ago
- PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization☆25Updated 9 months ago
- play gemm with tvm☆84Updated last year
- ☆52Updated 2 weeks ago
- A TVM-like CUDA/C code generator.☆9Updated 2 years ago
- ☆31Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆85Updated 8 months ago
- Official repository for "IPDPS'24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆19Updated 8 months ago
- Manually implemented quantization-aware training☆21Updated 2 years ago
- ☆131Updated 4 months ago
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.☆156Updated this week
- Standalone Flash Attention v2 kernel without libtorch dependency☆98Updated 2 months ago
- ☆23Updated last year
- GPTQ inference TVM kernel☆36Updated 6 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆19Updated last year
- Artifact of ASPLOS'23 paper entitled: GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference☆16Updated last year
- Converting a deep neural network to integer-only inference in native C via uniform quantization and the fixed-point representation.☆21Updated 2 years ago
- A simplified flash-attention implementation using cutlass, intended for teaching purposes☆32Updated 3 months ago
- A demo of how to write a high-performance convolution that runs on Apple silicon☆52Updated 2 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.☆100Updated 11 months ago
- An Attention Superoptimizer☆20Updated 6 months ago
- TileFlow is a performance analysis tool based on Timeloop for fusion dataflows☆55Updated 7 months ago
- GPU operators for sparse tensor operations☆29Updated 8 months ago
- ☆21Updated last year
- ☆40Updated 7 months ago
- ☆24Updated 7 months ago
- MobiSys#114☆21Updated last year