microsoft / FractalTensor
☆22 · Updated 3 weeks ago
Alternatives and similar repositories for FractalTensor:
Users that are interested in FractalTensor are comparing it to the libraries listed below
- ☆36 · Updated this week
- ☆19 · Updated 3 months ago
- [WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, 1.8x~3x faster vs SDPA EA. ☆44 · Updated this week
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- Triton to TVM transpiler. ☆19 · Updated 3 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆22 · Updated 3 months ago
- ☆21 · Updated last week
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆18 · Updated 3 years ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance ☆43 · Updated this week
- Optimize tensor program fast with Felix, a gradient descent autotuner. ☆24 · Updated 8 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆17 · Updated 2 weeks ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆87 · Updated 10 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo ☆19 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆15 · Updated this week
- Framework to reduce autotune overhead to zero for well known deployments. ☆57 · Updated last month
- ☆11 · Updated 3 years ago
- TensorRT LLM Benchmark Configuration ☆12 · Updated 5 months ago
- Artifacts of EVT ASPLOS'24 ☆22 · Updated 10 months ago
- Implement Flash Attention using Cute. ☆65 · Updated last month
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust. ☆14 · Updated last month
- Standalone Flash Attention v2 kernel without libtorch dependency ☆99 · Updated 4 months ago
- ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch ☆30 · Updated 5 months ago
- Quantized Attention on GPU ☆34 · Updated last month
- ☆25 · Updated 10 months ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure. ☆19 · Updated 8 months ago
- PTX-EMU is a simple emulator for CUDA programs. ☆26 · Updated last year
- play gemm with tvm ☆85 · Updated last year
- Debug print operator for cudagraph debugging ☆10 · Updated 5 months ago
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches ☆14 · Updated 5 years ago
- An extension of TVMScript to write simple and high performance GPU kernels with tensorcore. ☆51 · Updated 5 months ago