lcy-seso / DLFrameworkTest
My tests and experiments with some popular DL frameworks.
☆11Updated last month
Related projects
Alternatives and complementary repositories for DLFrameworkTest
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated 6 months ago
- modified cutlass☆14Updated 4 years ago
- Artifacts for SOSP'19 paper Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions☆21Updated 2 years ago
- Optimize GEMM with tensorcore step by step☆15Updated 11 months ago
- An experimental ahead of time compiler for Relay.☆51Updated 4 years ago
- ☆11Updated 3 years ago
- Static analysis framework for analyzing programs written in TVM's Relay IR.☆27Updated 5 years ago
- Emulating DMA Engines on GPUs for Performance and Portability☆34Updated 9 years ago
- An external memory allocator example for PyTorch.☆13Updated 3 years ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆19Updated last year
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.☆11Updated last month
- ☆16Updated this week
- An Attention Superoptimizer☆20Updated 6 months ago
- ☆18Updated last month
- A llama model inference framework implemented in CUDA C++☆24Updated 2 weeks ago
- ☆23Updated 9 months ago
- ☆22Updated 4 years ago
- PTX-EMU is a simple emulator for CUDA programs.☆24Updated 10 months ago
- A demo of how to write a high-performance convolution that runs on Apple silicon☆52Updated 2 years ago
- Visualize TVM Relay program graph☆12Updated 5 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆85Updated 8 months ago
- study of cutlass☆19Updated last week
- An IR for efficiently simulating distributed ML computation.☆25Updated 10 months ago
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.☆156Updated this week
- An MLIR-based toy DL compiler for TVM Relay.☆53Updated 2 years ago
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation☆26Updated 5 years ago
- Optimize tensor programs fast with Felix, a gradient descent autotuner.☆19Updated 6 months ago