yanghaku / tvm-rt-wasm
A high-performance, tiny TVM graph executor library written in C that compiles to WebAssembly and can use CUDA or WebGPU as the accelerator.
☆9Updated last year
Related projects
Alternatives and complementary repositories for tvm-rt-wasm
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆22Updated last month
- ☆11Updated 3 years ago
- Experiments and prototypes associated with IREE or MLIR☆49Updated 3 months ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago
- Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.☆11Updated last year
- Triton to TVM transpiler.☆16Updated last month
- ☆152Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆35Updated 6 months ago
- ☆40Updated 3 years ago
- ☆18Updated last month
- Open deep learning compiler stack for cpu, gpu and specialized accelerators☆33Updated last year
- Play with MLIR right in your browser☆124Updated last year
- PTX-EMU is a simple emulator for CUDA programs.☆24Updated 10 months ago
- A new memory mapping interface for efficient direct user-space access to byte-addressable storage, published in MICRO 2022.☆14Updated 2 years ago
- [CF ’20] Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs☆15Updated 3 years ago
- Standalone Flash Attention v2 kernel without libtorch dependency☆98Updated 2 months ago
- Noisy language compiler☆17Updated 3 months ago
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆19Updated last year
- ☆128Updated this week
- PSTensor provides a way to hack the memory management of tensors in TensorFlow and PyTorch by defining your own C++ Tensor Class.☆10Updated 2 years ago
- GPU Performance Advisor☆63Updated 2 years ago
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches☆13Updated 5 years ago
- Fast and memory-efficient exact attention☆28Updated 2 weeks ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr…☆8Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆85Updated 8 months ago
- ☆13Updated 6 months ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19Updated 6 months ago
- Handy tools & graphics API abstraction for blazing fast prototyping☆9Updated 9 months ago
- ☆17Updated 2 weeks ago
- IREE C++ Template☆16Updated 3 months ago