JimyMa / FuncTs
[DAC2024] A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning
☆15Updated last year
Alternatives and similar repositories for FuncTs
Users interested in FuncTs are comparing it to the libraries listed below.
- ☆42Updated last year
- ☆117Updated last month
- GitHub mirror of the triton-lang/triton repo.☆54Updated this week
- DeepSeek-V3/R1 inference performance simulator☆165Updated 5 months ago
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)☆53Updated last year
- A baseline repository of Auto-Parallelism in Training Neural Networks☆145Updated 3 years ago
- An easy-to-understand TensorOp Matmul tutorial☆376Updated 11 months ago
- WaferLLM: Large Language Model Inference at Wafer Scale☆49Updated last month
- Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM☆63Updated 3 weeks ago
- Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators☆115Updated 2 years ago
- Summary of some awesome work for optimizing LLM inference☆103Updated 3 months ago
- ☆28Updated last year
- Development repository for the Triton-Linalg conversion☆197Updated 6 months ago
- ☆84Updated 5 months ago
- GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving☆18Updated last month
- High performance Transformer implementation in C++.☆129Updated 7 months ago
- ☆81Updated 2 years ago
- A lightweight design for computation-communication overlap.☆161Updated this week
- ☆150Updated last year
- Compiler for Dynamic Neural Networks☆46Updated last year
- ☆23Updated 5 months ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆89Updated 2 years ago
- A benchmark suite tailored for deep learning operators☆42Updated 2 years ago
- nnScaler: Compiling DNN models for Parallel Training☆118Updated this week
- gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling☆39Updated 2 weeks ago
- LLM serving cluster simulator☆108Updated last year
- ☆13Updated last year
- LLM Inference analyzer for different hardware platforms☆87Updated last month
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …☆249Updated 2 months ago
- Examples of CUDA implementations by Cutlass CuTe☆225Updated 2 months ago