Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
☆34Feb 10, 2025Updated last year
Alternatives and similar repositories for Tacker
Users that are interested in Tacker are comparing it to the libraries listed below
- GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving☆20Jul 30, 2025Updated 7 months ago
- Artifacts for SOSP'19 paper Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions☆21Apr 15, 2022Updated 3 years ago
- FTPipe and related pipeline model parallelism research.☆44May 16, 2023Updated 2 years ago
- Implementation of Hyena Hierarchy in JAX☆10Apr 30, 2023Updated 2 years ago
- Reproduction of the libsmctrl paper, with an added Python-side interface that can be called flexibly from Python to allocate compute resources☆12May 21, 2024Updated last year
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)☆56May 29, 2024Updated last year
- ☆12May 24, 2022Updated 3 years ago
- Depict GPU memory footprint during DNN training of PyTorch☆11Nov 17, 2022Updated 3 years ago
- Studying GPU Multi-tenancy☆11Jan 11, 2019Updated 7 years ago
- ☆18Mar 4, 2025Updated 11 months ago
- Release doc/tutorial/wheels for poseidon-tf☆10Jan 18, 2018Updated 8 years ago
- An experimental tool to modify YAMLs without losing (most of) comment lines.☆16Sep 25, 2022Updated 3 years ago
- ☆68Jun 23, 2025Updated 8 months ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling☆21Feb 9, 2026Updated 3 weeks ago
- ☆13Nov 1, 2021Updated 4 years ago
- Distributed DRL by Ray and TensorFlow Tutorial.☆10Dec 26, 2019Updated 6 years ago
- A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches☆15Jun 21, 2019Updated 6 years ago
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive☆66Dec 11, 2025Updated 2 months ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.☆14Nov 23, 2024Updated last year
- Tutorials on extending and importing TVM with CMake include dependencies.☆16Oct 11, 2024Updated last year
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23)☆34Nov 11, 2023Updated 2 years ago
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications☆127May 9, 2022Updated 3 years ago
- Handwritten GEMM using Intel AMX (Advanced Matrix Extension)☆17Jan 11, 2025Updated last year
- Machine Learning Inference Graph Spec☆21Jul 27, 2019Updated 6 years ago
- ☆115Nov 17, 2023Updated 2 years ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators☆19Feb 24, 2026Updated last week
- ☆16May 4, 2021Updated 4 years ago
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling☆68May 1, 2024Updated last year
- Model-less Inference Serving☆94Nov 4, 2023Updated 2 years ago
- Experiments evaluating preemption on the NVIDIA Pascal architecture☆17Nov 10, 2016Updated 9 years ago
- ☆19Aug 26, 2021Updated 4 years ago
- Official repository for "IPDPS'24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆20Feb 23, 2024Updated 2 years ago
- POC implementation of "Accelerating HE Operations Using Key Decomposition"[KLSS23]☆18Jun 11, 2025Updated 8 months ago
- TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.☆19May 12, 2024Updated last year
- ☆19Nov 22, 2017Updated 8 years ago
- ☆38Jun 27, 2025Updated 8 months ago
- ☆78May 4, 2021Updated 4 years ago
- ☆20Sep 28, 2024Updated last year
- C++ Compile-Time eValuator for scheme☆21Jun 29, 2020Updated 5 years ago