HabanaAI / gaudi-pytorch-bridge
☆17Updated last week
Alternatives and similar repositories for gaudi-pytorch-bridge
Users who are interested in gaudi-pytorch-bridge are comparing it to the libraries listed below.
- ☆110Updated last year
- ☆165Updated 8 months ago
- ☆256Updated last year
- GitHub mirror of triton-lang/triton repo.☆119Updated last week
- A lightweight design for computation-communication overlap.☆209Updated 3 weeks ago
- ☆112Updated 8 months ago
- ☆158Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs☆86Updated this week
- ☆32Updated 3 years ago
- Optimize GEMM with tensorcore step by step☆36Updated 2 years ago
- ☆50Updated 6 years ago
- ☆164Updated last year
- [HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.☆76Updated 3 weeks ago
- ☆49Updated last year
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆70Updated last year
- ☆154Updated last year
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆91Updated 3 years ago
- ☆52Updated 10 months ago
- An extension library of WMMA API (Tensor Core API)☆109Updated last year
- OpenAI Triton backend for Intel® GPUs☆223Updated this week
- ☆41Updated 2 months ago
- CUTLASS and CuTe Examples☆117Updated last month
- Shared Middle-Layer for Triton Compilation☆323Updated last month
- ☆101Updated last year
- ☆104Updated last year
- Benchmark code for the "Online normalizer calculation for softmax" paper☆105Updated 7 years ago
- ☆83Updated 7 months ago
- SYCL* Templates for Linear Algebra (SYCL*TLA) - SYCL based CUTLASS implementation for Intel GPUs☆62Updated this week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆153Updated 4 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity☆231Updated 2 years ago