HabanaAI / gaudi-pytorch-bridge
☆17 · Updated 3 months ago
Alternatives and similar repositories for gaudi-pytorch-bridge
Users interested in gaudi-pytorch-bridge are comparing it to the libraries listed below.
- ☆152 · Updated 11 months ago
- ☆165 · Updated 7 months ago
- SYCL* Templates for Linear Algebra (SYCL*TLA) - a SYCL-based CUTLASS implementation for Intel GPUs · ☆59 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆85 · Updated this week
- GitHub mirror of the triton-lang/triton repo. · ☆109 · Updated this week
- ☆253 · Updated last year
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity · ☆230 · Updated 2 years ago
- ☆110 · Updated last year
- OpenAI Triton backend for Intel® GPUs · ☆222 · Updated this week
- ☆32 · Updated 3 years ago
- A lightweight design for computation-communication overlap. · ☆200 · Updated 2 months ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) for deep learning on Tensor Cores. · ☆90 · Updated 3 years ago
- Artifact from "Hardware Compute Partitioning on NVIDIA GPUs". This is a fork of Bakita's repo; I am not one of the authors of the paper. · ☆47 · Updated last month
- Optimize GEMM with tensorcore step by step · ☆36 · Updated 2 years ago
- ☆50 · Updated 6 years ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. · ☆70 · Updated last year
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA · ☆35 · Updated 5 years ago
- Benchmark code for the "Online normalizer calculation for softmax" paper · ☆103 · Updated 7 years ago
- ☆51 · Updated 9 months ago
- ☆83 · Updated 3 years ago
- ☆112 · Updated 7 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… · ☆63 · Updated 5 months ago
- ☆103 · Updated last year
- ☆69 · Updated 6 months ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. · ☆69 · Updated 9 months ago
- Artifacts of EVT ASPLOS'24 · ☆28 · Updated last year
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling · ☆66 · Updated last year
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. · ☆71 · Updated last week
- Shared Middle-Layer for Triton Compilation · ☆321 · Updated 2 weeks ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer · ☆147 · Updated 3 months ago