intel / intel-extension-for-deepspeedLinks

Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note XPU is already supported in stock DeepSpeed (upstream).

☆63

Alternatives and similar repositories for intel-extension-for-deepspeed

Users that are interested in intel-extension-for-deepspeed are comparing it to the libraries listed below

Sorting:

intel / torch-ccl
oneCCL Bindings for Pytorch*
☆102Updated 2 months ago
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆212Updated this week
ROCm / TransformerEngine
☆51Updated this week
intel / torch-xpu-ops
☆60Updated this week
ROCm / rccl-tests
RCCL Performance Benchmark Tests
☆75Updated last week
uxlfoundation / oneCCL
oneAPI Collective Communications Library (oneCCL)
☆245Updated this week
NVIDIA / Fuser
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
☆358Updated this week
intel / sycl-tla
SYCL* Templates for Linear Algebra (SYCL*TLA) - SYCL based CUTLASS implementation for Intel GPUs
☆41Updated this week
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆80Updated last week
Azure / msccl
Microsoft Collective Communication Library
☆66Updated 11 months ago
ROCm / rocSHMEM
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
☆121Updated this week
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆106Updated last year
intel / xetla
☆63Updated 10 months ago
deepspeedai / DeepSpeed-Kernels
☆72Updated 7 months ago
meta-pytorch / triton-cpu
An experimental CPU backend for Triton (https//github.com/openai/triton)
☆45Updated 2 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆425Updated this week
north-numerical-computing / tensor-cores-numerical-behavior
Test suite for probing the numerical behavior of NVIDIA tensor cores
☆41Updated last year
ROCm / composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
☆478Updated this week
facebookresearch / param
PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…
☆153Updated last week
facebookexperimental / triton
Github mirror of trition-lang/triton repo.
☆86Updated last week
mk1-project / quickreduce
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
☆33Updated last month
vllm-project / flash-attention
Fast and memory-efficient exact attention
☆96Updated last week
ROCm / rccl
ROCm Communication Collectives Library (RCCL)
☆389Updated last week
ROCm / triton
Development repository for the Triton language and compiler
☆137Updated this week
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆266Updated 3 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181Updated 2 weeks ago
HabanaAI / Model-References
Reference models for Intel(R) Gaudi(R) AI Accelerator
☆165Updated last month
ROCm / aiter
AI Tensor Engine for ROCm
☆287Updated this week
triton-lang / kernels
☆92Updated 11 months ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆154Updated this week