NVIDIA/cuEmbed

CUDA Embedding Lookup Kernel Library

☆48

Alternatives and similar repositories for cuEmbed

Users who are interested in cuEmbed are comparing it to the libraries listed below.

KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆34 · Jul 29, 2025 · Updated 11 months ago
nicolaswilde / amx-gemm-handwritten
Handwritten GEMM using Intel AMX (Advanced Matrix Extensions)
☆17 · Jan 11, 2025 · Updated last year
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆13 · Nov 23, 2024 · Updated last year
tile-ai / AttentionEngine
☆52 · May 19, 2025 · Updated last year
PASSIONLab / distributed_sddmm
Distributed SDDMM Kernel
☆12 · Jul 8, 2022 · Updated 4 years ago
flashinfer-ai / cubloaty
A size profiler for CUDA binaries
☆71 · Jan 15, 2026 · Updated 6 months ago
Dao-AILab / gemm-cublas
☆22 · May 5, 2025 · Updated last year
Oneflow-Inc / dfccl
☆27 · Feb 17, 2025 · Updated last year
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
Bruce-Lee-LY / cuda_auto_tune
NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels.
☆23 · Apr 10, 2026 · Updated 3 months ago
HydraQYH / hp_rms_norm
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 6 months ago
NVIDIA-Merlin / distributed-embeddings
distributed-embeddings is a library for building large embedding-based models in TensorFlow 2.
☆47 · Oct 17, 2023 · Updated 2 years ago
NVIDIA / nv-embedding-cache
Fast hierarchical embedding cache for recommenders
☆22 · Jul 5, 2026 · Updated 2 weeks ago
KuangjuX / cuda-evolve-oss
Autonomous GPU kernel optimization system driven by AI agents.
☆31 · Mar 29, 2026 · Updated 3 months ago
flagos-ai / libtriton_jit
A Triton JIT runtime and FFI provider in C++
☆37 · Updated this week
vllm-project / tml-fa4
FA4-based Relative Attention Kernel developed by TML and Colfax
☆17 · Jul 17, 2026 · Updated last week
cherichy / tilecute
☆32 · Jul 2, 2025 · Updated last year
Triang-jyed-driung / i8muon
Muon in Int8 Precision Made Possible
☆20 · Jun 18, 2026 · Updated last month
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 · Nov 22, 2024 · Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
☆195 · Feb 11, 2026 · Updated 5 months ago
gty111 / PTX-EMU
PTX-EMU is a simple emulator for CUDA programs.
☆40 · Apr 25, 2025 · Updated last year
catswe / LinearKAN
LinearKAN: A very fast implementation of Kolmogorov-Arnold Networks
☆20 · Nov 12, 2025 · Updated 8 months ago
facebookexperimental / CUTracer
A dynamic binary instrumentation tool for tracing and analyzing CUDA kernel instructions.
☆72 · Updated this week
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆115 · Jun 28, 2025 · Updated last year
NVIDIA / ACCV-Lab
Accelerated Computer Vision Lab (ACCV-Lab) is a systematic collection of packages with the common goal to facilitate end-to-end efficient…
☆61 · Updated this week
melonedo / algebraic-layouts
☆23 · Aug 20, 2025 · Updated 11 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆86 · May 5, 2025 · Updated last year
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆32 · Updated this week
NVIDIA / CompileIQ
An optimizer for NVIDIA compilers.
☆110 · Jul 3, 2026 · Updated 3 weeks ago
muriloboratto / NVSHEMEM
Sample code using NVSHMEM on multiple GPUs
☆30 · Jan 22, 2023 · Updated 3 years ago
YJMSTR / flash-linear-attention
FLA but cuTile
☆27 · Apr 17, 2026 · Updated 3 months ago
JiangLiSJTU / token-ring
☆13 · Jan 7, 2025 · Updated last year
HydraQYH / expert_specialization_moe
Expert Specialization MoE Solution based on CUTLASS
☆27 · Apr 14, 2026 · Updated 3 months ago
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆33 · Feb 24, 2025 · Updated last year
NVIDIA-Merlin / HierarchicalKV
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of…
☆208 · May 22, 2026 · Updated 2 months ago
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
☆75 · May 9, 2025 · Updated last year
xlite-dev / netron-vscode-extension
☕️ A VS Code extension for Netron, supporting *.pdmodel, *.nb, *.onnx, *.pb, *.h5, *.tflite, *.pth, *.pt, *.mnn, *.param, etc.
☆14 · Jun 4, 2023 · Updated 3 years ago
GeeeekExplorer / kkbot
A Feishu/Lark AI agent bot
☆15 · Feb 27, 2026 · Updated 4 months ago
NVIDIA / workbench-example-downloadable-nim
An NVIDIA AI Workbench example project for building with a downloadable NVIDIA NIM
☆16 · Jun 20, 2026 · Updated last month