Accepted to MLSys 2026
☆70, updated this week
Alternatives and similar repositories for tokenweave
Users interested in tokenweave are comparing it to the repositories listed below.
- Nex Venus Communication Library (☆72, updated Nov 17, 2025)
- Prototype MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (☆26, updated Apr 4, 2025)
- A simple API to use CUPTI (☆11, updated Aug 19, 2025)
- [ICML 2025] Efficiently Serving Large Multimodal Models Using EPD Disaggregation (☆22, updated May 29, 2025)
- AccelOpt: Self-improving Agents for AI Accelerator Kernel Optimization (☆23, updated Feb 18, 2026)
- RPCNIC: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator [HPCA 2025] (☆13, updated Dec 9, 2024)
- ☆131 (updated Nov 11, 2024)
- ☆19 (updated Jun 1, 2025)
- Use the tokenizer in parallel to achieve superior acceleration (☆20, updated Mar 21, 2024)
- ☆18 (updated Mar 4, 2025)
- Quantization in the Jagged Loss Landscape of Vision Transformers (☆13, updated Oct 22, 2023)
- Deduplication over disaggregated memory for serverless computing (☆14, updated Mar 21, 2022)
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer (☆163, updated Feb 11, 2026)
- A lightweight design for computation-communication overlap (☆221, updated Jan 20, 2026)
- ☆65 (updated Apr 26, 2025)
- CUDA SGEMM optimization note (☆15, updated Oct 31, 2023)
- Intel Gaudi's Megatron-DeepSpeed for training large language models (☆18, updated Dec 19, 2024)
- A Triton JIT runtime and FFI provider in C++ (☆31, updated this week)
- Sequence-level 1F1B schedule for LLMs (☆19, updated Jun 4, 2024)
- ☆27 (updated Jan 7, 2025)
- MSCCL++: A GPU-driven communication stack for scalable AI applications (☆469, updated Feb 21, 2026)
- LazyLog: A New Shared Log Abstraction for Low-Latency Applications (☆43, updated Apr 28, 2025)
- ☆30 (updated Sep 13, 2025)
- [NAACL 2025] MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning (☆19, updated May 31, 2025)
- Manages the vllm-nccl dependency (☆17, updated Jun 3, 2024)
- ☆261 (updated Jul 11, 2024)
- Implementation repository for the SOSP '24 paper: Aceso: Achieving Efficient Fault Tolerance in Memory-Disaggregated Key-Value … (☆22, updated Oct 20, 2024)
- Self-Supervised Alignment with Mutual Information (☆20, updated May 24, 2024)
- High-performance distributed data-shuffling (all-to-all) library for MoE training and inference (☆112, updated Dec 31, 2025)
- ☆130 (updated Aug 18, 2025)
- Lightning In-Memory Object Store (☆46, updated Jan 22, 2022)
- A fast communication-overlapping library for tensor/expert parallelism on GPUs (☆1,261, updated Aug 28, 2025)
- ☆23 (updated May 10, 2023)
- ☆23 (updated Feb 12, 2025)
- TileFusion: an experimental C++ macro-kernel template library that raises the abstraction level of CUDA C for tile processing (☆106, updated Jun 28, 2025)
- Framework that reduces autotune overhead to zero for well-known deployments (☆96, updated Sep 19, 2025)
- A large-scale simulation framework for LLM inference (☆539, updated Jul 25, 2025)
- NVIDIA NVSHMEM: a parallel programming interface for NVIDIA GPUs based on OpenSHMEM that can significantly reduce multi-process com… (☆466, updated Dec 31, 2025)
- FlashTile: a CUDA Tile IR compiler compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs (☆51, updated Feb 6, 2026)