lemyx / tilelang-dsa
DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang
☆43 · Nov 19, 2025 · Updated 2 months ago
Alternatives and similar repositories for tilelang-dsa
Users that are interested in tilelang-dsa are comparing it to the libraries listed below
- Debug print operator for cudagraph debugging ☆14 · Aug 2, 2024 · Updated last year
- ☆35 · Mar 7, 2025 · Updated 11 months ago
- DeeperGEMM: crazy optimized version ☆73 · May 5, 2025 · Updated 9 months ago
- Tutorial Exercises and Code for GPU Communications Tutorial at HOT Interconnects 2025 ☆27 · Oct 22, 2025 · Updated 3 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- My tests and experiments with some popular dl frameworks. ☆17 · Sep 11, 2025 · Updated 5 months ago
- Noisy language compiler ☆17 · Jul 31, 2024 · Updated last year
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆46 · Jan 21, 2026 · Updated 3 weeks ago
- Building the Virtuous Cycle for AI-driven LLM Systems ☆164 · Updated this week
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ☆19 · Updated this week
- Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system. ☆122 · Jan 1, 2026 · Updated last month
- ☆39 · Dec 14, 2025 · Updated last month
- ☆38 · Aug 7, 2025 · Updated 6 months ago
- MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr… ☆50 · Updated this week
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Jul 21, 2023 · Updated 2 years ago
- An Open-Source RAG Workload Trace to Optimize RAG Serving Systems ☆35 · Nov 18, 2025 · Updated 2 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Sep 13, 2025 · Updated 5 months ago
- ☆88 · May 31, 2025 · Updated 8 months ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆80 · Dec 18, 2025 · Updated last month
- python package of rocm-smi-lib ☆24 · Dec 15, 2025 · Updated last month
- An experimental communicating attention kernel based on DeepEP. ☆35 · Jul 29, 2025 · Updated 6 months ago
- Graphiler is a compiler stack built on top of DGL and TorchScript which compiles GNNs defined using user-defined functions (UDFs) into ef… ☆59 · Oct 3, 2022 · Updated 3 years ago
- Flexible and Pluggable Serving Engine for Diffusion LLMs ☆56 · Updated this week
- Official Code Repository for the paper "Key-value memory in the brain" ☆31 · Feb 25, 2025 · Updated 11 months ago
- NVIDIA cuTile learn ☆158 · Dec 9, 2025 · Updated 2 months ago
- SimplePIM is the first high-level programming framework for real-world processing-in-memory (PIM) architectures. Described in the PACT 20… ☆31 · Oct 23, 2023 · Updated 2 years ago
- Lightweight Non-Parametric Embedding Fine-Tuning ☆40 · Sep 13, 2025 · Updated 5 months ago
- ☆77 · Nov 5, 2024 · Updated last year
- Transformers components but in Triton ☆34 · May 9, 2025 · Updated 9 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆462 · Dec 31, 2025 · Updated last month
- ☆38 · Jul 19, 2025 · Updated 6 months ago
- a size profiler for cuda binary ☆72 · Jan 15, 2026 · Updated 3 weeks ago
- ☆25 · Sep 1, 2025 · Updated 5 months ago
- Cookbook of SGLang - Recipe ☆73 · Updated this week
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches. EMNLP Findings 2024 ☆88 · Feb 27, 2025 · Updated 11 months ago
- Framework to reduce autotune overhead to zero for well known deployments. ☆96 · Sep 19, 2025 · Updated 4 months ago
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆283 · May 1, 2025 · Updated 9 months ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 4 months ago
- ☆12 · Jul 4, 2024 · Updated last year