mdy666 / mdy_triton
☆152Jul 4, 2025Updated 7 months ago
Alternatives and similar repositories for mdy_triton
Users that are interested in mdy_triton are comparing it to the libraries listed below
- ☆105Sep 9, 2024Updated last year
- Implementing fp8 flash attention on the ada architecture using the cutlass repository☆79Aug 12, 2024Updated last year
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…☆30Jan 28, 2026Updated 2 weeks ago
- Source-to-Source Debuggable Derivatives in Pure Python☆15Jan 23, 2024Updated 2 years ago
- Implement Flash Attention using Cute.☆100Dec 17, 2024Updated last year
- DeeperGEMM: crazy optimized version☆74May 5, 2025Updated 9 months ago
- Distributed Compiler based on Triton for Parallel Systems☆1,350Feb 9, 2026Updated last week
- Transformers components but in Triton☆34May 9, 2025Updated 9 months ago
- Stable Diffusion+LCM on the SG2300X: silky-smooth image generation in one second☆17Nov 29, 2024Updated last year
- ☆288Updated this week
- Counting-Stars (★)☆83Nov 24, 2025Updated 2 months ago
- LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.☆858Dec 10, 2025Updated 2 months ago
- A bunch of kernels that might make stuff slower 😉☆75Updated this week
- Using FlexAttention to compute attention with different masking patterns☆47Sep 22, 2024Updated last year
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆163Updated this week
- Efficient triton implementation of Native Sparse Attention.☆263May 23, 2025Updated 8 months ago
- ☆97Mar 26, 2025Updated 10 months ago
- Puzzles for learning Triton☆2,296Nov 18, 2024Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Jun 28, 2025Updated 7 months ago
- ☆124May 28, 2024Updated last year
- Cataloging released Triton kernels.☆294Sep 9, 2025Updated 5 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer☆235Jun 15, 2025Updated 8 months ago
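- GPGPU-Sim code with Chinese annotations: contains the latest code of the GPGPU-Sim simulator, annotated in Chinese to help Chinese-speaking users better understand and use the simulator.☆28Dec 18, 2024Updated last year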
- Text2speech & tone color conversion demo running on SG2300x: TTS + instant cloning combining openvoice and emotivoice☆22Oct 30, 2024Updated last year
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"☆965Feb 5, 2026Updated last week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference☆160Oct 13, 2025Updated 4 months ago
- ☆223Nov 19, 2025Updated 2 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)☆24Jun 6, 2024Updated last year
- Adding a new backend, Cpu0, to llvm17.0.6☆12Apr 22, 2024Updated last year
- A std::execution style runtime context and High Performance RPC Transport for using OpenUCX. Including CUDA/ROCM/... devices with RDMA.☆29Updated this week
- Advanced Formal Language Theory (263-5352-00L; Spring 2023)☆10Feb 21, 2023Updated 2 years ago
- PyTorch implementation for PaLM: A Hybrid Parser and Language Model.☆10Jan 7, 2020Updated 6 years ago
- Official Implementation of ACL2023: Don't Parse, Choose Spans! Continuous and Discontinuous Constituency Parsing via Autoregressive Span …☆14Aug 25, 2023Updated 2 years ago
- WebResearcher: An Iterative Deep-Research Agent☆38Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆326Updated this week
- Hands-On Practical MLIR Tutorial☆51Aug 21, 2025Updated 5 months ago
- Puzzles for learning Triton, play it with minimal environment configuration!☆630Dec 28, 2025Updated last month
- ☆42Nov 1, 2025Updated 3 months ago
- ☆50Jun 16, 2025Updated 8 months ago