DLBlas: clean and efficient kernels
☆33 Updated this week
Alternatives and similar repositories for DLBlas
Users who are interested in DLBlas are comparing it to the libraries listed below
- triton for dsa ☆57 Feb 12, 2026 Updated 2 weeks ago
- ☆17 Feb 19, 2024 Updated 2 years ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 Jan 26, 2026 Updated last month
- In our implementation of Qwen-Image-Edit, we employ block causal attention to improve inference speed. ☆37 Feb 16, 2026 Updated 2 weeks ago
- ☆43 Nov 1, 2024 Updated last year
- ☆11 Oct 31, 2024 Updated last year
- ☆22 Dec 11, 2025 Updated 2 months ago
- ☆18 Feb 16, 2025 Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆46 Jun 11, 2025 Updated 8 months ago
- ☆21 Jun 16, 2025 Updated 8 months ago
- [ICML 2025] MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design ☆22 Jul 4, 2025 Updated 7 months ago
- LaTeX Examples Document Source ☆11 Apr 9, 2024 Updated last year
- Wraps libopus in dart, and additionally provides a dart friendly API for encoding and decoding ☆16 Jul 21, 2024 Updated last year
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 Oct 11, 2024 Updated last year
- KsanaDiT: High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation ☆36 Feb 6, 2026 Updated 3 weeks ago
- ☆10 Mar 2, 2024 Updated 2 years ago
- Bagua tutorials. ☆13 Sep 4, 2022 Updated 3 years ago
- rdiv!(::AbstractMatrix, ::UpperTriangular) and ldiv!(::LowerTriangular, ::AbstractMatrix) ☆12 Nov 18, 2024 Updated last year
- Julia implementation of flash-attention operation for neural networks. ☆11 May 31, 2023 Updated 2 years ago
- EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation ☆27 Jul 30, 2025 Updated 7 months ago
- Distributed SDDMM Kernel ☆12 Jul 8, 2022 Updated 3 years ago
- An MPI wrapper for the pytorch tensor library that is automatically differentiable ☆10 Mar 27, 2023 Updated 2 years ago
- Flexible local Fourier analysis library. ☆12 Jun 22, 2021 Updated 4 years ago
- Code for "Adaptive Self-improvement LLM Agentic System for ML Library Development" (ICML 2025) ☆15 Jan 6, 2026 Updated last month
- SQL Optimizations using MLIR ☆12 Apr 5, 2020 Updated 5 years ago
- Find (filtered) local maxima. ☆16 Jan 16, 2024 Updated 2 years ago
- Distributed Communication-Optimal LU-factorization Algorithm ☆12 Aug 1, 2021 Updated 4 years ago
- ☆37 Oct 10, 2024 Updated last year
- Campus music submission and voting system. A system for electing annual school music ☆10 Feb 14, 2026 Updated 2 weeks ago
- [ACL 2024] Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact" ☆47 May 24, 2024 Updated last year
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆55 Mar 27, 2024 Updated last year
- Parallel element agglomeration algebraic multigrid upscaling and solvers. ☆16 Jul 25, 2025 Updated 7 months ago
- This repository contains the results and code for the MLPerf™ Training v3.0 benchmark. ☆12 Aug 10, 2023 Updated 2 years ago
- ☆11 Jun 4, 2021 Updated 4 years ago
- Expert Specialization MoE Solution based on CUTLASS ☆27 Jan 19, 2026 Updated last month
- Automatic differentiation of FEniCS and Firedrake models in Julia ☆13 Mar 21, 2021 Updated 4 years ago
- A local search system implementation using Elasticsearch for Wikipedia data indexing and retrieval. ☆12 May 17, 2025 Updated 9 months ago
- Official Pytorch implementation of "Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models" [IEEE ICASSP 202… ☆29 Jan 18, 2026 Updated last month
- ☆13 Apr 2, 2024 Updated last year