☆38 · Mar 14, 2024 · Updated last year
Alternatives and similar repositories for nomad-dist
Users who are interested in nomad-dist compare it to the libraries listed below.
- ☆17 · Jul 24, 2023 · Updated 2 years ago
- Residual vector quantization for KV cache compression in large language models · ☆12 · Oct 22, 2024 · Updated last year
- Tools and APIs to develop weavers for the LARA language (LARA Compiler, LARA Interpreter, Weaver Generator, etc.) · ☆16 · Feb 5, 2026 · Updated last month
- Multi-branch model for concurrent execution · ☆18 · Jun 27, 2023 · Updated 2 years ago
- This adds partial support for AVX2 and AVX-512 to gem5 · ☆15 · Dec 19, 2023 · Updated 2 years ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores · ☆73 · Sep 8, 2024 · Updated last year
- ☆14 · Nov 20, 2022 · Updated 3 years ago
- Starlight: A Kernel Optimizer for GPU Processing · ☆16 · Jan 10, 2024 · Updated 2 years ago
- The official implementation of the DAC 2024 paper GQA-LUT · ☆21 · Dec 20, 2024 · Updated last year
- ☆20 · Sep 28, 2024 · Updated last year
- Code for the paper: https://arxiv.org/pdf/2309.06979.pdf · ☆21 · Jul 29, 2024 · Updated last year
- Opara is a lightweight and resource-aware DNN Operator parallel scheduling framework to accelerate the execution of DNN inference on GPUs… · ☆23 · Dec 19, 2024 · Updated last year
- ☆24 · Feb 20, 2024 · Updated 2 years ago
- ☆32 · Apr 2, 2025 · Updated 11 months ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache · ☆81 · Dec 18, 2025 · Updated 2 months ago
- ☆33 · Mar 31, 2025 · Updated 11 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs · ☆31 · Apr 2, 2025 · Updated 11 months ago
- ☆38 · Updated this week
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models · ☆35 · Jun 12, 2024 · Updated last year
- ☆71 · Mar 26, 2025 · Updated 11 months ago
- CSiBE · ☆34 · Feb 17, 2022 · Updated 4 years ago
- Distributed ML Training Benchmarks · ☆27 · Mar 1, 2023 · Updated 3 years ago
- Low-bit LLM inference on CPU/NPU with lookup tables · ☆927 · Jun 5, 2025 · Updated 9 months ago
- EdgeCortix-maintained and extended fork of the Apache TVM compiler stack used by the MERA framework. TVM is an open deep learning compiler st… · ☆11 · Dec 22, 2023 · Updated 2 years ago
- PetPS: Supporting Huge Embedding Models with Tiered Memory · ☆33 · May 21, 2024 · Updated last year
- GNU toolchain for Xuantie RISC-V CPU, including GCC and Binutils …… · ☆108 · Apr 12, 2025 · Updated 10 months ago
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores · ☆343 · Dec 28, 2024 · Updated last year
- ☆165 · Jun 22, 2025 · Updated 8 months ago
- Microarchitecture implementation of the decoupled vector-fetch accelerator · ☆162 · Jan 25, 2024 · Updated 2 years ago
- PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization · ☆36 · Feb 21, 2024 · Updated 2 years ago
- ☆12 · May 23, 2024 · Updated last year
- ☆15 · Dec 11, 2024 · Updated last year
- Digital Innovation Festival React TypeScript workshop · ☆10 · Jan 6, 2023 · Updated 3 years ago
- A curated list of awesome Gemini CLI extensions · ☆35 · Feb 4, 2026 · Updated last month
- ☆22 · Dec 23, 2025 · Updated 2 months ago
- C++ Hough Forests with OpenCV · ☆11 · Jul 28, 2016 · Updated 9 years ago
- Notes and examples for getting started with parallel computing in CUDA · ☆13 · Nov 1, 2019 · Updated 6 years ago
- Yet another Linux distro for RISC-V · ☆13 · Dec 25, 2025 · Updated 2 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ☆388 · Apr 13, 2025 · Updated 10 months ago