tonyzhang617 / nomad-dist
☆38 · Mar 14, 2024 · Updated last year
Alternatives and similar repositories for nomad-dist
Users interested in nomad-dist are comparing it to the libraries listed below.
- ☆16 · Jul 24, 2023 · Updated 2 years ago
- Residual vector quantization for KV cache compression in large language models · ☆11 · Oct 22, 2024 · Updated last year
- Tools and APIs to develop weavers for the LARA language (LARA Compiler, LARA Interpreter, Weaver Generator, etc.) · ☆16 · Feb 5, 2026 · Updated last week
- Differentiable Clustering with Perturbed Random Forests, NeurIPS 2023 · ☆13 · Oct 16, 2023 · Updated 2 years ago
- Multi-branch model for concurrent execution · ☆18 · Jun 27, 2023 · Updated 2 years ago
- Adds partial support for AVX2 and AVX-512 to gem5. · ☆15 · Dec 19, 2023 · Updated 2 years ago
- Lightning Training strategy for HiveMind · ☆18 · Jan 20, 2026 · Updated 3 weeks ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores. · ☆72 · Sep 8, 2024 · Updated last year
- Starlight: A Kernel Optimizer for GPU Processing · ☆16 · Jan 10, 2024 · Updated 2 years ago
- The official implementation of the DAC 2024 paper GQA-LUT · ☆20 · Dec 20, 2024 · Updated last year
- ☆83 · Apr 1, 2024 · Updated last year
- ☆20 · Sep 28, 2024 · Updated last year
- Code for the paper: https://arxiv.org/pdf/2309.06979.pdf · ☆21 · Jul 29, 2024 · Updated last year
- ☆31 · Apr 2, 2025 · Updated 10 months ago
- Opara is a lightweight and resource-aware DNN Operator parallel scheduling framework to accelerate the execution of DNN inference on GPUs… · ☆23 · Dec 19, 2024 · Updated last year
- Memory-Bounded GPU Acceleration for Vector Search · ☆33 · Dec 29, 2025 · Updated last month
- ☆24 · Feb 20, 2024 · Updated last year
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. · ☆80 · Dec 18, 2025 · Updated last month
- ☆33 · Mar 31, 2025 · Updated 10 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization · ☆112 · Oct 15, 2024 · Updated last year
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. · ☆32 · Apr 2, 2025 · Updated 10 months ago
- ☆38 · Jul 9, 2024 · Updated last year
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models · ☆35 · Jun 12, 2024 · Updated last year
- Parallel network flows using OpenMP and CUDA. · ☆28 · Nov 21, 2018 · Updated 7 years ago
- ☆71 · Mar 26, 2025 · Updated 10 months ago
- CSiBE · ☆34 · Feb 17, 2022 · Updated 4 years ago
- Distributed ML Training Benchmarks · ☆27 · Mar 1, 2023 · Updated 2 years ago
- Low-bit LLM inference on CPU/NPU with lookup tables · ☆923 · Jun 5, 2025 · Updated 8 months ago
- PetPS: Supporting Huge Embedding Models with Tiered Memory · ☆33 · May 21, 2024 · Updated last year
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores · ☆342 · Dec 28, 2024 · Updated last year
- Microarchitecture implementation of the decoupled vector-fetch accelerator · ☆163 · Jan 25, 2024 · Updated 2 years ago
- PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization · ☆36 · Feb 21, 2024 · Updated last year
- [ICLR 2025] Official Code Release for Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation · ☆49 · Mar 1, 2025 · Updated 11 months ago
- C++ Hough Forests with OpenCV · ☆11 · Jul 28, 2016 · Updated 9 years ago
- ☆15 · Dec 11, 2024 · Updated last year
- ☆25 · Nov 12, 2025 · Updated 3 months ago
- ☆22 · Dec 23, 2025 · Updated last month
- Yet another Linux distro for RISC-V. · ☆13 · Dec 25, 2025 · Updated last month
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. · ☆46 · Jun 11, 2025 · Updated 8 months ago