fire / pytorch-nncp
☆10 · Updated 2 years ago
Alternatives and similar repositories for pytorch-nncp:
Users interested in pytorch-nncp are comparing it to the libraries listed below.
- Dzip: improved general-purpose lossless compression based on novel neural network modeling ☆70 · Updated 3 years ago
- This repository contains the source code and dataset link mentioned in WWW 2022 accepted paper "TRACE: A Fast Transformer-based General-Pu… ☆28 · Updated 3 years ago
- An implementation of LLMzip using GPT-2 ☆12 · Updated last year
- ☆50 · Updated 3 months ago
- Hutter Prize Submission ☆26 · Updated 6 months ago
- ☆13 · Updated last year
- Hutter Prize Submission ☆14 · Updated 3 years ago
- QuIP quantization ☆51 · Updated last year
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- Fraunhofer Neural Network Encoder/Decoder (NNCodec) ☆77 · Updated last year
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆340 · Updated 8 months ago
- ☆132 · Updated 7 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆208 · Updated 5 months ago
- ☆219 · Updated 10 months ago
- Code for paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆35 · Updated last year
- Explorations into some recent techniques surrounding speculative decoding ☆259 · Updated 4 months ago
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- Here we will test various linear attention designs. ☆60 · Updated last year
- The implementation for MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning" ☆44 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers ☆47 · Updated last year
- Code for paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆363 · Updated last year
- Fast and low-memory attention layer written in CUDA ☆17 · Updated last year
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆123 · Updated 8 months ago
- ☆126 · Updated last month
- [ICLR 2023] Official implementation of Transnormer in our ICLR 2023 paper "Toeplitz Neural Network for Sequence Modeling" ☆79 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆159 · Updated 9 months ago
- The official code for Dropping Backward Propagation (DropBP) ☆30 · Updated 5 months ago
- Experimental playground for benchmarking language model (LM) architectures, layers, and tricks on smaller datasets. Designed for flexible… ☆25 · Updated 3 weeks ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆174 · Updated 11 months ago
- A block-oriented training approach for inference-time optimization ☆32 · Updated 8 months ago