Anonymous1252022 / fp4-all-the-way
☆44 · Updated 7 months ago
Alternatives and similar repositories for fp4-all-the-way
Users interested in fp4-all-the-way are comparing it to the libraries listed below.
- Work in progress · ☆77 · Updated last month
- The evaluation framework for training-free sparse attention in LLMs · ☆108 · Updated 3 months ago
- Official implementation for "Training LLMs with MXFP4" · ☆116 · Updated 8 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization · ☆111 · Updated last year
- An extension of the GaLore paper, performing Natural Gradient Descent in a low-rank subspace · ☆18 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity · ☆90 · Updated last year
- KV cache compression for high-throughput LLM inference · ☆148 · Updated 11 months ago
- QuIP quantization · ☆61 · Updated last year
- Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark" · ☆27 · Updated 6 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning · ☆159 · Updated 2 months ago
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization · ☆171 · Updated last month
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" · ☆153 · Updated last month
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ☆128 · Updated 6 months ago
- Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025) · ☆50 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆225 · Updated 7 months ago
- Code for data-aware compression of DeepSeek models · ☆68 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆175 · Updated last year
- LLM Inference with Microscaling Format · ☆34 · Updated last year
- [ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models · ☆47 · Updated last year
- Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” · ☆110 · Updated this week
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton · ☆147 · Updated 2 months ago
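
Many of the entries above revolve around block-scaled low-bit formats (MXFP4, NVFP4, microscaling). As a rough illustration of the shared idea, here is a minimal NumPy sketch of block-wise FP4 (E2M1) fake quantization. Everything in it (function names, the 32-element block size, the unconstrained float scale) is an assumption for illustration, not the API of fp4-all-the-way or of any repository listed above.

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x: np.ndarray, block_size: int = 32):
    """Fake-quantize a 1-D tensor to FP4 with one shared scale per block.

    Illustrative only: real MX formats constrain the shared scale to a
    power of two (E8M0); an unconstrained float scale is used here.
    """
    assert x.ndim == 1 and x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    # Pick each block's scale so its absmax maps onto the largest FP4 code (6.0).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # all-zero block: any scale works
    scaled = blocks / scales
    # Round each magnitude to the nearest point on the FP4 grid
    # (sign handled separately; real hardware rounds to nearest even).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]
    return codes, scales

def dequantize_fp4_blockwise(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original tensor."""
    return (codes * scales).reshape(-1)

x = np.random.randn(128).astype(np.float32)
codes, scales = quantize_fp4_blockwise(x)
x_hat = dequantize_fp4_blockwise(codes, scales)
print("max abs error:", np.abs(x - x_hat).max())
```

The listed formats differ mainly in how that shared scale is encoded: MXFP4 stores one power-of-two (E8M0) scale per 32-element block, while NVFP4 uses an FP8 (E4M3) scale per 16-element block.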