LLM Quantization toolkit
☆20May 2, 2026Updated 3 weeks ago
Alternatives and similar repositories for lm-quant-toolkit
Users that are interested in lm-quant-toolkit are comparing it to the libraries listed below.
- [ICLRW'26] EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation☆45Apr 21, 2026Updated last month
- ☆48May 9, 2026Updated 2 weeks ago
- Open-source evaluation toolkit of large vision-language models (LVLMs), support ~100 VLMs, 30+ benchmarks☆15Feb 17, 2025Updated last year
- Pytorch implementation of our paper accepted by ICML 2023 -- "Bi-directional Masks for Efficient N:M Sparse Training"☆13Jun 7, 2023Updated 2 years ago
- OnePlus 8T Param Read/Write☆14Dec 4, 2020Updated 5 years ago
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models☆26Sep 14, 2025Updated 8 months ago
- ☆12Apr 4, 2024Updated 2 years ago
- Cross-Self KV Cache Pruning for Efficient Vision-Language Inference☆10Dec 15, 2024Updated last year
- BESA is a differentiable weight pruning technique for large language models.☆17Mar 4, 2024Updated 2 years ago
- ☆17May 2, 2024Updated 2 years ago
- Pytorch code of [CVPR 2023] "NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction".☆11Mar 14, 2023Updated 3 years ago
- ☆21Feb 5, 2024Updated 2 years ago
- channel pruning for accelerating very deep neural networks☆13Mar 8, 2021Updated 5 years ago
- [NAACL 2025] MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning☆20May 31, 2025Updated 11 months ago
- Pytorch implementation of our paper accepted by NeurIPS 2022 -- Learning Best Combination for Efficient N:M Sparsity☆22Jan 13, 2023Updated 3 years ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models☆92Oct 22, 2024Updated last year
- This is the official Python version of CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Act…☆17Oct 25, 2024Updated last year
- Github Repo for OATS: Outlier-Aware Pruning through Sparse and Low Rank Decomposition☆20Apr 16, 2025Updated last year
- A fork of textgen that kept some things like Exllama and old GPTQ.☆22Aug 20, 2024Updated last year
- Reading notes on Speculative Decoding papers☆34Apr 16, 2026Updated last month
- XTD kernel for OnePlus 8 series build with latest clang☆19May 2, 2025Updated last year
- Evolutionary-Algorithm and Large-Language-Model☆23Nov 5, 2024Updated last year
- An advanced web browsing server for the Model Context Protocol (MCP) powered by Playwright, enabling headless browser interactions throug…☆27Mar 10, 2025Updated last year
- rdiv!(::AbstractMatrix, ::UpperTriangular) and ldiv!(::LowerTriangular, ::AbstractMatrix)☆12Nov 18, 2024Updated last year
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [NeurIPS 2025]☆18Oct 11, 2025Updated 7 months ago
- Julia implementation of flash-attention operation for neural networks.☆11May 31, 2023Updated 2 years ago
- A model serving framework for various research and production scenarios. Seamlessly built upon the PyTorch and HuggingFace ecosystem.☆23Oct 11, 2024Updated last year
- The code for "AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference", Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Ch…☆28Jul 15, 2025Updated 10 months ago
- ☆129Jan 22, 2024Updated 2 years ago
- This is the source code of our ICML25 paper, titled "Accelerating Large Language Model Reasoning via Speculative Search".☆23Jun 1, 2025Updated 11 months ago
- Sparse symmetric indefinite solver implemented with a runtime system☆13May 11, 2020Updated 6 years ago
- SQL Optimizations using MLIR☆12Apr 5, 2020Updated 6 years ago
- ThinK: Thinner Key Cache by Query-Driven Pruning☆29Feb 11, 2025Updated last year
- Official PyTorch implementation of the paper "Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Princ…☆42Jul 18, 2025Updated 10 months ago
- Notes and summaries recorded while quantizing LLMs.☆74Jan 8, 2026Updated 4 months ago
- An MPI wrapper for the pytorch tensor library that is automatically differentiable☆10Mar 27, 2023Updated 3 years ago
- Distributed SDDMM Kernel☆12Jul 8, 2022Updated 3 years ago
- ☆26Feb 22, 2024Updated 2 years ago
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models☆41Aug 4, 2023Updated 2 years ago