Agora-Lab-AI / BitNet-a4.8
A BitNet a4.8 implementation in a single PyTorch file.
☆13 · Updated 2 months ago
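BitNet a4.8 combines low-bit (ternary) weights with 4-bit activations. A minimal sketch of that idea, assuming absmean ternary weight quantization and per-token absmax 4-bit activation quantization with a straight-through estimator; the function and class names here are illustrative, not taken from the repository:

```python
# Hedged sketch of BitNet-a4.8-style fake quantization in PyTorch.
# Names (quantize_weights_ternary, BitLinearA48, ...) are illustrative only.
import torch
import torch.nn.functional as F


def quantize_weights_ternary(w: torch.Tensor) -> torch.Tensor:
    """Absmean quantization of weights to {-scale, 0, +scale} (~1.58 bits)."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale


def quantize_activations_int4(x: torch.Tensor) -> torch.Tensor:
    """Per-token absmax quantization of activations to signed 4-bit [-8, 7]."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    return (x / scale).round().clamp(-8, 7) * scale


class BitLinearA48(torch.nn.Linear):
    """Linear layer with fake-quantized ternary weights and 4-bit activations.

    The straight-through estimator (adding the quantization delta via
    .detach()) keeps the layer differentiable for training.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = self.weight + (quantize_weights_ternary(self.weight) - self.weight).detach()
        x_q = x + (quantize_activations_int4(x) - x).detach()
        return F.linear(x_q, w_q, self.bias)
```

This only simulates quantization in floating point; a real kernel would pack the ternary weights and int4 activations and use multiplication-free arithmetic.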
Alternatives and similar repositories for BitNet-a4.8:
Users interested in BitNet-a4.8 are comparing it to the libraries listed below.
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆105 · Updated 5 months ago
- Tiny ASIC implementation of the matrix multiplication unit from "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" ☆130 · Updated 11 months ago
- [ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆212 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆155 · Updated 5 months ago
- ACL 2023 ☆39 · Updated last year
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆47 · Updated 8 months ago
- PB-LLM: Partially Binarized Large Language Models ☆152 · Updated last year
- QuIP quantization ☆52 · Updated last year
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs ☆81 · Updated 3 weeks ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆125 · Updated 3 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆28 · Updated 4 months ago
- Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs ☆111 · Updated last year
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆260 · Updated 5 months ago
- Repository hosting code and materials on speeding up LLM inference via token merging ☆35 · Updated 11 months ago
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts ☆113 · Updated 10 months ago
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ☆28 · Updated 7 months ago
- [HPCA '21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning ☆83 · Updated 7 months ago
- Cascade Speculative Drafting ☆29 · Updated last year
- Compression for Foundation Models ☆30 · Updated last week
- The homepage of the OneBit model quantization framework ☆175 · Updated last month
- Efficient Infinite Context Transformers with Infini-attention: PyTorch implementation + QwenMoE implementation + training script + 1M cont… ☆81 · Updated 10 months ago