GreenBitAI / green-bit-llm
A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs.
Related projects
Alternatives and complementary repositories for green-bit-llm
- QuIP quantization
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (a minimal draft-and-verify sketch appears after this list)
- PB-LLM: Partially Binarized Large Language Models
- Code and materials on speeding up LLM inference using token merging.
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks.
- Layer-Condensed KV cache with 10× larger batch size, fewer parameters, and less computation. Dramatic speed-up with better task performance…
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry
- A repository for research on medium-sized language models.
- Cascade Speculative Drafting
- The official repo for "LLoCo: Learning Long Contexts Offline"
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
- Data preparation code for Amber 7B LLM
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
- KV cache compression for high-throughput LLM inference
- Spherical (SLERP) merging of PyTorch/HF-format language models with minimal feature loss (a SLERP sketch appears after this list)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
- A pipeline for LLM knowledge distillation
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs"
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
- My fork of Allen AI's OLMo for educational purposes.
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ
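
Two of the entries above name techniques concrete enough to sketch. First, speculative decoding: a small draft model proposes several tokens, and the large target model verifies them in a single forward pass. The sketch below is a generic greedy-verification variant, not the linked repo's actual method; the `target_logits_fn`/`draft_logits_fn` interface and the function name are assumptions for illustration (real implementations also reuse the target's KV cache to make verification cheap).

```python
import torch

@torch.no_grad()
def speculative_decode_step(target_logits_fn, draft_logits_fn, tokens, k=4):
    """One draft-and-verify step of greedy speculative decoding.

    Both *_logits_fn arguments are callables mapping a (1, seq_len) LongTensor
    of token ids to (1, seq_len, vocab_size) logits (hypothetical interface).
    """
    prompt_len = tokens.shape[1]

    # 1) The cheap draft model proposes k tokens autoregressively.
    drafted = tokens
    for _ in range(k):
        next_id = draft_logits_fn(drafted)[:, -1].argmax(dim=-1, keepdim=True)
        drafted = torch.cat([drafted, next_id], dim=-1)

    # 2) A single target-model pass scores every drafted position at once;
    #    this batched verification is where the latency win comes from.
    logits = target_logits_fn(drafted)
    target_preds = logits[:, prompt_len - 1 : -1].argmax(dim=-1)  # (1, k)
    proposal = drafted[:, prompt_len:]                            # (1, k)

    # 3) Accept the longest prefix on which draft and target agree.
    agree = (target_preds == proposal).long().cumprod(dim=-1)
    n_accept = int(agree.sum())

    # 4) Append the target's own token at the first disagreement
    #    (or one "bonus" token when the whole draft was accepted).
    correction = logits[:, prompt_len - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([tokens, proposal[:, :n_accept], correction], dim=-1)
```

Every returned token matches what greedy decoding with the target model alone would have produced, so quality is unchanged while up to k + 1 tokens are emitted per target forward pass.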
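
Second, spherical merging: SLERP interpolates along the great circle between two weight vectors rather than along the straight line, which better preserves weight norms than plain averaging. A minimal PyTorch sketch of the idea, assuming both checkpoints are floating-point and share identical shapes and key names; `slerp` and `merge_state_dicts` are hypothetical helpers, not the repo's API.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors at fraction t."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    # Angle between the two weight vectors.
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    sin_theta = torch.sin(theta)
    if sin_theta.abs() < eps:
        # Nearly colinear vectors: fall back to linear interpolation.
        merged = (1.0 - t) * a + t * b
    else:
        merged = (torch.sin((1.0 - t) * theta) / sin_theta) * a \
               + (torch.sin(t * theta) / sin_theta) * b
    return merged.reshape(w_a.shape).to(w_a.dtype)

def merge_state_dicts(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    # Assumes matching architectures: same keys, same tensor shapes.
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}
```

Usage would be loading two checkpoints with `torch.load`, merging with `merge_state_dicts(sd_a, sd_b, 0.5)`, and loading the result back into the model; production mergers typically apply per-layer interpolation schedules rather than a single global t.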