itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code)
☆135 · Updated last month
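
The repository name describes the architecture: a coarse global decoder models dependencies between blocks of tokens, and a lightweight local decoder predicts the tokens within each block from its block-level context, shrinking attention cost and KV-cache size at inference time. Below is a minimal, illustrative sketch of that global-to-local split, written from the paper's summary rather than from this codebase; the class name, the mean-pooled block embedder, and all hyperparameters are assumptions, not the repository's actual API.

```python
# Toy sketch of global-to-local block modeling -- NOT the official implementation.
import torch
import torch.nn as nn

class ToyBlockTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, block_size=4,
                 n_heads=4, n_global_layers=2, n_local_layers=2):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        # Global (block-level) decoder: causal attention over block embeddings.
        self.block_dec = nn.TransformerEncoder(make_layer(), n_global_layers)
        # Local (token-level) decoder: attention restricted to a single block.
        self.token_dec = nn.TransformerEncoder(make_layer(), n_local_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len), with seq_len divisible by block_size
        B, T = tokens.shape
        nb, bs = T // self.block_size, self.block_size
        x = self.tok_emb(tokens)                        # (B, T, D)
        # 1) Embedder: mean-pool each block of tokens into one block embedding.
        blocks = x.view(B, nb, bs, -1).mean(dim=2)      # (B, nb, D)
        # 2) Global decoder: causal self-attention between blocks only.
        block_mask = torch.triu(torch.ones(nb, nb, dtype=torch.bool), 1)
        ctx = self.block_dec(blocks, mask=block_mask)   # (B, nb, D)
        # 3) Shift context by one block so block i conditions on blocks < i.
        ctx = torch.cat([torch.zeros_like(ctx[:, :1]), ctx[:, :-1]], dim=1)
        # 4) Local decoder: tokens attend only within their own block, with the
        #    global context embedding added as conditioning.
        local_in = x.view(B * nb, bs, -1) + ctx.reshape(B * nb, 1, -1)
        tok_mask = torch.triu(torch.ones(bs, bs, dtype=torch.bool), 1)
        out = self.token_dec(local_in, mask=tok_mask)   # (B*nb, bs, D)
        return self.lm_head(out.reshape(B, T, -1))      # next-token logits

logits = ToyBlockTransformer()(torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

The payoff of the split is that global attention handles one embedding per block rather than one per token, so the expensive block-level KV cache grows per block; the token-level decoder only ever attends within a short block.
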
Related projects
Alternatives and complementary repositories for block-transformer
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆129 · Updated 2 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆113 · Updated 5 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆174 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆78 · Updated this week
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆92 · Updated last month
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆134 · Updated 5 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆118 · Updated 4 months ago
- Co-LLM: Learning to Decode Collaboratively with Multiple Language Models ☆103 · Updated 6 months ago
- LongRoPE is a method that extends the context window of pre-trained LLMs to 2048k tokens. ☆103 · Updated 2 months ago
- This is the official repository for Inheritune. ☆105 · Updated last month
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆123 · Updated 6 months ago
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆68 · Updated 5 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆112 · Updated 2 months ago
- Layer-Condensed KV cache with 10x larger batch size, fewer parameters, and less computation; dramatic speedup with better task performance… ☆139 · Updated this week
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆64 · Updated 5 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆138 · Updated 2 months ago
- PyTorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24) ☆50 · Updated 7 months ago
- Low-bit optimizers for PyTorch ☆119 · Updated last year
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆199 · Updated 6 months ago
- A framework to study AI models in Reasoning, Alignment, and use of Memory (RAM). ☆145 · Updated 2 weeks ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆104 · Updated last month