google-deepmind / language_modeling_is_compression
☆121 · Updated 5 months ago
Alternatives and similar repositories for language_modeling_is_compression:
Users interested in language_modeling_is_compression are comparing it to the repositories listed below.
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆130 · Updated 4 months ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" ☆153 · Updated 2 months ago
- ☆82 · Updated 4 months ago
- The HELMET Benchmark ☆114 · Updated last week
- Language models scale reliably with over-training and on downstream tasks ☆96 · Updated 10 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆172 · Updated 6 months ago
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆43 · Updated last week
- Some preliminary explorations of Mamba's context scaling. ☆213 · Updated last year
- Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262) ☆60 · Updated last year
- ☆58 · Updated 9 months ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆114 · Updated 11 months ago
- Simple and efficient PyTorch-native transformer training and inference (batched) ☆68 · Updated 10 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆196 · Updated 2 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆149 · Updated last month
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies (https://arxiv.org/abs/2407.13623) ☆76 · Updated 4 months ago
- Repo for the ACL 2023 Findings paper "Emergent Modularity in Pre-trained Transformers" ☆21 · Updated last year
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆81 · Updated 8 months ago
- Replicating O1 inference-time scaling laws ☆82 · Updated 2 months ago
- Layer-Condensed KV cache with 10× larger batch size, fewer parameters, and less computation. Dramatic speedup with better task performance… ☆147 · Updated 3 weeks ago
- Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆55 · Updated last month
- ☆50 · Updated 2 months ago
- Repo of the paper "Free Process Rewards without Process Labels" ☆118 · Updated last month
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆220 · Updated 2 months ago
- Stick-breaking attention ☆42 · Updated last month
- ☆125 · Updated last year
- ☆80 · Updated 11 months ago
- Code for the paper "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment" ☆120 · Updated 3 months ago
- Code accompanying the paper "Massive Activations in Large Language Models" ☆140 · Updated 11 months ago
- ☆99 · Updated 11 months ago