PiotrNawrot / sparse-frontier
Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs"
☆21 · Updated this week
Alternatives and similar repositories for sparse-frontier:
Users interested in sparse-frontier are comparing it to the libraries listed below.
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆29 · Updated last month
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆26 · Updated last year
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) ☆30 · Updated last month
- Official repository of "LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging" ☆25 · Updated 5 months ago
- Triton implementation of the HyperAttention algorithm ☆47 · Updated last year
- Code for the paper "Function-Space Learning Rates" ☆19 · Updated last week
- Using FlexAttention to compute attention with different masking patterns ☆43 · Updated 7 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆37 · Updated last year
- GoldFinch and other hybrid transformer components ☆45 · Updated 9 months ago
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers ☆17 · Updated last month
- Stick-breaking attention ☆52 · Updated last month
- Exploration of automated dataset selection approaches at large scales ☆39 · Updated last month
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆47 · Updated last year
- Tiny re-implementation of MDM in the style of LLaDA and the nano-gpt speedrun ☆48 · Updated last month
- Official code for the paper "Attention as a Hypernetwork" ☆30 · Updated 10 months ago
- A basic pure PyTorch implementation of flash attention ☆16 · Updated 6 months ago
- Efficient scaling laws and collaborative pretraining ☆16 · Updated 3 months ago
- Implementation of Gradient Agreement Filtering, from Chaubard et al. of Stanford, but for single-machine microbatches, in PyTorch ☆24 · Updated 3 months ago
- Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers ☆17 · Updated last month
- We introduce EMMET and unify model editing with the popular algorithms ROME and MEMIT ☆17 · Updated 4 months ago
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆71 · Updated 5 months ago
- Repository containing code for Adaptive Data Optimization ☆24 · Updated 4 months ago
- A repository for research on medium-sized language models ☆76 · Updated 11 months ago