alexisrozhkov / dilated-self-attention
Implementation of dilated self-attention as described in "LongNet: Scaling Transformers to 1,000,000,000 Tokens"
☆13 · Updated last year
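For context on what this repository implements: dilated attention splits the sequence into fixed-length segments and attends only over every r-th token inside each segment, keeping cost roughly linear in sequence length. Below is a minimal PyTorch sketch of that segment-and-dilate pattern, assuming a single head, one segment/dilation pair, and no QKV projections or causal masking (all of which LongNet adds); `segment_len` and `dilation` are illustrative parameter names, not this repository's API.

```python
import torch
import torch.nn.functional as F


def dilated_self_attention(x, segment_len=4, dilation=2):
    """x: (batch, seq_len, dim). seq_len must be divisible by segment_len."""
    b, n, d = x.shape
    assert n % segment_len == 0, "pad the sequence to a multiple of segment_len"

    # 1. Split the sequence into non-overlapping segments.
    segments = x.view(b, n // segment_len, segment_len, d)

    # 2. Keep every `dilation`-th token inside each segment (the sparse pattern).
    sparse = segments[:, :, ::dilation, :]                # (b, n_seg, kept, d)
    b_, n_seg, kept, _ = sparse.shape

    # 3. Ordinary scaled dot-product attention inside each sparsified segment.
    qkv = sparse.reshape(b_ * n_seg, kept, d)
    out = F.scaled_dot_product_attention(qkv, qkv, qkv)   # (b*n_seg, kept, d)

    # 4. Scatter the attended tokens back to their original positions; positions
    #    dropped by the dilation simply keep their input values in this sketch.
    result = segments.clone()
    result[:, :, ::dilation, :] = out.view(b_, n_seg, kept, d)
    return result.view(b, n, d)


x = torch.randn(2, 16, 32)
print(dilated_self_attention(x).shape)  # torch.Size([2, 16, 32])
```

In the paper, several (segment length, dilation) pairs are run and their outputs combined so that every position is covered; the sketch uses a single pair for clarity.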
Alternatives and similar repositories for dilated-self-attention:
Users interested in dilated-self-attention are comparing it to the libraries listed below.
- (Unofficial) Implementation of dilated attention from "LongNet: Scaling Transformers to 1,000,000,000 Tokens" (https://arxiv.org/abs/2307… ☆50 · Updated last year
- ☆50 · Updated 5 months ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆188 · Updated 8 months ago
- ☆125 · Updated last year
- QuIP quantization ☆51 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆105 · Updated 5 months ago
- Exploring finetuning public checkpoints on filtered 8K sequences on Pile ☆115 · Updated 2 years ago
- PB-LLM: Partially Binarized Large Language Models ☆151 · Updated last year
- Recurrent Memory Transformer ☆149 · Updated last year
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 6 months ago
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆103 · Updated last year
- A fast implementation of T5/UL2 in PyTorch using Flash Attention ☆102 · Updated last month
- Spherical Merge Pytorch/HF format Language Models with minimal feature loss. ☆120 · Updated last year
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆82 · Updated last month
- ☆101 · Updated 10 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 10 months ago
- Minimal PyTorch implementation of BM25 (with sparse tensors) ☆100 · Updated last year
- Implementation of CALM from the paper "LLM Augmented LLMs: Expanding Capabilities through Composition", out of Google Deepmind ☆174 · Updated 7 months ago
- Some common Huggingface transformers in maximal update parametrization (µP) ☆80 · Updated 3 years ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 6 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆26 · Updated 7 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆59 · Updated 3 months ago
- ☆33 · Updated 10 months ago
- EvaByte: Efficient Byte-level Language Models at Scale ☆88 · Updated this week
- Understand and test language model architectures on synthetic tasks. ☆192 · Updated last month
- Official repository of Sparse ISO-FLOP Transformations for Maximizing Training Efficiency ☆25 · Updated 8 months ago
- Comprehensive analysis of difference in performance of QLora, Lora, and Full Finetunes. ☆82 · Updated last year
- Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)* ☆82 · Updated last year
- The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed". ☆167 · Updated 3 weeks ago
- Code for NeurIPS LLM Efficiency Challenge ☆57 · Updated last year