lucidrains / simple-hierarchical-transformer
Experiments around a simple idea for inducing multiple hierarchical predictive models within a GPT
☆224 · Aug 20, 2024 · Updated last year
Alternatives and similar repositories for simple-hierarchical-transformer
Users interested in simple-hierarchical-transformer are comparing it to the libraries listed below.
- Implementation of MEGABYTE, Predicting Million-byte Sequences with Multiscale Transformers, in Pytorch ☆655 · Dec 27, 2024 · Updated last year
- Some personal experiments around routing tokens to different autoregressive attention modules, akin to mixture-of-experts ☆123 · Oct 17, 2024 · Updated last year
- Implementation of fused cosine similarity attention in the same style as Flash Attention ☆220 · Feb 13, 2023 · Updated 3 years ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Oct 22, 2023 · Updated 2 years ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆100 · Aug 18, 2024 · Updated last year
- Exploring finetuning public checkpoints on filtered 8K sequences from the Pile ☆115 · Mar 22, 2023 · Updated 2 years ago
- Trying to deconstruct RWKV in understandable terms ☆14 · May 6, 2023 · Updated 2 years ago
- Implementation of the Llama architecture with RLHF + Q-learning ☆170 · Feb 1, 2025 · Updated last year
- My attempts at applying the SoundStream design to learned tokenization of text, then applying hierarchical attention to text generation ☆90 · Oct 11, 2024 · Updated last year
- Implementation of Soft MoE, proposed by Google Brain's Vision team, in Pytorch ☆344 · Apr 2, 2025 · Updated 10 months ago
- Implementation of an Attention layer where each head can attend to more than just one token, using coordinate descent to pick the top-k ☆47 · Jul 16, 2023 · Updated 2 years ago
- ☆65 · Oct 4, 2023 · Updated 2 years ago
- Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate… ☆641 · Jul 17, 2023 · Updated 2 years ago
- Implementation of Block Recurrent Transformer - Pytorch ☆224 · Aug 20, 2024 · Updated last year
- An attempt to merge ESBN with Transformers, to endow Transformers with the ability to emergently bind symbols ☆16 · Aug 3, 2021 · Updated 4 years ago
- Convolutions for Sequence Modeling ☆911 · Jun 13, 2024 · Updated last year
- Implementation of RETRO, Deepmind's Retrieval based Attention net, in Pytorch ☆879 · Oct 30, 2023 · Updated 2 years ago
- Hierarchical Attention Transformers (HAT) ☆61 · Jan 12, 2024 · Updated 2 years ago
- Pytorch implementation of Compressive Transformers, from Deepmind ☆163 · Oct 4, 2021 · Updated 4 years ago
- My explorations into editing the knowledge and memories of an attention network ☆35 · Dec 8, 2022 · Updated 3 years ago
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch ☆804 · Jan 30, 2026 · Updated 2 weeks ago
- Zeta implementation of a reusable, plug-and-play feedforward from the paper "Exponentially Faster Language Modeling" ☆16 · Nov 11, 2024 · Updated last year
- Implementation of "compositional attention" from MILA, a multi-head attention variant that is reframed as a two-step attention process wi…☆51May 10, 2022Updated 3 years ago
- Implementation of a U-net complete with efficient attention as well as the latest research findings☆292May 3, 2024Updated last year
- 🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(W), in Pytorch ☆2,184 · Nov 27, 2024 · Updated last year
- ImageNet-12k subset of ImageNet-21k (fall11) ☆21 · Jun 13, 2023 · Updated 2 years ago
- Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention ☆270 · Aug 10, 2021 · Updated 4 years ago
- Simple large-scale training of stable diffusion with multi-node support. ☆133 · May 8, 2023 · Updated 2 years ago
- Bayesian model reduction for probabilistic machine learning ☆11 · Jul 3, 2025 · Updated 7 months ago
- GPT, but made only out of MLPs ☆89 · May 25, 2021 · Updated 4 years ago
- Implementation of Discrete Key / Value Bottleneck, in Pytorch ☆88 · Jul 9, 2023 · Updated 2 years ago
- Implementation of Mega, the Single-head Attention with Multi-headed EMA architecture that currently holds SOTA on Long Range Arena ☆207 · Aug 26, 2023 · Updated 2 years ago
- Latent Diffusion Language Models ☆70 · Sep 20, 2023 · Updated 2 years ago
- Implementation of Recurrent Interface Network (RIN), for highly efficient generation of images and video without cascading networks, in P… ☆207 · Feb 14, 2024 · Updated 2 years ago
- ☆24 · Sep 25, 2024 · Updated last year
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights ☆19 · Oct 9, 2022 · Updated 3 years ago
- Here we will test various linear attention designs. ☆62 · Apr 25, 2024 · Updated last year
- Implementation of Denoising Diffusion for protein design, but using the new Equiformer (successor to SE3 Transformers) with some addition… ☆57 · Dec 27, 2022 · Updated 3 years ago
- MACTA: A Multi-agent Reinforcement Learning Approach for Cache Timing Attacks and Detection ☆46 · Apr 25, 2023 · Updated 2 years ago