Montinger / Transformer-Workbench
Playground for Transformers
☆48 · Updated last year
Alternatives and similar repositories for Transformer-Workbench:
Users interested in Transformer-Workbench are comparing it to the libraries listed below.
- Several types of attention modules written in PyTorch for learning purposes ☆47 · Updated 5 months ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆159 · Updated 10 months ago
- My fork of Allen AI's OLMo for educational purposes. ☆30 · Updated 3 months ago
- PyTorch implementation of Soft MoE by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf) ☆71 · Updated last year
- ☆47 · Updated 7 months ago
- Contextual Position Encoding with some custom CUDA kernels (https://arxiv.org/abs/2405.18719) ☆22 · Updated 9 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 6 months ago
- Unofficial implementation of https://arxiv.org/pdf/2407.14679 ☆44 · Updated 6 months ago
- ☆131 · Updated last year
- PyTorch implementation of MoE (mixture of experts) ☆42 · Updated 4 years ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆150 · Updated 3 months ago
- A personal reimplementation of Google's Infini-transformer using a small 2B model. The project includes both model and train… ☆56 · Updated 11 months ago
- A byte-level decoder architecture that matches the performance of tokenized Transformers. ☆63 · Updated 11 months ago
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆86 · Updated 2 weeks ago
- Code for the DDP tutorial ☆32 · Updated 2 years ago
- ☆29 · Updated last year
- Community implementation of the paper "Multi-Head Mixture-of-Experts" in PyTorch ☆22 · Updated 2 months ago
- PyTorch implementation of the paper "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" ☆24 · Updated last week
- Experiments on Multi-Head Latent Attention ☆80 · Updated 7 months ago
- MathPrompter implementation: this repository hosts an implementation based on the 'MathPrompter: Mathematical Reasoning Using Large Langu… ☆13 · Updated 8 months ago
- Implementation of Griffin from the paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" ☆52 · Updated 2 months ago
- LoRA and DoRA from-scratch implementations ☆199 · Updated last year
- This repository contains the code used for my "Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch" blog po… ☆87 · Updated last year
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). ☆11 · Updated 10 months ago
- Tiled Flash Linear Attention library for fast and efficient mLSTM kernels. ☆47 · Updated last week
- ☆66 · Updated last week
- Exploration of the multimodal fuyu-8b model from Adept. 🤓 🔍 ☆28 · Updated last year
- A single repo with all scripts and utils to train / fine-tune the Mamba model, with or without FIM ☆54 · Updated 11 months ago
- [ICML'24] The official implementation of "Rethinking Optimization and Architecture for Tiny Language Models" ☆121 · Updated 2 months ago
- ☆145 · Updated last year