OswaldHe / HMT-pytorch
[NAACL 2025] Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing"
☆69 · Updated 2 months ago
Alternatives and similar repositories for HMT-pytorch:
Users interested in HMT-pytorch are comparing it to the libraries listed below.
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 4 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆148 · Updated 2 weeks ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆232 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆67 · Updated last month
- ☆78 · Updated 8 months ago
- PB-LLM: Partially Binarized Large Language Models ☆151 · Updated last year
- Token Omission Via Attention ☆126 · Updated 6 months ago
- [ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ☆87 · Updated 11 months ago
- ☆50 · Updated 5 months ago
- Here we will test various linear attention designs. ☆60 · Updated last year
- Work in progress. ☆56 · Updated 2 weeks ago
- ☆89 · Updated 7 months ago
- Understand and test language model architectures on synthetic tasks. ☆192 · Updated last month
- ☆143 · Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆59 · Updated 3 months ago
- ☆125 · Updated last year
- Using FlexAttention to compute attention with different masking patterns ☆43 · Updated 7 months ago
- ☆81 · Updated last year
- Stick-breaking attention ☆52 · Updated last month
- Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262) ☆62 · Updated last year
- ☆36 · Updated 7 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆30 · Updated 10 months ago
- ☆69 · Updated 2 months ago
- This repository contains code for the MicroAdam paper. ☆18 · Updated 4 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆210 · Updated 4 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆106 · Updated 6 months ago
- Some preliminary explorations of Mamba's context scaling. ☆212 · Updated last year
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 10 months ago