samchaineau / llm_slerp_generationLinks

Repo hosting codes and materials related to speeding LLMs' inference using token merging.

☆36

Alternatives and similar repositories for llm_slerp_generation

Users that are interested in llm_slerp_generation are comparing it to the libraries listed below

Sorting:

TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated last year
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆51Updated 6 months ago
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆117Updated last year
FasterDecoding / BitDelta
☆202Updated 10 months ago
thepowerfuldeez / OLMo
My fork os allen AI's OLMo for educational purposes.
☆30Updated 10 months ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42Updated last year
NathanGodey / qfilters
Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)
☆35Updated 7 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
minyoungg / LTE
☆69Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
IST-DASLab / RoSA
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
☆44Updated last year
hetailang / SqueezeAttention
☆38Updated last year
HanGuo97 / lq-lora
☆127Updated last year
SalesforceAIResearch / GemFilter
☆85Updated 9 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆101Updated last week
tiiuae / onebitllms
Lightweight toolkit package to train and fine-tune 1.58bit Language models
☆92Updated 5 months ago
arcee-ai / DAM
☆55Updated 11 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆147Updated last week
RobertCsordas / moeut
☆86Updated last year
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆99Updated last year
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆162Updated 6 months ago
kyegomez / Infini-attention
Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…
☆56Updated this week
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆155Updated 6 months ago
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆26Updated 2 years ago
NolanoOrg / SpectraSuite
☆51Updated last year
Pleias / Various-Finetuning
Set of scripts to finetune LLMs
☆38Updated last year
scitix / MEAP
Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
☆34Updated 5 months ago