thanhhau097 / google_fast_or_slow
☆10 · Updated 8 months ago
Alternatives and similar repositories for google_fast_or_slow:
Users interested in google_fast_or_slow are comparing it to the repositories listed below.
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). ☆12 · Updated 7 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆22 · Updated 10 months ago
- ☆52 · Updated last week
- APOLLO: SGD-like Memory, AdamW-level Performance ☆82 · Updated 2 weeks ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆114 · Updated 10 months ago
- ☆23 · Updated 2 months ago
- The implementation for the MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning" ☆43 · Updated last year
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆46 · Updated 6 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆28 · Updated 7 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 7 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆56 · Updated 2 months ago
- Codebase for the ICML'24 paper "Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs" ☆24 · Updated 6 months ago
- ☆74 · Updated last year
- A list of awesome neural symbolic papers. ☆44 · Updated 2 years ago
- [ECCV 2022] SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning ☆19 · Updated 2 years ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆27 · Updated 3 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆52 · Updated 2 months ago
- Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark" ☆10 · Updated 6 months ago
- ViT inference in Triton, because why not? ☆22 · Updated 7 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆68 · Updated 7 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 8 months ago
- Triton implementation of FlashAttention2 that adds Custom Masks. ☆88 · Updated 5 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆14 · Updated 6 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆82 · Updated 7 months ago
- Official code for "Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM" ☆13 · Updated last year
- ☆59 · Updated 2 months ago
- ☆41 · Updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆44 · Updated last year