thanhhau097 / google_fast_or_slow
☆10 · Updated 8 months ago
Alternatives and similar repositories for google_fast_or_slow:
Users interested in google_fast_or_slow are comparing it to the repositories listed below.
- This repository contains papers for a comprehensive survey on accelerated generation techniques in Large Language Models (LLMs). ☆12 · Updated 7 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆22 · Updated 10 months ago
- ☆52 · Updated last week
- APOLLO: SGD-like Memory, AdamW-level Performance ☆82 · Updated 2 weeks ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆114 · Updated 10 months ago
- ☆23 · Updated 2 months ago
- The implementation for the MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning" ☆43 · Updated last year
- FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation ☆46 · Updated 6 months ago
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆28 · Updated 7 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 7 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆56 · Updated 2 months ago
- Codebase for the ICML'24 paper "Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs" ☆24 · Updated 6 months ago
- ☆74 · Updated last year
- A list of awesome neural symbolic papers. ☆44 · Updated 2 years ago
- [ECCV 2022] SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning ☆19 · Updated 2 years ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆27 · Updated 3 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework". ☆52 · Updated 2 months ago
- Official code for the paper "Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark" ☆10 · Updated 6 months ago
- ViT inference in Triton, because why not? ☆22 · Updated 7 months ago
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆68 · Updated 7 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆21 · Updated 8 months ago
- Triton implementation of FlashAttention2 that adds Custom Masks. ☆88 · Updated 5 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆14 · Updated 6 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆82 · Updated 7 months ago
- Official code for "Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM" ☆13 · Updated last year
- ☆59 · Updated 2 months ago
- ☆41 · Updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆44 · Updated last year