pytorch-tpu / llama
Inference code for LLaMA models
☆20 Updated 8 months ago
Alternatives and similar repositories for llama:
Users who are interested in llama are comparing it to the libraries listed below:
- ☆97 Updated 5 months ago
- ☆157 Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆114 Updated 10 months ago
- ☆114 Updated 10 months ago
- ☆58 Updated 8 months ago
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry ☆40 Updated last year
- Research and development for optimizing transformers ☆125 Updated 3 years ago
- [NeurIPS'23] Speculative Decoding with Big Little Decoder ☆88 Updated 11 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 Updated 4 months ago
- ☆85 Updated 8 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆204 Updated 5 months ago
- ☆25 Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆64 Updated 7 months ago
- ☆38 Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 Updated this week
- Easy and Efficient Quantization for Transformers ☆192 Updated last month
- Official Pytorch Implementation of Length-Adaptive Transformer (ACL 2021) ☆101 Updated 4 years ago
- Fast low-bit matmul kernels in Triton ☆199 Updated last week
- Official repo for "On the Generalization Ability of Retrieval-Enhanced Transformers" ☆37 Updated 7 months ago
- Fast sparse deep learning on CPUs ☆52 Updated 2 years ago
- An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆232 Updated 3 months ago
- ☆39 Updated 11 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆64 Updated 6 years ago
- ☆180 Updated 6 months ago
- ring-attention experiments ☆119 Updated 3 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆90 Updated last year
- FTPipe and related pipeline model parallelism research. ☆41 Updated last year
- ☆171 Updated last week
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆102 Updated 2 months ago
- Examples for MS-AMP package. ☆27 Updated 9 months ago