hpcaitech / EnergonAI
Large-scale model inference.
☆630
Related projects:
- Fast Inference Solutions for BLOOM ☆556
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,317
- Examples of training models with hybrid parallelism using ColossalAI ☆334
- Microsoft Automatic Mixed Precision Library ☆507
- Efficient Training (including pre-training and fine-tuning) for Big Models ☆548
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆451
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,833
- LiBai (李白): A Toolbox for Large-Scale Distributed Parallel Training ☆389
- Best practices for training LLaMA models in Megatron-LM ☆606
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆1,843
- GPTQ inference Triton kernel ☆273
- Scalable PaLM implementation in PyTorch ☆191
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ☆1,877
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,099
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (a minimal dequantization sketch follows this list) ☆562
- Tutel MoE: An Optimized Mixture-of-Experts Implementation ☆711
- Official repository for LongChat and LongEval ☆505
- Running BERT without Padding ☆456
- FlagScale is a large-model toolkit based on open-source projects. ☆129
- FlashInfer: Kernel Library for LLM Serving ☆1,143
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (the scaling trick is sketched after this list) ☆1,183
- USP: Unified (a.k.a. Hybrid, 2D) Sequence-Parallel Attention for Long-Context Transformer Model Training and Inference ☆309
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆260
- Serving multiple LoRA-finetuned LLMs as one (see the batched-LoRA sketch below) ☆946
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆629
- [NeurIPS 2023] RRHF & Wombat ☆789
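The ~4x speedup claimed by the FP16xINT4 kernel above comes from small-batch decoding being memory-bound: reading weights as 4-bit instead of 16-bit cuts the bytes moved per matmul by roughly 4x. Below is a minimal PyTorch sketch of the unpack-and-dequantize step; the packing layout, `zero_point`, and per-row `scale` are illustrative assumptions, not that kernel's actual format.

```python
import torch

def dequant_int4(packed: torch.Tensor, scale: torch.Tensor, zero_point: int = 8) -> torch.Tensor:
    """Unpack two unsigned 4-bit weights per uint8 byte and dequantize to fp16.

    packed: uint8 tensor [rows, cols // 2], each byte holds two 4-bit values
    scale:  fp16 per-row scale [rows, 1] (layout is an illustrative assumption)
    """
    lo = (packed & 0x0F).to(torch.float16)          # low nibble
    hi = (packed >> 4).to(torch.float16)            # high nibble
    q = torch.stack((lo, hi), dim=-1).flatten(-2)   # interleave -> [rows, cols]
    return (q - zero_point) * scale                 # shift around zero_point, rescale

rows, cols = 4, 8
packed = torch.randint(0, 256, (rows, cols // 2), dtype=torch.uint8)
scale = torch.full((rows, 1), 0.01, dtype=torch.float16)
W = dequant_int4(packed, scale)                     # fp16 [4, 8], ready for a matmul
```

A real kernel fuses this dequantization into the matmul itself so the fp16 weights never round-trip through global memory.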
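SmoothQuant's core idea is a per-channel rescaling that migrates activation outliers into the weights before quantization, leaving Y = XW mathematically unchanged. A minimal sketch of that scaling step (alpha = 0.5 is the paper's default; the actual repo fuses this into its quantized kernels):

```python
import torch

def smoothing_scales(x_absmax: torch.Tensor, w_absmax: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return (x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

in_f, out_f = 8, 4
X = torch.randn(16, in_f)
X[:, 2] *= 50.0                        # channel 2 is an activation outlier
W = torch.randn(in_f, out_f)

s = smoothing_scales(X.abs().amax(dim=0), W.abs().amax(dim=1))
X_s, W_s = X / s, W * s.unsqueeze(1)   # Y = (X/s) @ (diag(s) W) == X @ W

assert torch.allclose(X @ W, X_s @ W_s, atol=1e-3)
assert X_s.abs().max() < X.abs().max()  # outlier channel tamed, easier to quantize
```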
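Serving many LoRA adapters over one base model works because LoRA's delta is low-rank: the expensive base matmul x @ W is shared across all requests, and only a cheap rank-r correction is gathered per request. A dense PyTorch sketch of the idea; production systems replace the gather with custom batched kernels, and all names here are illustrative:

```python
import torch

def batched_lora_forward(x, W, A, B, adapter_ids):
    """x: [batch, in_f] tokens; W: [in_f, out_f] shared base weight.
    A: [n_adapters, in_f, r], B: [n_adapters, r, out_f] stacked LoRA pairs.
    adapter_ids: [batch] int64, which adapter each request uses."""
    base = x @ W                                  # one matmul shared by every request
    a, b = A[adapter_ids], B[adapter_ids]         # gather each request's adapter pair
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)  # x @ A_i @ B_i
    return base + delta                           # per-request W_eff = W + A_i @ B_i

in_f, out_f, r, n_adapters, batch = 64, 64, 8, 3, 5
x = torch.randn(batch, in_f)
W = torch.randn(in_f, out_f)
A = torch.randn(n_adapters, in_f, r) * 0.01
B = torch.randn(n_adapters, r, out_f) * 0.01
ids = torch.tensor([0, 2, 1, 0, 2])               # mixed adapters in one batch
y = batched_lora_forward(x, W, A, B, ids)         # [5, 64]
```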