bytedance / effective_transformer
Running BERT without Padding
☆468 · Updated 2 years ago
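The repository's headline technique is running BERT without padding: instead of computing over every slot in a padded batch, the valid tokens are packed into a contiguous buffer, the transformer layers run on that smaller buffer, and results are scattered back afterward. The following is a minimal pure-Python sketch of that idea only; the function names (`pack_tokens`, `unpack_tokens`) are hypothetical and this is not the library's actual CUDA implementation.

```python
def pack_tokens(hidden, mask):
    """Gather the real (non-pad) tokens into one flat packed list,
    remembering each token's (batch, position) so it can be restored."""
    packed, idx = [], []
    for b, seq in enumerate(hidden):
        for t, vec in enumerate(seq):
            if mask[b][t]:
                packed.append(vec)
                idx.append((b, t))
    return packed, idx

def unpack_tokens(packed, idx, batch, seq_len, dim):
    """Scatter packed results back into the padded (batch, seq, dim) layout."""
    out = [[[0.0] * dim for _ in range(seq_len)] for _ in range(batch)]
    for vec, (b, t) in zip(packed, idx):
        out[b][t] = vec
    return out

# Two sequences padded to length 3; only 3 of the 6 slots hold real tokens.
hidden = [[[1.0], [2.0], [0.0]], [[3.0], [0.0], [0.0]]]
mask = [[1, 1, 0], [1, 0, 0]]

packed, idx = pack_tokens(hidden, mask)   # downstream math runs on 3 rows, not 6
# ... transformer layers would operate on `packed` here ...
restored = unpack_tokens(packed, idx, batch=2, seq_len=3, dim=1)
```

The payoff grows with how ragged the batch is: here half the slots are padding, so the packed buffer is half the size of the padded one.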
Alternatives and similar repositories for effective_transformer:
Users interested in effective_transformer are comparing it to the libraries listed below.
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆467 · Updated 10 months ago
- ☆211 · Updated last year
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆266 · Updated last year
- A collection of memory-efficient attention operators implemented in the Triton language. ☆229 · Updated 7 months ago
- Transformer-related optimization, including BERT, GPT ☆61 · Updated last year
- ☆411 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆400 · Updated 2 weeks ago
- Zero Bubble Pipeline Parallelism ☆309 · Updated 2 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆369 · Updated 2 months ago
- Microsoft Automatic Mixed Precision Library ☆549 · Updated 3 months ago
- Simple Dynamic Batching Inference ☆145 · Updated 2 years ago
- Fast implementation of BERT inference directly on NVIDIA (CUDA, cuBLAS) and Intel MKL ☆542 · Updated 4 years ago
- ☆114 · Updated 10 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆204 · Updated 4 months ago
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆271 · Updated 2 months ago
- Pipeline Parallelism for PyTorch ☆736 · Updated 4 months ago
- Slicing a PyTorch Tensor Into Parallel Shards ☆298 · Updated 3 years ago
- Ring attention implementation with flash attention ☆645 · Updated 3 weeks ago
- ☆140 · Updated 8 months ago
- Deep Learning Framework Performance Profiling Toolkit ☆281 · Updated 2 years ago
- Tutel MoE: An Optimized Mixture-of-Experts Implementation ☆746 · Updated this week
- A baseline repository of Auto-Parallelism in Training Neural Networks ☆141 · Updated 2 years ago
- Scalable PaLM implementation in PyTorch ☆192 · Updated 2 years ago
- A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description. ☆972 · Updated 3 months ago
- ☆127 · Updated 3 weeks ago
- BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads. ☆834 · Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆84 · Updated 2 weeks ago
- Research and development for optimizing transformers ☆125 · Updated 3 years ago
- HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆136 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). ☆438 · Updated 4 months ago
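Several entries above (the ring attention and flash attention repository, the memory-efficient Triton attention operators) build on the same core trick: attention over keys/values that arrive block by block, merged with a running "online" softmax so no full score matrix is ever materialized. The sketch below is an illustrative single-process simulation of that merge, not any listed repo's API; the function names and the two-block split are assumptions for the example.

```python
import math

def attend_block(q, k_blk, v_blk, state):
    """Fold one K/V block into the running (max, normalizer, weighted-sum) state."""
    m, s, acc = state
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
    m_new = max([m] + scores)
    scale = math.exp(m - m_new)              # rescale old partials to the new max
    s *= scale
    acc = [a * scale for a in acc]
    for sc, v in zip(scores, v_blk):
        w = math.exp(sc - m_new)
        s += w
        acc = [a + w * vi for a, vi in zip(acc, v)]
    return (m_new, s, acc)

def ring_attention(q, k_blocks, v_blocks):
    """Visit the K/V blocks one at a time, as devices on a ring would
    after each rotation, then normalize the accumulated result."""
    state = (-math.inf, 0.0, [0.0] * len(q))
    for k_blk, v_blk in zip(k_blocks, v_blocks):
        state = attend_block(q, k_blk, v_blk, state)
    m, s, acc = state
    return [a / s for a in acc]
```

Because the running max and normalizer are carried between blocks, the result is identical (up to floating-point error) to softmax attention computed over all keys at once, which is what lets the real systems shard K/V across devices.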