bytedance / effective_transformer
Running BERT without Padding
☆460 · Updated 2 years ago
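The padding-removal idea the tagline refers to can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the repository's actual CUDA implementation; the `pack`/`unpack` helper names are invented for this example.

```python
import numpy as np

# Sketch of padding removal: tokens from a padded [batch, max_len, hidden]
# tensor are gathered into one contiguous [total_tokens, hidden] buffer so
# dense layers never touch padding, then scattered back when a padded
# layout is needed again (e.g. for attention).

def pack(padded, lengths):
    """Gather the valid tokens from a padded batch."""
    batch, max_len, hidden = padded.shape
    # mask[i, j] is True iff position j is a real token of sequence i
    mask = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    return padded[mask], mask  # [total_tokens, hidden], [batch, max_len]

def unpack(packed, mask, hidden):
    """Scatter packed tokens back to the padded layout (zeros elsewhere)."""
    out = np.zeros((*mask.shape, hidden), dtype=packed.dtype)
    out[mask] = packed
    return out

padded = np.random.rand(2, 4, 8).astype(np.float32)  # max_len = 4
lengths = [2, 3]                                     # real sequence lengths
packed, mask = pack(padded, lengths)
print(packed.shape)  # (5, 8): 5 real tokens instead of 8 padded slots
restored = unpack(packed, mask, 8)
```

With ragged batches, the dense GEMMs then run over `total_tokens` rows rather than `batch * max_len`, which is where the speedup comes from.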
Related projects
Alternatives and complementary repositories for effective_transformer
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆457 · Updated 8 months ago
- Transformer-related optimization, including BERT and GPT ☆60 · Updated last year
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆263 · Updated last year
- Microsoft Automatic Mixed Precision Library ☆525 · Updated last month
- Fast implementation of BERT inference directly on NVIDIA GPUs (CUDA, cuBLAS) and Intel MKL ☆528 · Updated 4 years ago
- Deep Learning Framework Performance Profiling Toolkit ☆277 · Updated 2 years ago
- Simple Dynamic Batching Inference ☆145 · Updated 2 years ago
- Zero Bubble Pipeline Parallelism ☆281 · Updated last week
- A collection of memory-efficient attention operators implemented in the Triton language. ☆219 · Updated 5 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆357 · Updated this week
- HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆133 · Updated 2 months ago
- Latency and memory analysis of Transformer models for training and inference ☆355 · Updated last week
- A baseline repository of auto-parallelism in training neural networks ☆142 · Updated 2 years ago
- Slicing a PyTorch tensor into parallel shards ☆296 · Updated 3 years ago
- LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training ☆390 · Updated this week
- Best practices for training LLaMA models in Megatron-LM ☆628 · Updated 10 months ago
- PyTorch bindings for CUTLASS grouped GEMM ☆68 · Updated 4 months ago
- BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads. ☆816 · Updated this week
- Prune a model while fine-tuning or training. ☆394 · Updated 2 years ago
- A fast and user-friendly runtime for Transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU ☆1,484 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs) ☆359 · Updated 3 months ago
- Common source, scripts, and utilities for creating Triton backends ☆295 · Updated this week
- Pipeline Parallelism for PyTorch ☆726 · Updated 2 months ago
- Universal cross-platform tokenizer bindings to HF and sentencepiece ☆274 · Updated this week
- High-performance distributed framework for training deep learning recommendation models based on PyTorch ☆397 · Updated this week