void-main / FasterTransformerLinks

Transformer related optimization, including BERT, GPT

☆59

Alternatives and similar repositories for FasterTransformer

Users that are interested in FasterTransformer are comparing it to the libraries listed below

Sorting:

OpenPPL / ppl.nn.llm
☆140Updated last year
OpenPPL / ppl.llm.serving
☆130Updated 11 months ago
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆477Updated last year
bytedance / effective_transformer
Running BERT without Padding
☆475Updated 3 years ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆109Updated 7 months ago
void-main / fastertransformer_backend
☆22Updated 2 years ago
OpenPPL / ppl.llm.kernel.cuda
☆152Updated 10 months ago
flagos-ai / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆283Updated last year
luchangli03 / export_llama_to_onnx
export llama to onnx
☆137Updated 11 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆43Updated 9 months ago
THUDM / FasterTransformer
Transformer related optimization, including BERT, GPT
☆39Updated 2 years ago
Ascend / AscendSpeed
☆79Updated last year
madsys-dev / deepseekv2-profile
☆152Updated 8 months ago
kwai / Megatron-Kwai
LLM training technologies developed by kwai
☆66Updated last week
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆169Updated last month
neuralmagic / AutoFP8
☆205Updated 6 months ago
anyscale / llm-continuous-batching-benchmarks
☆122Updated last year
YellowOldOdd / SDBI
Simple Dynamic Batching Inference
☆145Updated 3 years ago
AlibabaPAI / FLASHNN
☆102Updated last year
OpenPPL / ppl.pmx
☆60Updated last year
Rayrtfr / FasterTransformer
Transformer related optimization, including BERT, GPT
☆17Updated 2 years ago
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆437Updated 6 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆148Updated 3 months ago
volcengine / veGiantModel
☆218Updated 2 years ago
modelscope / dash-infer
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including …
☆267Updated 3 months ago
AlibabaPAI / torchacc
PyTorch distributed training acceleration framework
☆53Updated 3 months ago
AyakaGEMM / Hands-on-GEMM
☆144Updated last year
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆321Updated 8 months ago
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆146Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated 2 months ago