ELS-RD / kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
★ 1,585 · Updated last year
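A minimal sketch of that one-line usage, following the pattern in kernl's README; the model name is an arbitrary example, and a CUDA GPU with a recent PyTorch install is assumed:

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model

# Load any supported Hugging Face transformer and move it to the GPU.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The advertised single line: replaces eligible modules with fused Triton kernels in place.
optimize_model(model)

# Inference then runs as usual; fp16 autocast is a typical pairing.
inputs = {
    "input_ids": torch.randint(0, 1000, (1, 128), device="cuda"),
    "attention_mask": torch.ones(1, 128, dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```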
Alternatives and similar repositories for kernl
Users interested in kernl are comparing it to the libraries listed below.
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ★ 1,066 · Updated last year
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ★ 1,689 · Updated last year
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ★ 2,073 · Updated 4 months ago
- Library for 8-bit optimizers and quantization routines (see the usage sketch after this list). ★ 780 · Updated 3 years ago
- Pipeline Parallelism for PyTorch ★ 780 · Updated last year
- Cramming the training of a (BERT-type) language model into limited compute. ★ 1,350 · Updated last year
- An open-source efficient deep learning framework/compiler, written in Python. ★ 733 · Updated 2 months ago
- Maximal update parametrization (µP) ★ 1,613 · Updated last year
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ★ 2,212 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… ★ 2,883 · Updated this week
- Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data. ★ 1,006 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ★ 1,841 · Updated this week
- PyTorch extensions for high performance and large scale training. ★ 3,385 · Updated 6 months ago
- Training and serving large-scale neural networks with auto parallelization. ★ 3,160 · Updated last year
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment ★ 790 · Updated 2 years ago
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ★ 4,689 · Updated last week
- The official implementation of "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" ★ 977 · Updated last year
- A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries. ★ 1,231 · Updated this week
- Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch ★ 875 · Updated 2 years ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ★ 1,543 · Updated last year
- Automatically split your PyTorch models across multiple GPUs for training & inference ★ 658 · Updated last year
- Foundation Architecture for (M)LLMs ★ 3,119 · Updated last year
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ★ 1,425 · Updated last year
- A CPU+GPU profiling library that provides access to timeline traces and hardware performance counters. ★ 883 · Updated last week
- A PyTorch quantization backend for Optimum ★ 1,004 · Updated 2 weeks ago
- Fast Inference Solutions for BLOOM ★ 565 · Updated last year
- Serving multiple LoRA-finetuned LLMs as one ★ 1,110 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ★ 1,295 · Updated 8 months ago
- ★ 413 · Updated last year
- PyTorch native quantization and sparsity for training and inference ★ 2,489 · Updated this week
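For the 8-bit optimizer entry above, a minimal sketch of the drop-in usage pattern, assuming the bitsandbytes-style API (`bnb.optim.Adam8bit`) and a CUDA GPU; the toy model and hyperparameters are illustrative only, and the mapping of this API to the listed (unnamed) repository is an assumption:

```python
import torch
import bitsandbytes as bnb

# Any PyTorch module works; a single linear layer keeps the example small.
model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for torch.optim.Adam that stores optimizer state in 8-bit,
# cutting optimizer memory substantially for large models.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

# One standard training step: forward, backward, update.
x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```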