ELS-RD / kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
⭐1,585 · Updated last year
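A minimal sketch of that single-line usage, assuming kernl's documented `optimize_model` entry point in `kernl.model_optimization`; the model name, input shapes, and autocast settings below are illustrative, and a CUDA GPU is required:

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model  # kernl's entry point (assumed unchanged)

# Load any Hugging Face transformer; "bert-base-uncased" is illustrative.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The "single line of code": kernl rewrites the model's compute graph
# to use its fused Triton kernels.
optimize_model(model)

# Inference proceeds as usual; fp16 autocast is typical for the speedup.
inputs = {
    "input_ids": torch.randint(0, 30522, (1, 128), device="cuda"),
    "attention_mask": torch.ones(1, 128, dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```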
Alternatives and similar repositories for kernl
Users interested in kernl are comparing it to the libraries listed below.
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ⭐1,067 · Updated last year
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ⭐1,688 · Updated last year
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ⭐2,079 · Updated 5 months ago
- Pipeline Parallelism for PyTorch ⭐784 · Updated last year
- Library for 8-bit optimizers and quantization routines. ⭐779 · Updated 3 years ago
- An open-source, efficient deep learning framework/compiler, written in Python. ⭐736 · Updated 2 months ago
- Maximal update parametrization (µP) ⭐1,636 · Updated last year
- Cramming the training of a (BERT-type) language model into limited compute. ⭐1,353 · Updated last year
- Training and serving large-scale neural networks with auto parallelization. ⭐3,167 · Updated last year
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ⭐4,695 · Updated last month
- Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data. ⭐1,006 · Updated last year
- PyTorch extensions for high performance and large scale training. ⭐3,386 · Updated 7 months ago
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ⭐2,221 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ⭐2,954 · Updated this week
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ⭐1,842 · Updated 2 weeks ago
- Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch ⭐876 · Updated 2 years ago
- A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries. ⭐1,237 · Updated last week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ⭐1,562 · Updated last year
- Automatically split your PyTorch models on multiple GPUs for training & inference ⭐658 · Updated last year
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ⭐1,426 · Updated last year
- A PyTorch quantization backend for Optimum ⭐1,011 · Updated last week
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment ⭐791 · Updated 2 years ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,307 · Updated 8 months ago
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ⭐829 · Updated 3 months ago
- A CPU+GPU profiling library that provides access to timeline traces and hardware performance counters. ⭐897 · Updated last week
- The official implementation of "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" ⭐979 · Updated last year
- Tutel MoE: an optimized Mixture-of-Experts library, supporting GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4 ⭐943 · Updated 3 weeks ago
- PyTorch-native quantization and sparsity for training and inference ⭐2,531 · Updated this week
- Transformer-related optimization, including BERT and GPT ⭐6,355 · Updated last year
- Fast inference solutions for BLOOM ⭐564 · Updated last year