ELS-RD / kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
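For context on the one-line claim: kernl exposes a single model-optimization entry point. A minimal sketch, assuming the `optimize_model` helper from `kernl.model_optimization` as documented in the project's README, a recent NVIDIA GPU, and an eval-mode model already on CUDA:

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model  # kernl's entry point

# Load a Hugging Face transformer and move it to the GPU in eval mode.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The "single line": swaps eligible submodules for fused Triton kernels in place.
optimize_model(model)

# Inference then runs through the optimized model as usual.
inputs = {
    "input_ids": torch.randint(0, 1000, (1, 128), device="cuda"),
    "attention_mask": torch.ones(1, 128, dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(**inputs)
```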
⭐1,577 · Updated last year
Alternatives and similar repositories for kernl
Users interested in kernl are comparing it to the libraries listed below.
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster (see the `torch.compile` sketch after this list). ⭐1,056 · Updated last year
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ⭐1,687 · Updated 9 months ago
- Pipeline Parallelism for PyTorch ⭐775 · Updated 11 months ago
- Library for 8-bit optimizers and quantization routines. ⭐773 · Updated 2 years ago
- Cramming the training of a (BERT-type) language model into limited compute. ⭐1,344 · Updated last year
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ⭐2,045 · Updated last month
- maximal update parametrization (µP) ⭐1,579 · Updated last year
- An open-source efficient deep learning framework/compiler, written in Python. ⭐715 · Updated 2 weeks ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… ⭐2,627 · Updated this week
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ⭐2,161 · Updated last year
- PyTorch extensions for high performance and large scale training. ⭐3,355 · Updated 3 months ago
- Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data. ⭐1,007 · Updated last year
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment ⭐791 · Updated 2 years ago
- A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries. ⭐1,217 · Updated this week
- The official implementation of "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" ⭐965 · Updated last year
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ⭐4,670 · Updated 2 weeks ago
- Tutel MoE: an optimized Mixture-of-Experts library, supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4 ⭐885 · Updated this week
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ⭐1,815 · Updated this week
- Training and serving large-scale neural networks with auto parallelization. ⭐3,147 · Updated last year
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ⭐1,471 · Updated last year
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ⭐814 · Updated last week
- A PyTorch quantization backend for Optimum ⭐979 · Updated last month
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ⭐1,406 · Updated last year
- ⭐412 · Updated last year
- A CPU+GPU profiling library that provides access to timeline traces and hardware performance counters. ⭐845 · Updated last week
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,267 · Updated 5 months ago
- Automatically split your PyTorch models on multiple GPUs for training & inference ⭐656 · Updated last year
- Implementation of RETRO, DeepMind's retrieval-based attention net, in PyTorch ⭐870 · Updated last year
- Serving multiple LoRA-finetuned LLMs as one ⭐1,082 · Updated last year
- Microsoft Automatic Mixed Precision Library ⭐616 · Updated 10 months ago
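As referenced in the first item above, TorchDynamo pursues the same one-line goal as kernl; its tracer now ships inside PyTorch itself behind `torch.compile` (PyTorch 2.0+). A minimal sketch for comparison, assuming a CUDA-capable setup; the model and tensor shapes are illustrative:

```python
import torch
import torch.nn as nn

# Any unmodified PyTorch module works; no kernel-level changes are required.
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).cuda()

# TorchDynamo captures Python bytecode and hands FX graphs to a compiler
# backend (TorchInductor by default), much like kernl's optimize_model above.
compiled = torch.compile(model)

x = torch.randn(8, 512, device="cuda")
y = compiled(x)  # first call triggers compilation; later calls reuse the kernels
```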