ELS-RD / kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
★1,585 · Updated last year
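The "single line of code" refers to Kernl's in-place model rewrite. A minimal sketch of that usage, based on the README's advertised entry point (the `optimize_model` import path and the fp16/CUDA expectations are assumptions drawn from that documentation):

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model  # assumed public entry point

# Load a Hugging Face transformer and move it to the GPU (Kernl targets CUDA).
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

# The advertised single line: replaces attention and other modules
# in place with fused OpenAI Triton kernels.
optimize_model(model)

# Inference then runs as usual; the fused kernels expect fp16 autocast.
inputs = {
    "input_ids": torch.ones((1, 16), dtype=torch.long, device="cuda"),
    "attention_mask": torch.ones((1, 16), dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```

Expect the first call per input shape to be slow: kernels are compiled and CUDA graphs are captured during warmup, and later calls with the same shapes reuse the cached graphs.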
Alternatives and similar repositories for kernl
Users interested in kernl are comparing it to the libraries listed below:
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. ★1,068 · Updated last year
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 ★1,688 · Updated last year
- Pipeline Parallelism for PyTorch ★783 · Updated last year
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ★2,081 · Updated 5 months ago
- Cramming the training of a (BERT-type) language model into limited compute. ★1,354 · Updated last year
- An open-source efficient deep learning framework/compiler, written in Python. ★737 · Updated 3 months ago
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ★4,697 · Updated last month
- Library for 8-bit optimizers and quantization routines. ★779 · Updated 3 years ago
- Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data. ★1,006 · Updated last year
- PyTorch extensions for high performance and large scale training. ★3,388 · Updated 7 months ago
- maximal update parametrization (µP) ★1,638 · Updated last year
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". ★2,234 · Updated last year
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ★2,993 · Updated this week
- The official implementation of "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" ★980 · Updated last year
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training ★1,849 · Updated this week
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment ★791 · Updated 2 years ago
- Training and serving large-scale neural networks with auto parallelization. ★3,167 · Updated 2 years ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ★1,570 · Updated last year
- Automatically split your PyTorch models on multiple GPUs for training & inference ★657 · Updated last year
- SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, … ★2,543 · Updated last week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ★901 · Updated this week
- 🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization… (see the sketch after this list) ★3,203 · Updated last week
- A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries. ★1,240 · Updated this week
- A PyTorch quantization backend for optimum ★1,014 · Updated 3 weeks ago
- Fast Inference Solutions for BLOOM ★564 · Updated last year
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ★1,426 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ★1,307 · Updated 9 months ago
- PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. ★834 · Updated 4 months ago
- Large-scale model inference. ★628 · Updated 2 years ago
- Tutel MoE: an optimized Mixture-of-Experts library, supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4 ★948 · Updated last month
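For comparison with Kernl's single-call rewrite, the Optimum entry above exposes a similarly small surface. A minimal sketch of its ONNX Runtime path, assuming the `optimum[onnxruntime]` extra is installed; the checkpoint name is just a common public example:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Export the PyTorch checkpoint to ONNX and wrap it in an ONNX Runtime
# session in one call (export=True triggers the conversion).
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The accelerated model drops into the regular transformers pipeline API.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Kernl makes transformers fast."))
```

The design difference worth noting: Kernl rewrites the live PyTorch module in place, while Optimum swaps the runtime underneath the same `transformers` interface.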