neuralmagic / deepsparse
Sparsity-aware deep learning inference runtime for CPUs
☆ 2,975 · Updated 2 months ago
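DeepSparse gets its speedup by exploiting zeros in network weights, which sparsification tools produce by pruning low-magnitude values. A minimal magnitude-pruning sketch in plain NumPy, illustrating the basic idea only (this is not DeepSparse's or SparseML's actual API):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
p = magnitude_prune(w, 0.75)
print(f"{np.mean(p == 0):.2f}")  # prints 0.75: three quarters of weights zeroed
```

Real sparsification recipes (gradual pruning schedules, structured N:M patterns, distillation) are far more involved, but they all reduce to deciding which weights can be zeroed so a sparsity-aware runtime can skip them.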
Related projects:
- Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models (☆ 2,039 · updated last month)
- Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes (☆ 364 · updated 2 months ago)
- 🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools (☆ 2,459 · updated this week)
- Supercharge Your Model Training (☆ 5,116 · updated this week)
- SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX R… (☆ 2,146 · updated this week)
- ML model optimization product to accelerate inference. (☆ 318 · updated 5 months ago)
- Top-level directory for documentation and general content (☆ 120 · updated 2 months ago)
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackab… (☆ 1,519 · updated 7 months ago)
- Accessible large language models via k-bit quantization for PyTorch. (☆ 6,029 · updated this week)
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… (☆ 4,531 · updated this week)
- A machine learning compiler for GPUs, CPUs, and ML accelerators (☆ 2,577 · updated this week)
- A Data Streaming Library for Efficient Neural Network Training (☆ 1,076 · updated this week)
- Transformer-related optimization, including BERT, GPT (☆ 5,773 · updated 5 months ago)
- PyTorch extensions for high-performance and large-scale training. (☆ 3,149 · updated 2 weeks ago)
- Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors a… (☆ 1,131 · updated this week)
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. (☆ 1,843 · updated last week)
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (☆ 2,333 · updated 2 months ago)
- ⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl… (☆ 2,106 · updated 3 weeks ago)
- Efficient, scalable, and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀 (☆ 1,643 · updated 10 months ago)
- A Python-level JIT compiler designed to make unmodified PyTorch programs faster. (☆ 994 · updated 5 months ago)
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… (☆ 1,811 · updated this week)
- Simple, safe way to store and distribute tensors (☆ 2,755 · updated 2 weeks ago)
- A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch. (☆ 2,206 · updated this week)
- A framework for few-shot evaluation of language models. (☆ 6,426 · updated this week)
- Train to 94% on CIFAR-10 in <6.3 seconds on a single A100. Or ~95.79% in ~110 seconds (or less!) (☆ 1,213 · updated 10 months ago)
- Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers". (☆ 1,875 · updated 5 months ago)
- Foundation Architecture for (M)LLMs (☆ 3,003 · updated 5 months ago)
- Training and serving large-scale neural networks with auto parallelization. (☆ 3,045 · updated 9 months ago)
- LLM training code for Databricks foundation models (☆ 3,964 · updated this week)
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving (☆ 1,645 · updated this week)
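Alongside sparsity, low-bit quantization (GPTQ, AWQ, k-bit quantization, the INT8/FP8/INT4 compression toolkits above) is the other recurring compression theme in this list. A minimal symmetric int8 round-trip in plain NumPy, showing the scale/round/clip pattern these libraries refine with calibration and activation-aware scaling (an illustrative sketch, not any listed library's actual API):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
x = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int8(x)
err = float(np.max(np.abs(dequantize(q, s) - x)))
print(q.dtype, f"max abs error {err:.4f}")  # int8 storage, error bounded by scale/2
```

The storage win is 4x over fp32 before any sparsity; the methods in the repositories above differ mainly in how they pick the scales (per-channel, activation-aware, or via second-order weight updates) to keep that rounding error from hurting accuracy.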