ssiu / flash-attention-turing
☆22 · Updated this week
Alternatives and similar repositories for flash-attention-turing:
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- LM inference server implementation based on *.cpp. ☆173 · Updated this week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆49 · Updated 5 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆263 · Updated 6 months ago
- ☆185 · Updated 6 months ago
- Production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆494 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆243 · Updated last week
- ☆130 · Updated last month
- The homepage of OneBit model quantization framework. ☆175 · Updated 2 months ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆242 · Updated last year
- ☆83 · Updated last month
- Fast and memory-efficient exact attention ☆171 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆446 · Updated last month
- vLLM Router ☆26 · Updated last year
- ☆43 · Updated last week
- 📖 A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉 ☆212 · Updated last month
- Advanced Quantization Algorithm for LLMs/VLMs. ☆438 · Updated this week
- ☆119 · Updated last week
- A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily. ☆168 · Updated 3 weeks ago
- VPTQ, A Flexible and Extreme low-bit quantization algorithm ☆630 · Updated this week
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆171 · Updated last week
- xllamacpp - a Python wrapper of llama.cpp ☆35 · Updated 2 weeks ago
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆578 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆45 · Updated 9 months ago
- Run DeepSeek-R1 GGUFs on KTransformers ☆224 · Updated last month
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆809 · Updated 7 months ago
- LLM inference in C/C++ ☆71 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆102 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆98 · Updated last year
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆163 · Updated 9 months ago