mlc-ai / tokenizers-cpp
Universal cross-platform tokenizer bindings to Hugging Face tokenizers and SentencePiece
☆274 · Updated this week
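A minimal C++ usage sketch of the encode/decode flow the binding exposes. The factory names (`Tokenizer::FromBlobJSON`, `Tokenizer::FromBlobSentencePiece`), the `<tokenizers_cpp.h>` header, and the `LoadBytesFromFile` helper are assumptions based on the project's documented API and should be checked against the current headers.

```cpp
#include <fstream>
#include <iterator>
#include <string>

#include <tokenizers_cpp.h>

using tokenizers::Tokenizer;

// Hypothetical helper: read an entire file into an in-memory blob,
// since the factory functions take the serialized tokenizer as a string.
static std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(fs),
                     std::istreambuf_iterator<char>());
}

int main() {
  // Hugging Face tokenizers: build from a tokenizer.json blob.
  auto tok = Tokenizer::FromBlobJSON(LoadBytesFromFile("tokenizer.json"));
  // SentencePiece models would use the analogous FromBlobSentencePiece factory.

  std::string prompt = "Hello, world!";
  auto ids = tok->Encode(prompt);        // text -> token ids
  std::string text = tok->Decode(ids);   // token ids -> text
  return text == prompt ? 0 : 1;
}
```

Passing in-memory blobs rather than file paths keeps the binding portable across targets where filesystem access differs (e.g. WASM and mobile).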
Related projects
Alternatives and complementary repositories for tokenizers-cpp
- GPTQ inference Triton kernel ☆284 · Updated last year
- Optimized BERT transformer inference on NVIDIA GPUs (https://arxiv.org/abs/2210.03052) ☆457 · Updated 8 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆624 · Updated 2 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆208 · Updated 3 weeks ago
- LLaMA/RWKV ONNX models, quantization, and test cases ☆352 · Updated last year
- Transformer-related optimizations, including BERT and GPT ☆60 · Updated last year
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆443 · Updated last week
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ and easy export to ONNX/ONNX Runtime ☆149 · Updated last month
- BitBLAS, a library for mixed-precision matrix multiplication, especially for quantized LLM deployment ☆420 · Updated this week
- A collection of memory-efficient attention operators implemented in Triton ☆219 · Updated 5 months ago
- Running BERT without Padding ☆460 · Updated 2 years ago
- Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios ☆29 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- A throughput-oriented, high-performance serving framework for LLMs ☆636 · Updated 2 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆90 · Updated 4 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆238 · Updated last week
- Common utilities for ONNX converters ☆251 · Updated 5 months ago
- Applied AI experiments and examples for PyTorch ☆166 · Updated 3 weeks ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆83 · Updated 6 months ago
- FlashInfer: Kernel Library for LLM Serving ☆1,452 · Updated this week
- A fast communication-overlapping library for tensor parallelism on GPUs ☆224 · Updated 3 weeks ago