mlc-ai / tokenizers-cpp
Universal cross-platform tokenizers binding to HF and sentencepiece
☆307 Updated last week
Alternatives and similar repositories for tokenizers-cpp:
Users who are interested in tokenizers-cpp are comparing it to the libraries listed below.
- LLaMa/RWKV ONNX models, quantization and test cases ☆356 Updated last year
- GPTQ inference Triton kernel ☆297 Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆740 Updated 5 months ago
- ☆410 Updated last year
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime. ☆160 Updated 2 weeks ago
- ☆176 Updated 5 months ago
- Running BERT without Padding ☆470 Updated 2 years ago
- Common source, scripts and utilities for creating Triton backends. ☆310 Updated 2 weeks ago
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆469 Updated 11 months ago
- ☆127 Updated 2 months ago
- Common utilities for ONNX converters ☆259 Updated 3 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆237 Updated 4 months ago
- The Triton backend for the ONNX Runtime. ☆139 Updated this week
- Transformer-related optimization, including BERT, GPT ☆59 Updated last year
- onnxruntime-extensions: A specialized pre- and post-processing library for ONNX Runtime ☆361 Updated this week
- ☆157 Updated this week
- ☆139 Updated 10 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆530 Updated 2 weeks ago
- ☆57 Updated 2 years ago
- ☆124 Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆296 Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆745 Updated 5 months ago
- The Triton TensorRT-LLM Backend ☆790 Updated this week
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆198 Updated 7 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆559 Updated last week
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆237 Updated 11 months ago
- ☆231 Updated last week
- Easy and Efficient Quantization for Transformers ☆192 Updated 3 weeks ago
- ☆157 Updated last year
- Actively maintained ONNX Optimizer ☆673 Updated last month