wangkuiyi / huggingface-tokenizer-in-cxx ☆49 · Updated last year
Related projects:
- Universal cross-platform tokenizers binding to HF and sentencepiece ☆246 · Updated last month
- Minimal example of using a traced huggingface transformers model with libtorch ☆35 · Updated 4 years ago
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime ☆141 · Updated 3 weeks ago
- C++ implementation of Qwen2 and Llama 3 ☆34 · Updated 3 months ago
- The Triton backend for the ONNX Runtime. ☆122 · Updated this week
- A toolkit to help optimize large ONNX models ☆53 · Updated this week
- BERT implemented in pure C++ ☆30 · Updated 4 years ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆148 · Updated last month
- Dynamic batching library for deep learning inference, with tutorials for LLM and GPT scenarios. ☆81 · Updated last month
- Common source, scripts and utilities shared across all Triton repositories. ☆62 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆15 · Updated 3 months ago
- The Triton backend for TensorRT. ☆59 · Updated last week
- Transformer-related optimizations, including BERT and GPT ☆17 · Updated last year
- Simplify ONNX models larger than 2 GB ☆41 · Updated 6 months ago
- Transformer-related optimizations, including BERT and GPT ☆58 · Updated 11 months ago
- ggml implementation of embedding models, including SentenceTransformer and BGE ☆50 · Updated 8 months ago
- A quantization algorithm for LLMs ☆98 · Updated 2 months ago
- GGML implementation of BERT model with Python bindings and quantization. ☆51 · Updated 7 months ago
- LLaMa/RWKV onnx models, quantization and testcases ☆345 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs ☆78 · Updated 4 months ago
- A project that optimizes Whisper for low-latency inference using NVIDIA TensorRT ☆47 · Updated 2 months ago
- Whisper in TensorRT-LLM ☆14 · Updated 11 months ago
- Easy and Efficient Quantization for Transformers ☆172 · Updated 2 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆130 · Updated 3 weeks ago