tpoisonooo / llama.onnx
LLaMa/RWKV ONNX models, quantization, and test cases
☆345 · Updated last year
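The repo packages quantized LLaMa/RWKV ONNX models; as background for the quantization-focused projects listed below, here is a minimal numpy sketch of symmetric per-tensor INT8 weight quantization (the simplest of the schemes these toolkits implement; the helper names are illustrative, not from any of the repos):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor scheme: map [-max|w|, +max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights from the int8 codes.
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q)      # int8 codes
print(w_hat)  # approximation of w, error bounded by scale / 2
```

Real toolkits such as GPTQ/AWQ refine this with per-channel or per-group scales and error-compensating weight updates, but the quantize/dequantize round trip above is the core idea.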
Related projects:
- Export llama to ONNX. ☆91 · Updated 3 months ago
- llm-export can export LLM models to ONNX. ☆193 · Updated this week
- GPTQ inference Triton kernel. ☆273 · Updated last year
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving. ☆399 · Updated 2 weeks ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆562 · Updated 2 weeks ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime. ☆141 · Updated 3 weeks ago
- Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch. ☆226 · Updated this week
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models". ☆240 · Updated 2 weeks ago
- The official implementation of the EMNLP 2023 paper LLM-FP4. ☆156 · Updated 9 months ago
- A quantization algorithm for LLMs. ☆98 · Updated 2 months ago
- FlashInfer: Kernel Library for LLM Serving. ☆1,143 · Updated this week
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052. ☆451 · Updated 6 months ago
- The official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit…". ☆227 · Updated this week
- Universal cross-platform tokenizer bindings for HF and SentencePiece. ☆246 · Updated last month
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM). ☆208 · Updated 4 months ago
- BitBLAS is a library for mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆342 · Updated this week
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, sparsity, distillat… ☆439 · Updated this week
- A model compression and acceleration toolbox based on PyTorch. ☆325 · Updated 8 months ago
- Reorder-based post-training quantization for large language models. ☆178 · Updated last year
- A throughput-oriented, high-performance serving framework for LLMs. ☆470 · Updated this week
- [ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. ☆672 · Updated last month
- FlagGems is an operator library for large language models implemented in the Triton language. ☆246 · Updated this week
- Simplify >2 GB ONNX models. ☆41 · Updated 6 months ago
- Analyze the inference of large language models (LLMs): computation, storage, transmission, and hardware roofline mod… ☆276 · Updated last week
- The Triton TensorRT-LLM Backend. ☆654 · Updated this week
- Transformer-related optimization, including BERT and GPT. ☆58 · Updated 11 months ago
- Official implementation of EAGLE-1 and EAGLE-2. ☆749 · Updated 3 weeks ago