TLLM_QMM extracts the quantized kernel implementations from NVIDIA's TensorRT-LLM, removes the NvInfer dependency, and exposes an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.
☆16Jul 5, 2024Updated last year
Alternatives and similar repositories for TLLM_QMM
Users who are interested in TLLM_QMM are comparing it to the libraries listed below.
- ☆11Jun 4, 2024Updated last year
- ☆26Feb 17, 2025Updated last year
- Möbius Transformation for Fast Inner Product Search on Graph☆23Jun 3, 2021Updated 4 years ago
- Bagua tutorials.☆13Sep 4, 2022Updated 3 years ago
- ☆27Jan 8, 2024Updated 2 years ago
- Deep Variational Information Bottleneck (DVIB) in PyTorch.☆10Apr 25, 2020Updated 6 years ago
- TensorRT-in-Action is a GitHub repository that provides code examples for using TensorRT, with accompanying Jupyter Notebooks.☆15Jun 1, 2023Updated 2 years ago
- Low Precision Arithmetic Simulation in PyTorch - extension for posit and beyond☆16Dec 9, 2025Updated 5 months ago
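- A highly efficient string-matching tool that supports forward/backward maximum-matching word segmentation and exact multi-pattern string matching.☆16Jul 29, 2023Updated 2 years ago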
- 16 bit serial multiplier in SystemVerilog☆13Oct 13, 2018Updated 7 years ago
- ☆14Mar 21, 2020Updated 6 years ago
- AFP is a hardware-friendly quantization framework for DNNs, which is contributed by Fangxin Liu and Wenbo Zhao.☆13Nov 8, 2021Updated 4 years ago
- A collection of LaTeX templates for papers, résumés, and the like.☆12Apr 17, 2022Updated 4 years ago
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all…☆35Aug 28, 2023Updated 2 years ago
- Quantizes PyTorch models; supports post-training quantization and quantization-aware training methods.☆14Jun 15, 2023Updated 2 years ago
- Adaptive floating-point based numerical format for resilient deep learning☆14Apr 11, 2022Updated 4 years ago
- ECE 5745 Tutorial 8: SRAM Generators☆16Mar 5, 2022Updated 4 years ago
- Generate a Verilog Source file and testbench file for a given Moore FSM☆17Nov 18, 2012Updated 13 years ago
- ☆10Jun 6, 2023Updated 2 years ago
- High performance NCCL plugin for Bagua.☆15Sep 15, 2021Updated 4 years ago
- A C++ port of karpathy/micrograd, a tiny scalar-valued autograd engine and a neural net library☆13Nov 24, 2023Updated 2 years ago
- A C++ hash map/table that uses SIMD (specifically Intel x86 SSE/AVX)☆12Apr 30, 2019Updated 7 years ago
- A-SOUL Encourager (A-SOUL鼓励师) for VS Code☆32Jan 27, 2023Updated 3 years ago
- A highly optimized LLM inference acceleration engine for Llama and its variants.☆905Mar 18, 2026Updated 2 months ago
- My own implementation of CNNs☆14Oct 8, 2015Updated 10 years ago
- A C++-based RPC framework☆12Oct 28, 2021Updated 4 years ago
- **curve_fit_utils** is a Python module containing useful tools for curve fitting☆18Dec 23, 2017Updated 8 years ago
- Converting Chinese sentences into pinyin sequences, implemented in C++, very fast and easy to deploy.☆21Jan 5, 2026Updated 4 months ago
- PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision" Tolstikhin et al. (2021)☆31May 13, 2021Updated 5 years ago
- ☆22Jul 11, 2023Updated 2 years ago
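- Adapted from Su Jianlin's code and ported to Python 3.6: a reading-comprehension model based on dilated convolutions, mixed character-word embeddings, the RAdam optimizer, and Baidu Baike word vectors.☆24Aug 28, 2019Updated 6 years ago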
- MXNet Model Serving☆25Oct 4, 2017Updated 8 years ago
- 32 bit pipelined binary floating point adder using IEEE-754 Single Precision Format in Verilog☆18Aug 27, 2020Updated 5 years ago
- 3D graphics rendering demos from major GPU vendors and platform providers.☆24May 15, 2026Updated last week
- Multiple GEMM operators built with CUTLASS to support LLM inference.☆20Aug 3, 2025Updated 9 months ago
- SRAM☆24Sep 6, 2020Updated 5 years ago
- CentOS Docker images, built weekly with the latest security updates☆11May 18, 2026Updated last week
- Foolbox implementation for NeurIPS 2021 Paper: "Fast Minimum-norm Adversarial Attacks through Adaptive Norm Constraints".☆25Mar 16, 2022Updated 4 years ago
- ☆22May 13, 2026Updated last week