TLLM_QMM extracts the quantized-kernel implementation from NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.
☆16Jul 5, 2024Updated last year
Alternatives and similar repositories for TLLM_QMM
Users who are interested in TLLM_QMM are comparing it to the libraries listed below.
- Awesome Quantization Paper lists with Codes☆10Feb 24, 2021Updated 5 years ago
- ☆26Feb 17, 2025Updated last year
- ☆27Jan 8, 2024Updated 2 years ago
- ☆10Jun 28, 2019Updated 6 years ago
- PyTorch code for our paper "Progressive Binarization with Semi-Structured Pruning for LLMs"☆13Mar 11, 2026Updated last month
- Deep Variational Information Bottleneck (DVIB) in PyTorch.☆10Apr 25, 2020Updated 6 years ago
- PyTorch implementation of "Deep Transferring Quantization" (ECCV2020)☆18Jun 22, 2022Updated 3 years ago
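- A very efficient string-matching tool that supports forward/backward maximum-matching word segmentation and exact multi-pattern string matching☆16Jul 29, 2023Updated 2 years ago
- Back-office admin system for an order-brushing platform, providing management of merchants, staff, orders, tasks, and more☆11May 8, 2019Updated 6 years ago
- A collection of LaTeX templates for papers, résumés, and the like☆12Apr 17, 2022Updated 4 years ago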
- Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all…☆35Aug 28, 2023Updated 2 years ago
- Quantizes PyTorch models, supporting post-training quantization and quantization-aware training methods☆14Jun 15, 2023Updated 2 years ago
- Adaptive floating-point based numerical format for resilient deep learning☆14Apr 11, 2022Updated 4 years ago
- ☆13Oct 5, 2020Updated 5 years ago
- Implementation of BERT in pure C++☆37Apr 29, 2020Updated 6 years ago
- ECE 5745 Tutorial 8: SRAM Generators☆15Mar 5, 2022Updated 4 years ago
- Generate a Verilog Source file and testbench file for a given Moore FSM☆17Nov 18, 2012Updated 13 years ago
- Routing framework based on Promise using CoffeeScript☆13Jun 28, 2015Updated 10 years ago
- ☆10Jun 6, 2023Updated 2 years ago
- High performance NCCL plugin for Bagua.☆15Sep 15, 2021Updated 4 years ago
- PyTorch implementation of RAPQ, IJCAI 2022☆23Jul 19, 2023Updated 2 years ago
- A highly optimized LLM inference acceleration engine for Llama and its variants.☆904Mar 18, 2026Updated last month
- A C++-based RPC framework☆12Oct 28, 2021Updated 4 years ago
- Converting Chinese sentences into pinyin sequences, implemented in C++, very fast and easy to deploy.☆21Jan 5, 2026Updated 4 months ago
- ROS TensorRT Inference Nodes for DIGITS on the Jetson☆15Apr 6, 2019Updated 7 years ago
- PyTorch implementation of "MLP-Mixer: An all-MLP Architecture for Vision" Tolstikhin et al. (2021)☆31May 13, 2021Updated 4 years ago
- ☆22Jul 11, 2023Updated 2 years ago
- 2020 Language and Intelligence Technology Competition: relation extraction task (https://aistudio.baidu.com/aistudio/competition/detail/31?lang=zh_CN)☆24May 19, 2020Updated 5 years ago
- Optimized Parallel Tiled Approach to perform Matrix Multiplication by taking advantage of the lower latency, higher bandwidth shared memo…☆16Sep 24, 2017Updated 8 years ago
- OneFlow Serving☆20Apr 10, 2025Updated last year
- Multiple GEMM operators are constructed with cutlass to support LLM inference.☆20Aug 3, 2025Updated 9 months ago
- SRAM☆24Sep 6, 2020Updated 5 years ago
- CentOS Docker images, built weekly with the latest security updates☆11Updated this week
- This project is for dealing with dynamic multiobjective optimization problems using a Multiobjective Evolutionary Algorithm.☆21May 6, 2018Updated 7 years ago
- ☆22Apr 27, 2026Updated last week
- LRU Cache for node.js/browser.☆12Jul 5, 2017Updated 8 years ago
- ☆16Feb 24, 2026Updated 2 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving☆83Sep 15, 2025Updated 7 months ago
- ArXiv daily dump and viewer using GitHub Actions - luvata.github.io/arxive☆14Updated this week