mit-han-lab / tinychat-tutorial

☆73

Alternatives and similar repositories for tinychat-tutorial

Users who are interested in tinychat-tutorial are comparing it to the repositories listed below

ByteDance-Seed / cudaLLM
☆115 Updated last month
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference.
☆44 Updated 4 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆82 Updated 8 months ago
flashinfer-ai / cutlass-viz
☆64 Updated 5 months ago
luliyucoordinate / cute-flash-attention
Implements Flash Attention using CuTe.
☆96 Updated 9 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆119 Updated 5 months ago
mit-han-lab / parallel-computing-tutorial
☆174 Updated 2 years ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆84 Updated 3 weeks ago
InternLM / turbomind
☆95 Updated 6 months ago
ankan-ban / llama_cu_awq
LLaMA INT4 CUDA inference with AWQ
☆55 Updated 8 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 10 months ago
microsoft / AttentionEngine
☆99 Updated 4 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆111 Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline.
☆116 Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59 Updated 6 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆97 Updated 3 months ago
tile-ai / AttentionEngine
☆50 Updated 4 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆220 Updated 2 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111 Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39 Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆71 Updated 5 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124 Updated 4 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activations and quantized weights, extracted from FasterTransformer
☆94 Updated 3 weeks ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
☆60 Updated this week
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆265 Updated 2 months ago
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆83 Updated 3 months ago
HanGuo97 / hilt
☆33 Updated this week
DD-DuDa / BitDistiller
[ACL 2024] A novel QAT framework with self-distillation to enhance ultra-low-bit LLMs.
☆123 Updated last year
triton-lang / kernels
☆90 Updated 11 months ago
thu-ml / Jetfire-INT8Training
☆55 Updated last year