abetlen / ggml-python
Python bindings for ggml
☆140 · Updated 6 months ago
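Since the bindings expose ggml's C API more or less one-to-one, a minimal usage sketch looks roughly like the following (adapted from the style of the project README; exact symbol names track upstream ggml and may differ between versions):

```python
import ggml

# Allocate a ggml context backed by a fixed 16 MB buffer.
params = ggml.ggml_init_params(mem_size=16 * 1024 * 1024, mem_buffer=None)
ctx = ggml.ggml_init(params=params)

# Build a symbolic graph for f = a * x^2 + b.
x = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
a = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
b = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
x2 = ggml.ggml_mul(ctx, x, x)
f = ggml.ggml_add(ctx, ggml.ggml_mul(ctx, a, x2), b)

gf = ggml.ggml_new_graph(ctx)
ggml.ggml_build_forward_expand(gf, f)

# Set inputs and evaluate: f = 3 * 2^2 + 4 = 16.
ggml.ggml_set_f32(x, 2.0)
ggml.ggml_set_f32(a, 3.0)
ggml.ggml_set_f32(b, 4.0)
ggml.ggml_graph_compute_with_ctx(ctx, gf, 1)  # single thread

print(ggml.ggml_get_f32_1d(f, 0))  # 16.0
ggml.ggml_free(ctx)
```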
Alternatives and similar repositories for ggml-python:
Users interested in ggml-python are comparing it to the libraries listed below.
- RWKV in nanoGPT style ☆187 · Updated 9 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆271 · Updated last year
- LLM-based code completion engine ☆181 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated 5 months ago
- GPTQ inference Triton kernel ☆298 · Updated last year
- Inference of Mamba models in pure C ☆186 · Updated last year
- ☆528 · Updated 4 months ago
- Python bindings for llama.cpp ☆200 · Updated last year
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆362 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆196 · Updated 8 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆770 · Updated this week
- SoTA Transformers with C-backend for fast inference on your CPU. ☆311 · Updated last year
- A torchless C++ RWKV implementation using 8-bit quantization, written in CUDA/HIP/Vulkan for maximum compatibility and minimum dependenci… ☆310 · Updated last year
- ☆203 · Updated 2 months ago
- This is our own implementation of 'Layer Selective Rank Reduction' ☆233 · Updated 9 months ago
- Google TPU optimizations for transformers models ☆103 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆193 · Updated last month
- Low-Rank adapter extraction for fine-tuned transformers models ☆171 · Updated 10 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆394 · Updated this week
- Fast low-bit matmul kernels in Triton ☆267 · Updated this week
- 1.58-bit LLaMa model ☆82 · Updated 11 months ago
- Landmark Attention: Random-Access Infinite Context Length for Transformers QLoRA ☆123 · Updated last year
- ☆49 · Updated last year
- FlashAttention (Metal Port) ☆457 · Updated 6 months ago
- Full finetuning of large language models without large memory requirements ☆93 · Updated last year
- llama.cpp fork with additional SOTA quants and improved performance ☆217 · Updated this week
- Train your own small BitNet model ☆65 · Updated 5 months ago
- Experiments on speculative sampling with Llama models ☆125 · Updated last year
- Inference Vision Transformer (ViT) in plain C/C++ with ggml ☆262 · Updated 11 months ago