gpustack / gguf-parser-go
Review/Check GGUF files and estimate the memory usage and maximum tokens per second.
☆135 · Updated 2 weeks ago
Alternatives and similar repositories for gguf-parser-go:
Users who are interested in gguf-parser-go are comparing it to the libraries listed below.
- LM inference server implementation based on *.cpp. ☆154 · Updated this week
- A text-to-speech and speech-to-text server compatible with the OpenAI API, supporting Whisper, FunASR, Bark, and CosyVoice backends. ☆87 · Updated 2 months ago
- Comparison of Language Model Inference Engines ☆208 · Updated 3 months ago
- Unsloth fine-tuning notebooks for Google Colab, Kaggle, Hugging Face, and more. ☆105 · Updated this week
- Automatically quantize GGUF models ☆164 · Updated last week
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆201 · Updated 8 months ago
- llama.cpp fork with additional SOTA quants and improved performance ☆231 · Updated this week
- Lightweight inference server for OpenVINO ☆143 · Updated this week
- LLM inference in C/C++ ☆67 · Updated this week
- ggml implementation of embedding models, including SentenceTransformer and BGE ☆56 · Updated last year
- ☆81 · Updated 2 weeks ago
- CPU inference for the DeepSeek family of large language models in pure C++ ☆282 · Updated last month
- Self-hosted Hugging Face mirror service. ☆146 · Updated 2 weeks ago
- Download models from the Ollama library, without Ollama ☆66 · Updated 4 months ago
- Transparent proxy server with on-demand model swapping for llama.cpp (or any local OpenAI-compatible server) ☆482 · Updated this week
- VSCode AI coding assistant powered by a self-hosted llama.cpp endpoint. ☆181 · Updated 2 months ago
- Open-source text embedding models with an OpenAI-compatible API ☆150 · Updated 8 months ago
- An endpoint server for efficiently serving quantized open-source LLMs for code. ☆54 · Updated last year
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆558 · Updated this week
- Something similar to Apple Intelligence? ☆59 · Updated 8 months ago
- Unsloth Studio ☆73 · Updated 2 weeks ago
- Building an open version of OpenAI o1 via reasoning traces (Groq, Ollama, Anthropic, Gemini, OpenAI, Azure supported). Demo: https://hugging… ☆175 · Updated 5 months ago
- Run multiple resource-heavy large models (LMs) on the same machine with a limited amount of VRAM/other resources by exposing them on differe… ☆55 · Updated last month
- A memory framework for Large Language Models and Agents. ☆177 · Updated 3 months ago
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆148 · Updated 10 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- The main repository for building Pascal-compatible versions of ML applications and libraries. ☆63 · Updated last week
- Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It submodules NVIDIA's TensorRT-LLM for GPU a… ☆43 · Updated 6 months ago
- Production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆406 · Updated this week
- ☆83 · Updated 3 months ago