michaelfeil / candle-flash-attn-v3
⭐ 12 · Updated 6 months ago
Alternatives and similar repositories for candle-flash-attn-v3
Users interested in candle-flash-attn-v3 are comparing it to the libraries listed below.
- Build compute kernels ⭐ 106 · Updated last week
- Implement LLaVA using Candle ⭐ 15 · Updated last year
- Simple high-throughput inference library ⭐ 127 · Updated 3 months ago
- Code for fine-tuning LLMs with GRPO for Rust programming, using cargo as feedback ⭐ 101 · Updated 5 months ago
- Rust crate for some audio utilities ⭐ 26 · Updated 5 months ago
- A high-performance constrained decoding engine based on context-free grammar in Rust ⭐ 55 · Updated 3 months ago
- Proof of concept for running moshi/hibiki using WebRTC ⭐ 20 · Updated 5 months ago
- Inference engine for GLiNER models, in Rust ⭐ 64 · Updated last month
- GPU-based FFT written in Rust and CubeCL ⭐ 23 · Updated 2 months ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ⭐ 66 · Updated 5 months ago
- ⭐ 21 · Updated 5 months ago
- vLLM adapter for a TGIS-compatible gRPC server. ⭐ 37 · Updated this week
- Fast serverless LLM inference, in Rust. ⭐ 88 · Updated 5 months ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ⭐ 199 · Updated last month
- Load compute kernels from the Hub ⭐ 244 · Updated this week
- Inference Llama 2 in one file of zero-dependency, zero-unsafe Rust ⭐ 38 · Updated 2 years ago
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts ⭐ 24 · Updated last year
- Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and… ⭐ 29 · Updated 5 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ⭐ 141 · Updated last year
- ⭐ 12 · Updated last year
- Inference Llama 2 with a model compiled to native code by TorchInductor ⭐ 14 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ⭐ 73 · Updated last year
- Cray-LM unified training and inference stack. ⭐ 22 · Updated 6 months ago
- High-performance safetensors model loader ⭐ 53 · Updated last month
- ⭐ 24 · Updated 4 months ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ⭐ 106 · Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ⭐ 61 · Updated 3 months ago
- Make Triton easier ⭐ 47 · Updated last year
- ⭐ 39 · Updated 2 years ago
- TensorRT-LLM server with Structured Outputs (JSON) built with Rust ⭐ 58 · Updated 4 months ago