mikex86 / LibreCuda
☆1,000 · Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for LibreCuda
- NVIDIA Linux open GPU with P2P support ☆903 · Updated 5 months ago
- Tile primitives for speedy kernels ☆1,645 · Updated this week
- SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines. ☆555 · Updated this week
- Stateful load balancer custom-tailored for llama.cpp ☆557 · Updated last week
- llama3.np is a pure NumPy implementation of the Llama 3 model. ☆973 · Updated 5 months ago
- Felafax is building AI infra for non-NVIDIA GPUs ☆503 · Updated last week
- Nvidia Instruction Set Specification Generator ☆215 · Updated 4 months ago
- Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, an… ☆975 · Updated this week
- Richard is gaining power ☆174 · Updated 2 months ago
- nanoGPT style version of Llama 3.1 ☆1,236 · Updated 3 months ago
- Apple AMX Instruction Set ☆992 · Updated 5 months ago
- Llama 2 Everywhere (L2E) ☆1,511 · Updated 2 weeks ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆615 · Updated 7 months ago
- Reverse engineered Linux driver for the Apple Neural Engine (ANE). ☆366 · Updated 8 months ago
- A modern model graph visualizer and debugger ☆1,046 · Updated this week
- Minimal LLM inference in Rust ☆917 · Updated 2 weeks ago
- GGUF implementation in C as a library and a CLI tool ☆242 · Updated 4 months ago
- Deep learning accelerator architectures requiring half the multipliers ☆262 · Updated 7 months ago
- Fast, Multi-threaded Matrix Multiplication in C ☆181 · Updated 3 weeks ago
- High performance AI inference stack. Built for production. @ziglang / @openxla / MLIR / @bazelbuild ☆1,639 · Updated this week
- LLM-powered lossless compression tool ☆252 · Updated 2 months ago
- Because tinygrad got out of hand with line count ☆143 · Updated 3 weeks ago
- throwaway GPT inference ☆139 · Updated 5 months ago
- NanoGPT (124M) quality in 7.8 8xH100-minutes ☆965 · Updated this week
- UNet diffusion model in pure CUDA ☆573 · Updated 4 months ago
- Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and in… ☆1,486 · Updated this week