kuterd / nv_isa_solver
Nvidia Instruction Set Specification Generator
☆243 · Updated 7 months ago
Alternatives and similar repositories for nv_isa_solver:
Users interested in nv_isa_solver are comparing it to the repositories listed below.
- High-Performance SGEMM on CUDA devices ☆76 · Updated last month
- GPUOcelot: A dynamic compilation framework for PTX ☆169 · Updated last week
- Apple GPU microarchitecture ☆498 · Updated 4 months ago
- Tutorials on tinygrad ☆342 · Updated this week
- ctypes wrappers for HIP, CUDA, and OpenCL ☆128 · Updated 7 months ago
- Fastest kernels written from scratch ☆173 · Updated this week
- Visualization of cache-optimized matrix multiplication ☆104 · Updated 5 years ago
- ☆428 · Updated 2 months ago
- pytorch from scratch in pure C/CUDA and python ☆40 · Updated 4 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆59 · Updated 3 months ago
- Tenstorrent MLIR compiler ☆91 · Updated this week
- Fast CUDA matrix multiplication from scratch ☆634 · Updated last year
- parallelized hyperdimensional tictactoe ☆112 · Updated 5 months ago
- ☆72 · Updated this week
- Solve puzzles to improve your tinygrad skills! ☆111 · Updated 5 months ago
- An experimental CPU backend for Triton ☆90 · Updated this week
- LLM training in simple, raw C/CUDA ☆91 · Updated 9 months ago
- Sniff CUDA ioctls ☆189 · Updated last year
- Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain. ☆130 · Updated this week
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆426 · Updated last year
- Ultra low overhead NVIDIA GPU telemetry plugin for telegraf with memory temperature readings. ☆62 · Updated 7 months ago
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆183 · Updated this week
- Unified compiler/runtime for interfacing with PyTorch Dynamo. ☆100 · Updated this week
- A minimal Tensor Processing Unit (TPU) inspired by Google's TPUv1. ☆129 · Updated 6 months ago
- MLIR-based partitioning system ☆62 · Updated this week
- extensible collectives library in triton ☆83 · Updated 4 months ago
- High-Performance FP32 Matrix Multiplication on CPU ☆333 · Updated this week
- Awesome resources for GPUs ☆546 · Updated last year
- Exploring the scalable matrix extension of the Apple M4 processor ☆164 · Updated 3 months ago