PraveenRaja42 / Tiny-Stories-GPT
A minimal PyTorch re-implementation of GPT (Generative Pretrained Transformer) language model training
☆15 Updated last year
Alternatives and similar repositories for Tiny-Stories-GPT
Users who are interested in Tiny-Stories-GPT are comparing it to the repositories listed below
- ☆38 Updated 9 months ago
- A micro, JAX-like function transformation engine, microjax ☆32 Updated 6 months ago
- ☆27 Updated 10 months ago
- ☆9 Updated 6 months ago
- Just large language models. Hackable, with as little abstraction as possible. Done for my own purposes, feel free to rip. ☆44 Updated last year
- A place to store reusable transformer components of my own creation or found on the interwebs ☆55 Updated this week
- Simple repository for training small reasoning models ☆27 Updated 3 months ago
- NanoGPT-speedrunning for the poor T4 enjoyers ☆65 Updated 3 weeks ago
- gzip Predicts Data-dependent Scaling Laws ☆35 Updated 11 months ago
- This repository contains a simple llama3 implementation in pure JAX. ☆63 Updated 3 months ago
- LLM training in simple, raw C/CUDA ☆14 Updated 5 months ago
- ☆22 Updated last year
- Collection of autoregressive model implementations ☆85 Updated 3 weeks ago
- ☆20 Updated last year
- ☆61 Updated last year
- DiCE: The Infinitely Differentiable Monte-Carlo Estimator ☆31 Updated last year
- Rust implementation of micrograd ☆51 Updated 10 months ago
- NanoGPT (124M) quality in 2.67B tokens ☆28 Updated 2 weeks ago
- ☆40 Updated last year
- ☆53 Updated 5 months ago
- A repository for training nanogpt-based chess-playing language models. ☆24 Updated last year
- Various handy scripts to quickly set up new Linux and Windows sandboxes, containers and WSL. ☆40 Updated 3 weeks ago
- Full finetuning of large language models without large memory requirements ☆94 Updated last year
- Code accompanying the paper "LaProp: a Better Way to Combine Momentum with Adaptive Gradient" ☆28 Updated 4 years ago
- Generative cellular automaton-like learning environments for RL. ☆19 Updated 3 months ago
- Andrej Karpathy's micrograd implemented in C ☆28 Updated 9 months ago
- ☆41 Updated 4 months ago
- Optimizing bit-level Jaccard Index and Population Counts for large-scale quantized Vector Search via Harley-Seal CSA and Lookup Tables ☆18 Updated this week
- Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)* ☆82 Updated last year
- An alternative way to calculate self-attention ☆18 Updated 11 months ago