SumanthRH / tokenization
A comprehensive deep dive into the world of tokens
☆218 Updated 7 months ago
Alternatives and similar repositories for tokenization:
Users who are interested in tokenization are comparing it to the libraries listed below.
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free ☆230 Updated 3 months ago
- ☆92 Updated last year
- ☆496 Updated 2 months ago
- A bagel, with everything. ☆316 Updated 10 months ago
- awesome synthetic (text) datasets ☆259 Updated 3 months ago
- Generate textbook-quality synthetic LLM pretraining data ☆494 Updated last year
- Automatic Evals for LLMs ☆201 Updated this week
- Convert all of libgen to high quality markdown ☆245 Updated last year
- Manage scalable open LLM inference endpoints in Slurm clusters ☆252 Updated 7 months ago
- A set of scripts and notebooks on LLM fine-tuning and dataset creation ☆102 Updated 4 months ago
- Comprehensive analysis of the differences in performance between QLoRA, LoRA, and full fine-tunes. ☆82 Updated last year
- ☆501 Updated 5 months ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆100 Updated 10 months ago
- batched loras ☆338 Updated last year
- experiments with inference on llama ☆104 Updated 8 months ago
- RuLES: a benchmark for evaluating rule-following in language models ☆217 Updated this week
- an implementation of Self-Extend, to expand the context window via grouped attention ☆118 Updated last year
- ☆207 Updated 7 months ago
- Website for hosting the Open Foundation Models Cheat Sheet. ☆262 Updated 7 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 6 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆129 Updated last week
- An introduction to LLM Sampling ☆75 Updated 2 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆231 Updated 3 months ago
- A puzzle to learn about prompting ☆124 Updated last year
- Full finetuning of large language models without large memory requirements ☆93 Updated last year
- Fast & more realistic evaluation of chat language models. Includes leaderboard. ☆183 Updated last year
- Fast bare-bones BPE for modern tokenizer training ☆145 Updated 3 months ago
- Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI ☆222 Updated 9 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆145 Updated 3 weeks ago
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆451 Updated 10 months ago