EleutherAI / tokengrams
Efficiently computing & storing token n-grams from large corpora
☆26 · Updated last year
Alternatives and similar repositories for tokengrams
Users interested in tokengrams are comparing it to the libraries listed below.
- NLP with Rust for Python 🦀🐍 · ☆71 · Updated 8 months ago
- decontamination · ☆24 · Updated 2 months ago
- Experiments for efforts to train a new and improved t5 · ☆76 · Updated last year
- Training code for Sparse Autoencoders on Embedding models · ☆39 · Updated 11 months ago
- Understanding how features learned by neural networks evolve throughout training · ☆41 · Updated last year
- minimal pytorch implementation of bm25 (with sparse tensors) · ☆104 · Updated 3 months ago
- An unofficial implementation of the Infini-gram model proposed by Liu et al. (2024) · ☆33 · Updated last year
- Code for SaGe subword tokenizer (EACL 2023) · ☆27 · Updated last year
- Plug-and-play Search Interfaces with Pyserini and Hugging Face · ☆32 · Updated 2 years ago
- ☆41 · Updated last year
- ☆53 · Updated 11 months ago
- ☆29 · Updated 2 years ago
- ☆94 · Updated last week
- Supercharge huggingface transformers with model parallelism. · ☆77 · Updated 6 months ago
- This repository contains code for cleaning your training data of benchmark data to help combat data snooping. · ☆27 · Updated 2 years ago
- PyTorch implementation for MRL · ☆20 · Updated last year
- ☆59 · Updated 2 months ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/… · ☆28 · Updated last year
- ☆90 · Updated 7 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. · ☆74 · Updated last week
- URL downloader supporting checkpointing and continuous checksumming. · ☆19 · Updated 2 years ago
- [COLM '24] Source-Aware Training Enables Knowledge Attribution in Language Models · ☆19 · Updated 10 months ago
- A library for squeakily cleaning and filtering language datasets. · ☆49 · Updated 2 years ago
- ☆44 · Updated last year
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts · ☆23 · Updated last year
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he… · ☆31 · Updated 2 years ago
- Aioli: A unified optimization framework for language model data mixing · ☆32 · Updated last year
- ☆38 · Updated last year
- Alice in Wonderland code base for experiments and raw experiments data · ☆131 · Updated 4 months ago
- A framework for pitting LLMs against each other in an evolving library of games · ☆35 · Updated 9 months ago