EleutherAI / tokengrams
Efficiently computing & storing token n-grams from large corpora
☆26 Updated 9 months ago
Alternatives and similar repositories for tokengrams
Users who are interested in tokengrams are comparing it to the libraries listed below.
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated last year
- NLP with Rust for Python 🦀🐍 ☆64 Updated 2 months ago
- minimal pytorch implementation of bm25 (with sparse tensors) ☆104 Updated last year
- Understanding how features learned by neural networks evolve throughout training ☆36 Updated 9 months ago
- ☆41 Updated 3 months ago
- See https://github.com/cuda-mode/triton-index/ instead! ☆11 Updated last year
- URL downloader supporting checkpointing and continuous checksumming. ☆19 Updated last year
- Supercharge huggingface transformers with model parallelism. ☆77 Updated last week
- ☆49 Updated 5 months ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/… ☆27 Updated last year
- ☆39 Updated last year
- This repository contains code for cleaning your training data of benchmark data to help combat data snooping. ☆25 Updated 2 years ago
- A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers. ☆40 Updated 3 weeks ago
- PyTorch implementation for MRL ☆19 Updated last year
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆62 Updated 2 months ago
- Training code for Sparse Autoencoders on Embedding models ☆38 Updated 5 months ago
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts ☆24 Updated last year
- A library for squeakily cleaning and filtering language datasets. ☆47 Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆25 Updated 8 months ago
- Pre-train Static Word Embeddings ☆85 Updated 2 months ago
- An unofficial implementation of the Infini-gram model proposed by Liu et al. (2024) ☆33 Updated last year
- ☆57 Updated 3 weeks ago
- ☆29 Updated last year
- Experiments for efforts to train a new and improved t5 ☆76 Updated last year
- Chat Markup Language conversation library ☆55 Updated last year
- Small python package to measure OCR quality and other related metrics. ☆25 Updated last year
- ☆37 Updated last year
- new optimizer ☆20 Updated last year
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he… ☆31 Updated last year
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆49 Updated last year