sayakpaul / count-tokens-hf-datasetsLinks

This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.

☆27

Alternatives and similar repositories for count-tokens-hf-datasets

Users that are interested in count-tokens-hf-datasets are comparing it to the libraries listed below

Sorting:

huggingface / olm-training
Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset.
☆96Updated 2 years ago
facebookresearch / coder_reviewer_reranking
Official code release for the paper Coder Reviewer Reranking for Code Generation.
☆45Updated 2 years ago
allenai / EmbeddingRecycling
Embedding Recycling for Language models
☆38Updated 2 years ago
lucidrains / memory-editable-transformer
My explorations into editing the knowledge and memories of an attention network
☆34Updated 2 years ago
cimeister / typical-sampling
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
☆81Updated 3 years ago
stas00 / porting
Helper scripts and notes that were used while porting various nlp models
☆48Updated 3 years ago
gsarti / t5-flax-gcp
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP
☆58Updated 3 years ago
HomebrewML / HomebrewNLP-torch
A case study of efficient training of large language models using commodity hardware.
☆68Updated 3 years ago
lucidrains / tableformer-pytorch
Implementation of TableFormer, Robust Transformer Modeling for Table-Text Encoding, in Pytorch
☆39Updated 3 years ago
SeanNaren / minGPT
A minimal PyTorch Lightning OpenAI GPT w DeepSpeed Training!
☆113Updated 2 years ago
google-research-datasets / QAmeleon
QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning P…
☆35Updated 2 years ago
lucidrains / token-shift-gpt
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing
☆50Updated 3 years ago
HendrikStrobelt / LMdiff
A diff tool for language models
☆44Updated last year
leogao2 / lm_dataformat
☆78Updated last year
martiansideofthemoon / hurdles-longform-qa
Official repository with code and data accompanying the NAACL 2021 paper "Hurdles to Progress in Long-form Question Answering" (https://a…
☆46Updated 3 years ago
facebookresearch / EditEval
An instruction-based benchmark for text improvements.
☆143Updated 2 years ago
lucidrains / n-grammer-pytorch
Implementation of N-Grammer, augmenting Transformers with latent n-grams, in Pytorch
☆76Updated 2 years ago
krandiash / quinine
A library to create and manage configuration files, especially for machine learning projects.
☆80Updated 3 years ago
lucidrains / g-mlp-gpt
GPT, but made only out of MLPs
☆89Updated 4 years ago
CarperAI / squeakily
A library for squeakily cleaning and filtering language datasets.
☆47Updated 2 years ago
hadasah / btm
☆76Updated last year
thevasudevgupta / bigbird
Google's BigBird (Jax/Flax & PyTorch) @ 🤗Transformers
☆49Updated 2 years ago
martiansideofthemoon / rankgen
Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx…
☆138Updated 2 years ago
sayakpaul / BiT-jax2tf
This repository hosts the code to port NumPy model weights of BiT-ResNets to TensorFlow SavedModel format.
☆14Updated 3 years ago
joeljang / ELM
[ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning
☆98Updated 2 years ago
huggingface / data-measurements-tool
Developing tools to automatically analyze datasets
☆75Updated last year
Rallio67 / language-model-agents
Experiments with generating opensource language model assistants
☆97Updated 2 years ago
zphang / minimal-opt
☆67Updated 3 years ago
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆27Updated last year
EleutherAI / semantic-memorization
☆44Updated 11 months ago