Pipeline for pulling and processing online language model pretraining data from the web
★179 · Jul 31, 2023 · Updated 2 years ago
Alternatives and similar repositories for olm-datasets
Users that are interested in olm-datasets are comparing it to the libraries listed below.
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset. ★98 · Feb 9, 2023 · Updated 3 years ago
- A utility for storing and reading files for Korean LM training 💾 ★35 · Oct 15, 2025 · Updated 8 months ago
- Korean Named Entity Corpus ★25 · May 12, 2023 · Updated 3 years ago
- ★11 · Oct 3, 2021 · Updated 4 years ago
- All-in-one text de-duplication ★764 · Mar 9, 2026 · Updated 3 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ★318 · Mar 20, 2023 · Updated 3 years ago
- A project for Korean automatic spacing ★12 · Aug 3, 2020 · Updated 5 years ago
- An example of fine-tuning kogpt with oslo. ★23 · Aug 26, 2022 · Updated 3 years ago
- Long-context pretrained encoder-decoder models ★97 · Oct 28, 2022 · Updated 3 years ago
- ★1,272 · Jul 30, 2024 · Updated last year
- Convert Numerical Representations to Korean Pronunciation ★14 · Apr 20, 2020 · Updated 6 years ago
- Train 🤗 transformers with DeepSpeed: ZeRO-2, ZeRO-3