Pipeline for pulling and processing online language model pretraining data from the web
★179 · Jul 31, 2023 · Updated 2 years ago
Alternatives and similar repositories for olm-datasets
Users that are interested in olm-datasets are comparing it to the libraries listed below.
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any Hugging Face text dataset. ★97 · Feb 9, 2023 · Updated 3 years ago
- A utility for storing and reading files for Korean LM training 💾 ★35 · Oct 15, 2025 · Updated 7 months ago
- ★11 · Oct 3, 2021 · Updated 4 years ago
- All-in-one text de-duplication ★760 · Mar 9, 2026 · Updated 3 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ★318 · Mar 20, 2023 · Updated 3 years ago
- A project for Korean automatic spacing ★12 · Aug 3, 2020 · Updated 5 years ago
- An example of fine-tuning kogpt with oslo. ★23 · Aug 26, 2022 · Updated 3 years ago
- Long-context pretrained encoder-decoder models ★96 · Oct 28, 2022 · Updated 3 years ago
- ★1,273 · Jul 30, 2024 · Updated last year
- Convert Numerical Representations to Korean Pronunciation ★14 · Apr 20, 2020 · Updated 6 years ago
- Train 🤗 transformers with DeepSpeed: ZeRO-2, ZeRO-3 ★23 · May 20, 2021 · Updated 5 years ago
- NSMC, KorSTS ... fine-tunings ★18 · Feb 23, 2022 · Updated 4 years ago
- Anh - LAION's multilingual assistant datasets and models ★28 · Apr 5, 2023 · Updated 3 years ago
- ★23 · Jul 10, 2023 · Updated 2 years ago
- ★10 · Dec 15, 2022 · Updated 3 years ago
- Deploy KoGPT with Triton Inference Server ★14 · Nov 18, 2022 · Updated 3 years ago
- An open collection of implementation tips, tricks and resources for training large language models ★501 · Mar 8, 2023 · Updated 3 years ago
- Machine Generated Captions for Best Artworks ★22 · Sep 21, 2022 · Updated 3 years ago
- Adversarial Test Dataset for Korean Multi-turn Response Selection ★34 · Dec 16, 2021 · Updated 4 years ago
- Calculating Expected Time for training LLM. ★39 · Apr 17, 2023 · Updated 3 years ago
- Training HuggingFace models using fastai ★11 · Jul 22, 2021 · Updated 4 years ago
- TPU support for the fastai library ★14 · Apr 15, 2021 · Updated 5 years ago
- PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an… ★287 · Oct 20, 2022 · Updated 3 years ago
- All-in-one repository for Fine-tuning & Pretraining (Large) Language Models ★15 · Mar 8, 2023 · Updated 3 years ago
- Public helpers for huggingface.co. Now lives in https://github.com/huggingface/huggingface_hub ★13 · Jul 10, 2022 · Updated 3 years ago
- ★184 · May 26, 2023 · Updated 3 years ago
- OSLO: Open Source framework for Large-scale model Optimization ★309 · Aug 25, 2022 · Updated 3 years ago
- ★16 · Aug 10, 2022 · Updated 3 years ago
- An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library. ★21 · Nov 28, 2022 · Updated 3 years ago
- A curated list of papers and resources for text-to-image evaluation. ★30 · Sep 6, 2023 · Updated 2 years ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ★215 · Aug 28, 2024 · Updated last year
- Efficient few-shot learning with Sentence Transformers ★2,743 · May 26, 2026 · Updated 2 weeks ago
- Task-based datasets, preprocessing, and evaluation for sequence models. ★594 · May 12, 2026 · Updated last month
- ★357 · Mar 17, 2024 · Updated 2 years ago
- [ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning ★99 · Apr 26, 2023 · Updated 3 years ago
- TIFMO: Textual Inference Forward-chaining MOdule ★12 · Apr 25, 2014 · Updated 12 years ago
- Provides functionality for converting Modu Corpus (모두의 말뭉치) data into a form convenient for analysis. ★11 · Mar 2, 2022 · Updated 4 years ago
- ★20 · Nov 23, 2022 · Updated 3 years ago
- 🤗 A collection of templates for Hugging Face Spaces ★34 · Oct 9, 2023 · Updated 2 years ago