MiniXC / opensubtitles-dataloader
Loads the OpenSubtitles v2018 dataset without loading everything into memory at once. Works well with PyTorch.
☆13 Updated 4 years ago
Related projects:
- MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert… ☆48 Updated 3 years ago
- An extension package of 🤗 Datasets that provides support for executing arbitrary SQL queries on HF datasets ☆31 Updated 7 months ago
- spaCy match and replace, maintaining conjugation ☆34 Updated last year
- A file utility for accessing both local and remote files through a unified interface. ☆36 Updated last month
- Extremely easy to use sequence to sequence library with attention, for text to text conversion tasks. ☆39 Updated 3 years ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆73 Updated last week
- Documentation effort for the BookCorpus dataset ☆30 Updated 3 years ago
- Question Generation - Question Answering for Automatic Flashcards ☆64 Updated 2 years ago
- LTG-Bert ☆25 Updated 8 months ago
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) ☆59 Updated last year
- ☆30 Updated 4 years ago
- Python Finite-State Toolkit ☆39 Updated last month
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any Hugging Face text dataset. ☆91 Updated last year
- A Python module for word inflections designed for use with spaCy. ☆90 Updated 4 years ago
- HomebrewNLP in JAX flavour for maintainable TPU-Training ☆46 Updated 8 months ago
- URL downloader supporting checkpointing and continuous checksumming. ☆19 Updated 9 months ago
- Efficiently computing & storing token n-grams from large corpora ☆15 Updated 2 weeks ago
- Test prompts for GPT-J-6B and the resulting AI-generated texts ☆53 Updated 3 years ago
- This repository contains the code for "Generating Datasets with Pretrained Language Models". ☆188 Updated 3 years ago
- A Benchmark Dataset for Understanding Disfluencies in Question Answering ☆60 Updated 3 years ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. ☆168 Updated 2 years ago
- Downloads and parses a subtitle dataset from opensubtitles.org ☆16 Updated 5 months ago
- Generate a SQLite database from Wikipedia & Wikidata dumps. ☆30 Updated 5 months ago
- Build a dialog dataset from online books in many languages ☆71 Updated last year
- GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆85 Updated 2 months ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆61 Updated 6 months ago
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" ☆27 Updated last week
- **ARCHIVED** Filesystem interface to 🤗 Hub ☆56 Updated last year
- Reduce the size of pretrained Hugging Face models via vocabulary trimming. ☆39 Updated last year
- Custom Natural Language Processing with big and small models 🌲🌱 ☆68 Updated 3 years ago