mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆29 · Updated last month
Related projects
Alternatives and complementary repositories for fastspell
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆67 · Updated 6 months ago
- Tool to fix bitexts and tag near-duplicates for removal ☆29 · Updated 3 months ago
- Bicleaner fork that uses neural networks ☆38 · Updated 3 months ago
- These are lists for a variety of languages containing words that are distinctive to each language. ☆34 · Updated 2 years ago
- Extracts plain text, language identification and more metadata from WARC records ☆20 · Updated 3 months ago
- MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki ☆22 · Updated this week
- Python Finite-State Toolkit ☆45 · Updated last week
- Faster, modernized fork of the language identification tool langid.py ☆48 · Updated 5 months ago
- OpusFilter - Parallel corpus processing toolkit ☆102 · Updated 3 months ago
- Library for fast text representation and classification. ☆28 · Updated 10 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆28 · Updated last year
- Source code for the Apple reproduction ☆31 · Updated 3 years ago
- Transform TMX to text ☆29 · Updated last year
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆48 · Updated 2 months ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆21 · Updated 2 years ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ☆150 · Updated 5 months ago
- BERT and ELECTRA models trained on Europeana Newspapers ☆36 · Updated 2 years ago
- A tiny BERT for low-resource monolingual models ☆29 · Updated last month
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆75 · Updated 2 months ago
- Bilingual term extractor ☆52 · Updated 11 months ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆62 · Updated 8 months ago
- A survey of corpora for Germanic low-resource languages and dialects ☆24 · Updated 3 months ago
- Augmenty is an augmentation library based on spaCy for augmenting texts. ☆151 · Updated 5 months ago
- Curriculum training ☆16 · Updated 2 months ago
- NTREX -- News Test References for MT Evaluation ☆75 · Updated 5 months ago
- Searching in-memory corpus with Corpus Query Language (CQL) ☆18 · Updated 3 years ago
- In the wild extraction of entities that are found using Flair and displayed using a very elegant front-end. ☆69 · Updated last year
- Caucasus languages focused multilingual and monolingual corpuses for Natural Language Processing (NLP) ☆33 · Updated 2 weeks ago