adbar / py3langid
Faster, modernized fork of the language identification tool langid.py
★55 · Updated 5 months ago
Alternatives and similar repositories for py3langid:
Users who are interested in py3langid are comparing it to the libraries listed below.
- Targeted language identifier, based on FastText and Hunspell. ★34 · Updated 2 months ago
- 🔥 Use Hugging Face text and token classification pipelines directly in spaCy ★63 · Updated last year
- Seed Machine Translation Data ★31 · Updated 5 months ago
- Source code for the Apple reproduction ★32 · Updated 4 years ago
- Parse and convert numbers written in French, English, Spanish, Portuguese, German and Catalan into their digit representation. ★106 · Updated last week
- Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target. ★50 · Updated 2 weeks ago
- Python3 bindings for the Compact Language Detector v3 (CLD3) ★151 · Updated last year
- A simple client for doccano API. ★85 · Updated 11 months ago
- A Python truecasing utility that restores case information for texts ★88 · Updated 2 years ago
- Fast and robust date extraction from web pages, with Python or on the command-line ★126 · Updated 4 months ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ★154 · Updated 5 months ago
- Efficient teacher-student models and scripts to make them ★50 · Updated last year
- spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Hugging Face in your spaCy pipeline allowing you to i… ★46 · Updated last year
- Tool to fix bitexts and tag near-duplicates for removal ★30 · Updated 3 months ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ★157 · Updated 10 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ★17 · Updated last month
- Dehyphenation of broken text (mainly German), i.e., extracted from a PDF ★38 · Updated 3 years ago
- Augmenty is an augmentation library based on spaCy for augmenting texts. ★153 · Updated 11 months ago
- Extracts plain text, language identification and more metadata from WARC records ★21 · Updated 2 months ago
- Bilingual term extractor ★53 · Updated last year
- A Python module for word inflections designed for use with spaCy. ★92 · Updated 5 years ago
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata ★94 · Updated 2 years ago
- Bilingual sentence similarity classifier using TensorFlow ★21 · Updated 5 years ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★74 · Updated last month
- An easy-to-use package to restore punctuation of the text. ★115 · Updated 2 years ago
- Correction of spaces with character-based neural language models. ★13 · Updated 2 years ago
- ★72 · Updated last month
- BERT and ELECTRA models trained on Europeana Newspapers ★38 · Updated 3 years ago
- Generate a SQLite database from Wikipedia & Wikidata dumps. ★35 · Updated last year
- 110k Dutch Book Reviews Dataset for Sentiment Analysis ★29 · Updated last year