adbar / py3langid
Faster, modernized fork of the language identification tool langid.py
★55 · Updated 4 months ago
Alternatives and similar repositories for py3langid:
Users who are interested in py3langid are comparing it to the libraries listed below.
- A sentence segmentation library with wide language support optimized for speed and utility. ★59 · Updated 6 months ago
- 🔥 Use Hugging Face text and token classification pipelines directly in spaCy ★63 · Updated last year
- 80x faster and 95% accurate language identification with Fasttext ★151 · Updated last year
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ★154 · Updated 4 months ago
- Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic pr… ★68 · Updated last month
- A versioned python wrapper package for cmudict (https://github.com/cmusphinx/cmudict). ★61 · Updated 2 weeks ago
- A tiny BERT for low-resource monolingual models ★31 · Updated 5 months ago
- Extracts plain text, language identification and more metadata from WARC records ★21 · Updated 3 weeks ago
- Python3 bindings for the Compact Language Detector v3 (CLD3) ★150 · Updated last year
- ★72 · Updated last month
- Fast and robust date extraction from web pages, with Python or on the command-line ★123 · Updated 2 months ago
- Parse and convert numbers written in French, English, Spanish, Portuguese, German and Catalan into their digit representation. ★105 · Updated last month
- Targeted language identifier, based on FastText and Hunspell. ★34 · Updated last month
- Seed Machine Translation Data ★30 · Updated 4 months ago
- An experimental implementation of Burrows' Delta in Python 3 ★21 · Updated 3 years ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★70 · Updated 11 months ago
- Source code for the Apple reproduction ★32 · Updated 3 years ago
- An easy-to-use package to restore punctuation of the text. ★114 · Updated last year
- Work with static vector models ★22 · Updated 2 months ago
- spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing you to i… ★46 · Updated 11 months ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ★67 · Updated 2 years ago
- Efficient teacher-student models and scripts to make them ★50 · Updated last year
- Searching in-memory corpus with Corpus Query Language (CQL) ★19 · Updated 3 months ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ★155 · Updated 9 months ago
- Multilingual sentence alignment using sentence embeddings ★113 · Updated 4 months ago
- A Python truecasing utility that restores case information for texts ★88 · Updated 2 years ago
- Fast and accurate natural language detection. Detector written in Python. Nito-ELD, ELD. ★16 · Updated last year
- Python Finite-State Toolkit ★53 · Updated last month
- Multilingual syllable annotation pipeline component for spaCy ★39 · Updated 2 years ago
- Translation demonstrator ★32 · Updated 4 years ago