wikimedia / sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
☆73 Updated 3 weeks ago
Alternatives and similar repositories for sentencex
Users who are interested in sentencex are comparing it to the libraries listed below.
- Extracts plain text, language identification and more metadata from WARC records ☆23 Updated 2 months ago
- Library for fast text representation and classification. ☆31 Updated last year
- Faster, modernized fork of the language identification tool langid.py ☆61 Updated last year
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆53 Updated 2 months ago
- Seed Machine Translation Data ☆33 Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆38 Updated 3 months ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆180 Updated 6 months ago
- The Open Parallel Corpus ☆79 Updated last week
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆74 Updated 8 months ago
- Aksharamukha Python Library ☆55 Updated 10 months ago
- An approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction (mirror of https://… ☆37 Updated 2 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆65 Updated this week
- Multilingual sentence alignment using sentence embeddings ☆131 Updated last year
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆69 Updated 5 years ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 Updated last year
- Logical structure analysis for visually structured documents ☆94 Updated 3 years ago
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ☆16 Updated 3 months ago
- A list of awesome Machine Translation frameworks, libraries, software and papers ☆194 Updated last year
- OpusFilter - Parallel corpus processing toolkit ☆113 Updated 3 weeks ago
- Efficient teacher-student models and scripts to make them ☆52 Updated last year
- 80x faster and 95% accurate language identification with Fasttext ☆163 Updated last year
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ☆80 Updated 2 years ago
- These are lists for a variety of languages containing words that are distinctive to each language. ☆38 Updated 3 years ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx like syntax. ☆59 Updated last year
- A modern, interlingual wordnet interface for Python ☆276 Updated this week
- Download and load spaCy models on-the-fly ☆15 Updated 2 years ago
- A polite and user-friendly downloader for Common Crawl data ☆63 Updated 3 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆34 Updated 9 months ago
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 Updated 2 years ago
- ☆55 Updated last year