fnl / syntok

Text tokenization and sentence segmentation (segtok v2)

☆211

Alternatives and similar repositories for syntok

Users who are interested in syntok are comparing it to the libraries listed below.

fnl / segtok
Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe…
☆171 · Dec 15, 2021 · Updated 4 years ago
nipunsadvilkar / pySBD
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
☆927 · Aug 20, 2024 · Updated last year
mediacloud / sentence-splitter
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.
☆258 · Nov 7, 2022 · Updated 3 years ago
notAI-tech / deepsegment
A sentence segmenter that actually works!
☆304 · Aug 18, 2020 · Updated 5 years ago
kevinlu1248 / pyate
PYthon Automated Term Extraction
☆318 · Feb 8, 2023 · Updated 3 years ago
pyconll / pyconll
A minimal, pure Python library to interface with CoNLL-U format files.
☆155 · Jul 6, 2026 · Updated 2 weeks ago
segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆1,320 · Jul 6, 2026 · Updated 2 weeks ago
dbmdz / deep-eos
General-Purpose Neural Networks for Sentence Boundary Detection
☆74 · Mar 27, 2023 · Updated 3 years ago
intfloat / uts
python package for unsupervised text segmentation.
☆14 · Oct 31, 2016 · Updated 9 years ago
jenojp / negspacy
spaCy pipeline object for negating concepts in text
☆280 · Apr 20, 2026 · Updated 3 months ago
webis-de / summary-explorer
Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.
☆45 · May 13, 2024 · Updated 2 years ago
osirrc / jig
Jig for the Open-Source IR Replicability Challenge (OSIRRC)
☆13 · Dec 8, 2022 · Updated 3 years ago
boudinfl / pke
Python Keyphrase Extraction module
☆1,589 · Jul 12, 2023 · Updated 3 years ago
czcorpus / InterText_server
Collaborative on-line editor for aligned parallel texts.
☆14 · Jul 2, 2026 · Updated 3 weeks ago
MMesgar / neural_coherence_model
EMNLP-18
☆17 · Dec 21, 2021 · Updated 4 years ago
chartbeat-labs / textacy
NLP, before and after spaCy
☆2,239 · Sep 22, 2023 · Updated 2 years ago
searchableai / ChainCQG
☆13 · Feb 11, 2021 · Updated 5 years ago
flashxio / knor
A repo to allow validation of performance results in the knor paper and provide a fast, scalable k-means implementation.
☆15 · Mar 31, 2020 · Updated 6 years ago
asahi417 / ConditionalVariationalAutoEncoder
Implement Conditional VAE and train on MNIST by tensorflow 1.3.0.
☆10 · Nov 7, 2017 · Updated 8 years ago
JMendes1995 / py_heideltime
☆18 · Nov 19, 2023 · Updated 2 years ago
rsling / texrex
texrex web page cleaning & ClaraX random walk crawler
☆11 · Dec 13, 2021 · Updated 4 years ago
flairNLP / fabricator
[EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.
☆110 · May 16, 2024 · Updated 2 years ago
cslu-nlp / DetectorMorse
Fast supervised sentence boundary detection using the averaged perceptron
☆90 · Dec 8, 2018 · Updated 7 years ago
adobe / NLP-Cube
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
☆562 · Nov 3, 2024 · Updated last year
NorskRegnesentral / NeuralTextSanitizer
Neural models for detecting and masking personal information from texts
☆16 · Nov 25, 2022 · Updated 3 years ago
NorskRegnesentral / skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
☆925 · Sep 2, 2024 · Updated last year
tsproisl / SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
☆153 · Dec 9, 2024 · Updated last year
hplt-project / sacremoses
Python port of Moses tokenizer, truecaser and normalizer
☆497 · Feb 6, 2026 · Updated 5 months ago
informagi / GeeseDB
Graph Engine for Exploration and Search
☆42 · Jan 26, 2024 · Updated 2 years ago
msg-systems / holmes-extractor
Information extraction from English and German texts based on predicate logic
☆395 · Jul 8, 2022 · Updated 4 years ago
MaartenGr / PolyFuzz
Fuzzy string matching, grouping, and evaluation.
☆801 · Jul 10, 2025 · Updated last year
biocaddie / elasticsearch-queryexpansion-plugin
A simple ElasticSearch plugin wrapping around the search endpoint to provide Rocchio query expansion
☆18 · Aug 5, 2017 · Updated 8 years ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆819 · Feb 25, 2025 · Updated last year
MilaNLProc / contextualized-topic-models
A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coher…
☆1,272 · Jul 24, 2025 · Updated last year
textpipe / textpipe
Textpipe: clean and extract metadata from text
☆302 · Jun 9, 2021 · Updated 5 years ago
bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
microsoft / BlingFire
A lightning fast Finite State machine and REgular expression manipulation library.
☆1,892 · Dec 8, 2024 · Updated last year
PKSHATechnology-Research / camphr
Camphr - NLP library for creating pipeline components
☆336 · Dec 9, 2022 · Updated 3 years ago
bitextor / bitextor
Bitextor generates translation memories from multilingual websites
☆299 · Nov 11, 2024 · Updated last year