LeapBeyond/scrubadub

Clean personally identifiable information from dirty dirty text.

☆431

Alternatives and similar repositories for scrubadub

Users who are interested in scrubadub are comparing it to the libraries listed below.

vnaydionov / pdftables
pdftables
☆17 · Aug 21, 2017 · Updated 8 years ago
sachin-philip / beautifier
Simple library to cleanup and prettify url patterns and emails
☆138 · Jul 10, 2022 · Updated 4 years ago
NathanEpstein / Dora
Tools for exploratory data analysis in Python
☆647 · Aug 5, 2025 · Updated 11 months ago
HHammond / PrettyPandas
A Pandas Styler class for making beautiful tables
☆413 · Jan 8, 2023 · Updated 3 years ago
rhiever / datacleaner
A Python tool that automatically cleans data sets and readies them for analysis.
☆1,080 · May 22, 2019 · Updated 7 years ago
datamade / probablepeople
a python library for parsing unstructured western names into name components.
☆622 · May 15, 2025 · Updated last year
datamade / parserator
A toolkit for making domain-specific probabilistic parsers
☆812 · Sep 26, 2024 · Updated last year
jamesturk / jellyfish
🪼 a python library for doing approximate and phonetic matching of strings.
☆2,227 · Updated this week
rspeer / python-ftfy
Fixes mojibake and other glitches in Unicode text, after the fact.
☆4,051 · Oct 30, 2024 · Updated last year
henryre / shalo
Shallow baseline models for text in TensorFlow
☆12 · Jul 1, 2017 · Updated 9 years ago
dedupeio / dedupe
A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
☆4,487 · Jul 29, 2025 · Updated last year
engarde-dev / engarde
A library for defensive data analysis.
☆499 · Jan 6, 2020 · Updated 6 years ago
ianozsvald / learning_text_transformer_demo
Demo code for learning_text_transformer
☆25 · Feb 22, 2015 · Updated 11 years ago
LeapBeyond / scrubadub_spacy
Clean personally identifiable information from dirty dirty text using spaCy.
☆41 · Sep 1, 2023 · Updated 2 years ago
datamade / usaddress
a python library for parsing unstructured United States address strings into address components
☆1,634 · Aug 7, 2025 · Updated 11 months ago
snipsco / snips-nlu-metrics
Python package to compute metrics on an NLU intent parsing pipeline
☆13 · Mar 10, 2020 · Updated 6 years ago
ahalterman / multiuser_prodigy
Running Prodigy for a team of annotators
☆52 · Jan 8, 2021 · Updated 5 years ago
seatgeek / fuzzywuzzy
Fuzzy String Matching in Python
☆9,262 · Feb 24, 2023 · Updated 3 years ago
harshnisar / badfish
Badfish - A missing data analysis and wrangling library in Python
☆18 · Oct 24, 2016 · Updated 9 years ago
polyaxon / traceml
Engine for AI/ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
☆534 · Jun 17, 2026 · Updated last month
piskvorky / smart_open
Utils for streaming large files (S3, HDFS, gzip, bz2...)
☆3,453 · Jul 15, 2026 · Updated 2 weeks ago
arrow-py / arrow
🏹 Better dates & times for Python
☆9,050 · Jun 22, 2026 · Updated last month
joke2k / faker
Faker is a Python package that generates fake data for you.
☆19,344 · Updated this week
MaartenGr / PolyFuzz
Fuzzy string matching, grouping, and evaluation.
☆801 · Jul 10, 2025 · Updated last year
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆28 · Apr 8, 2026 · Updated 3 months ago
chartbeat-labs / textacy
NLP, before and after spaCy
☆2,239 · Sep 22, 2023 · Updated 2 years ago
scrapinghub / webpager
Paginating the web
☆37 · Feb 11, 2014 · Updated 12 years ago
HypothesisWorks / hypothesis
The property-based testing library for Python
☆8,826 · Updated this week
wellcometrust / WellcomeML
Retired repository for Machine Learning utils at the Wellcome Trust (now deprecated).
☆31 · Aug 9, 2023 · Updated 2 years ago
datalib / StatsCounter
Python's missing statistical Swiss Army knife
☆15 · Aug 25, 2015 · Updated 10 years ago
kootenpv / natura
Find currencies / money talk in natural text
☆15 · Oct 27, 2021 · Updated 4 years ago
jtushman / dict_digger
Digs into Dicts (lists and tuples)
☆15 · Jun 23, 2015 · Updated 11 years ago
bmabey / pyLDAvis
Python library for interactive topic model visualization. Port of the R LDAvis package.
☆1,852 · Dec 4, 2025 · Updated 7 months ago
mortehu / text-classifier
Creates models to classify documents into categories
☆65 · Sep 30, 2017 · Updated 8 years ago
DerwenAI / pytextrank
Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
☆2,219 · Jun 24, 2026 · Updated last month
ContinuumIO / topik
A Topic Modeling toolbox
☆93 · Apr 26, 2016 · Updated 10 years ago
JunjieHu / amber
Explicit Alignment Objectives for Multilingual Bidirectional Encoders
☆14 · Apr 14, 2021 · Updated 5 years ago
jwkvam / bowtie
Create a dashboard with python!
☆766 · Sep 9, 2019 · Updated 6 years ago
dedupeio / doublemetaphone
Python wrapper for a C++ Double Metaphone
☆15 · Jan 12, 2026 · Updated 6 months ago