miso-belica/jusText

Heuristic based boilerplate removal tool

☆819

Alternatives and similar repositories for jusText

Users that are interested in jusText are comparing it to the libraries listed below.

dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,274 · Jul 8, 2025 · Updated last year
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆6,337 · Jul 18, 2026 · Updated last week
weblyzard / inscriptis
A python based HTML to text conversion library, command line client and Web service.
☆345 · Updated this week
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
misja / python-boilerpipe
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
☆542 · Jul 17, 2021 · Updated 5 years ago
jmriebold / BoilerPy3
Python port of Boilerpipe library
☆96 · Aug 20, 2024 · Updated last year
goose3 / goose3
A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
☆912 · Updated this week
buriy / python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
☆2,895 · Jan 26, 2026 · Updated 6 months ago
dalab / web2text
Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18
☆169 · Oct 28, 2021 · Updated 4 years ago
adbar / courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆178 · Updated this week
scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆376 · May 29, 2026 · Updated last month
miso-belica / sumy
Module for automatic summarization of text documents and HTML pages.
☆3,696 · Updated this week
codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
☆15,123 · Updated this week
kohlschutter / boilerpipe
Work in progress transmit from Google Code
☆1,126 · Jan 3, 2018 · Updated 8 years ago
arnav1993k / ImageEnhancement
☆12 · Apr 5, 2019 · Updated 7 years ago
opensanctions / fingerprints
Now included in rigour
☆150 · Nov 24, 2025 · Updated 8 months ago
rodricios / eatiht
An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
☆430 · Jan 16, 2026 · Updated 6 months ago
grangier / python-goose
Html Content / Article Extractor, web scraping lib in Python
☆4,101 · Mar 10, 2026 · Updated 4 months ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,047 · Apr 25, 2023 · Updated 3 years ago
Alir3z4 / html2text
Convert HTML to Markdown-formatted text.
☆2,169 · Oct 28, 2025 · Updated 8 months ago
fhamborg / news-please
news-please - an integrated web crawler and information extractor for news that just works
☆2,472 · Apr 14, 2026 · Updated 3 months ago
ekzhu / datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
☆2,943 · Updated this week
Bollegala / DARep
Cross-domain word representation learning
☆10 · May 23, 2015 · Updated 11 years ago
fnl / syntok
Text tokenization and sentence segmentation (segtok v2)
☆211 · Mar 12, 2022 · Updated 4 years ago
currentslab / extractnet
A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built in metadata extraction all in one pac…
☆299 · May 19, 2025 · Updated last year
chartbeat-labs / textacy
NLP, before and after spaCy
☆2,239 · Sep 22, 2023 · Updated 2 years ago
clips / pattern
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
☆8,857 · Jun 10, 2024 · Updated 2 years ago
dedupeio / dedupe
A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.
☆4,487 · Jul 29, 2025 · Updated 11 months ago
nipunsadvilkar / pySBD
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
☆927 · Aug 20, 2024 · Updated last year
plasticityai / magnitude
A fast, efficient universal vector embedding utility package.
☆1,666 · Aug 3, 2023 · Updated 2 years ago
datalib / libextract
Extract data from websites using basic statistical magic
☆504 · Oct 2, 2020 · Updated 5 years ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago
Priberam / exconsumm
Extractive and Compressive Neural Summarization Based on Summary State Representations (NAACL 2019)
☆16 · May 12, 2020 · Updated 6 years ago
alea-institute / nupunkt
Next-generation Punkt sentence boundary detection with zero dependencies
☆32 · Nov 18, 2025 · Updated 8 months ago
fnl / segtok
Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe…
☆171 · Dec 15, 2021 · Updated 4 years ago
flairNLP / flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
☆14,382 · Oct 27, 2025 · Updated 8 months ago
rspeer / python-ftfy
Fixes mojibake and other glitches in Unicode text, after the fact.
☆4,051 · Oct 30, 2024 · Updated last year
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆967 · Apr 1, 2026 · Updated 3 months ago