datalib/libextract

Extract data from websites using basic statistical magic

☆506

Alternatives and similar repositories for libextract

Users that are interested in libextract are comparing it to the libraries listed below.

rodricios / eatiht
An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
☆430 · Jan 16, 2026 · Updated 6 months ago
datalib / StatsCounter
Python's missing statistical Swiss Army knife
☆15 · Aug 25, 2015 · Updated 10 years ago
datalib / proclib
pythonic processes
☆12 · Jun 12, 2015 · Updated 11 years ago
rodricios / crawl-to-the-future
An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors
☆35 · Mar 19, 2015 · Updated 11 years ago
dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,274 · Jul 8, 2025 · Updated last year
ChrisBeaumont / soupy
Easier wrangling of web data.
☆260 · Mar 5, 2018 · Updated 8 years ago
misja / python-boilerpipe
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
☆542 · Jul 17, 2021 · Updated 5 years ago
AlexMathew / scrapple
A framework for creating semi-automatic web content extractors
☆503 · Jan 16, 2026 · Updated 6 months ago
kohlschutter / boilerpipe
Work in progress transmit from Google Code
☆1,127 · Jan 3, 2018 · Updated 8 years ago
grangier / python-goose
Html Content / Article Extractor, web scrapping lib in Python
☆4,100 · Mar 10, 2026 · Updated 4 months ago
scrapinghub / mdr
A python library detect and extract listing data from HTML page.
☆110 · May 5, 2017 · Updated 9 years ago
rodricios / autocomplete
Autocomplete - light-weight, next-word prediction Python utility
☆450 · Jan 16, 2026 · Updated 6 months ago
scrapy / scrapely
A pure-python HTML screen-scraping library
☆1,884 · Apr 4, 2022 · Updated 4 years ago
pvdlg / boilerpipe
Repackaging of Boilerpipe published on Maven Central Repository.
☆54 · Dec 17, 2023 · Updated 2 years ago
willf / segment
A tool to segment text based on frequencies and the Viterbi algorithm "#TheBoyWhoLived" => ['#', 'The', 'Boy', 'Who', 'Lived']
☆79 · Apr 23, 2016 · Updated 10 years ago
ushahidi / Chambua
Chambua is an open-source semantic tagging application that analyses text and extracts names of people, places (& geocodes them), organis…
☆33 · Nov 12, 2021 · Updated 4 years ago
walkr / oi
python library for writing long running processes with a cli interface
☆228 · Mar 11, 2016 · Updated 10 years ago
orf / cyborg
Python web scraping framework
☆310 · Nov 12, 2017 · Updated 8 years ago
michaelhelmick / lassie
Web Content Retrieval for Humans™
☆629 · Jul 30, 2022 · Updated 3 years ago
datamade / parserator
A toolkit for making domain-specific probabilistic parsers
☆811 · Sep 26, 2024 · Updated last year
ssteuteville / scrapyz
"Scrape Easy" - an extension of the Scrapy framework.
☆185 · Aug 13, 2016 · Updated 9 years ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 · Oct 26, 2017 · Updated 8 years ago
buriy / python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
☆2,894 · Jan 26, 2026 · Updated 5 months ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
ericchiang / cloudy-tweets
Machine Learning solution for Kaggle.com's "Partly Sunny with a Chance of Hashtags"
☆27 · Dec 6, 2013 · Updated 12 years ago
lorien / grab
Web Scraping Framework
☆2,460 · Sep 19, 2025 · Updated 9 months ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆967 · Apr 1, 2026 · Updated 3 months ago
jeanphix / Ghost.py
Webkit based scriptable web browser for python.
☆2,755 · Feb 24, 2024 · Updated 2 years ago
harshavardhana / boilerpipy
Readability/Boilerpipe extraction in Python
☆55 · May 6, 2016 · Updated 10 years ago
tomazk / Text-Extraction-Evaluation
Framework for evaluating text extraction algorithms implemented as web services
☆42 · Jun 30, 2012 · Updated 14 years ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆818 · Feb 25, 2025 · Updated last year
codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
☆15,100 · Jul 8, 2026 · Updated last week
machinalis / featureforge
A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
☆389 · Dec 26, 2017 · Updated 8 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
davecarpie / scli
A selectable, scrollable list interface for terminal applications built using curses
☆10 · Jun 30, 2015 · Updated 11 years ago
kykamath / streaming_lsh
A project for clustering text streams using locality-sensitive hashing (LSH) in Python
☆26 · Sep 23, 2011 · Updated 14 years ago
yhat / rodeo
A data science IDE for Python
☆3,893 · Apr 16, 2018 · Updated 8 years ago
IndicoDataSolutions / Passage
A little library for text analysis with RNNs.
☆536 · Sep 10, 2018 · Updated 7 years ago
frnsys / broca
rapid nlp prototyping
☆71 · Sep 30, 2022 · Updated 3 years ago