lethain/extraction

A Python library for extracting titles, images, descriptions, and canonical URLs from HTML.

☆152

Alternatives and similar repositories for extraction

Users who are interested in extraction are comparing it to the libraries listed below.

kmike / morphine
[experiment] CRF-based disambiguation engine for pymorphy2
☆10 · May 9, 2016 · Updated 10 years ago
alexeygrigorev / cikm-cup-2016-cross-device
Solution for the Cross-Device linking challenge from CIKM CUP 2016
☆24 · Dec 6, 2016 · Updated 9 years ago
alno / batch-learn
☆49 · Apr 17, 2018 · Updated 8 years ago
adamfabish / Reduction
Reduction is a Python script that automatically summarizes a text by extracting the sentences deemed most important.
☆54 · Mar 8, 2015 · Updated 11 years ago
Peeragogy / peeragogy-handbook
Download the free ebook or buy a paper copy
☆19 · Mar 28, 2020 · Updated 6 years ago
andreiolariu / deelearning-hackerearth
Code for the Deep Learning HackerEarth Challenge #1
☆12 · Nov 1, 2017 · Updated 8 years ago
Webhose / article-date-extractor
Automatically extracts and normalizes the publication date of an online article or blog post
☆120 · Aug 10, 2023 · Updated 2 years ago
Parsely / python-nlp-slides
Slides to learn a little natural language processing (NLP) with Python. Written in reST with S5/Docutils.
☆29 · Oct 27, 2012 · Updated 13 years ago
iaramer / dobbi
An open-source NLP library: fast text cleaning and preprocessing
☆23 · Nov 9, 2021 · Updated 4 years ago
Shmuma / nlp
Various NLP-related stuff
☆10 · Apr 13, 2017 · Updated 9 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
paulperry / kaggle
Kaggle competition results
☆20 · Jan 4, 2019 · Updated 7 years ago
divio / django-login-as
Log in as any user in Django (if you're a superuser)
☆26 · May 22, 2014 · Updated 12 years ago
HearstCorp / django-jsonbfield
PostgreSQL JSONB field support in Django
☆18 · Nov 9, 2016 · Updated 9 years ago
clearstorydata-cookbooks / apache_spark
A cookbook for installing and configuring Apache Spark
☆11 · Sep 6, 2018 · Updated 7 years ago
ArkadiyD / CythonXGB
Fast one-sample prediction for XGBoost for usage with Cython
☆70 · Jul 21, 2017 · Updated 8 years ago
lfsimoes / mars_express__esn
Tackling ESA's Mars Express Power Challenge with Echo State Networks
☆11 · Jun 28, 2018 · Updated 8 years ago
pushcx / barnacl.es
Rails code running the Barnacles link aggregation site
☆16 · Nov 3, 2019 · Updated 6 years ago
dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,274 · Jul 8, 2025 · Updated last year
semanticize / semanticizest
Standalone Semanticizer
☆32 · Mar 4, 2015 · Updated 11 years ago
kashyap32 / Sign-Recognition
Traffic Sign Recognition with Keras.
☆19 · Jun 23, 2017 · Updated 9 years ago
escaped / django-exiffield
django-exiffield extracts EXIF data using exiftool
☆14 · Sep 7, 2021 · Updated 4 years ago
5vision / blackbox
☆12 · Jun 5, 2016 · Updated 10 years ago
miha-skalic / convolutedPredictions_Cdiscount
2nd place solution to Kaggle's Cdiscount image classification challenge.
☆18 · Mar 7, 2018 · Updated 8 years ago
labnol / apps-script-samples
Apps Script samples for G Suite products.
☆16 · Nov 6, 2020 · Updated 5 years ago
evansd / django-envsettings
One-stop shop for configuring 12-factor Django apps
☆10 · Aug 13, 2015 · Updated 10 years ago
NISH1001 / machine-learning-into-the-void
Let's learn ML and dive into the void
☆17 · Nov 2, 2020 · Updated 5 years ago
lopuhin / python-adagram
AdaGram (adaptive skip-gram) for Python
☆74 · May 9, 2017 · Updated 9 years ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆63 · Apr 8, 2026 · Updated 3 months ago
s2krish / django-restify
Turn your Django project into RESTful APIs in a minute.
☆17 · Dec 8, 2015 · Updated 10 years ago
cortwave / cdiscount-kaggle
https://www.kaggle.com/c/cdiscount-image-classification-challenge
☆19 · Dec 28, 2017 · Updated 8 years ago
ianramzy / article-summary-deep-learning
📖 Using deep learning and scraping to analyze/summarize articles! Just drop in any URL!
☆19 · Dec 8, 2022 · Updated 3 years ago
javalurker / sae-flask-blog
A lightweight blog application built with Flask that runs on SAE Python
☆20 · Aug 23, 2012 · Updated 13 years ago
sonnylaskar / Competitions
Competition repository
☆21 · Oct 8, 2019 · Updated 6 years ago
bhupinders / DS-Competitions
Machine Learning Competitions
☆15 · Mar 27, 2017 · Updated 9 years ago
priyanka-kasture / Handwritten-Digit-Recognizer
Handwritten Digit Recognition using Softmax Regression in Python
☆13 · Sep 5, 2018 · Updated 7 years ago
lucianoratamero / django_apistar
Django app to integrate API Star's routes and views into Django's ecosystem.
☆23 · Sep 18, 2018 · Updated 7 years ago
datalib / libextract
Extract data from websites using basic statistical magic
☆506 · Oct 2, 2020 · Updated 5 years ago
ledil / django-orphaned
Delete all orphaned files created by Django (FileField)
☆39 · Sep 16, 2022 · Updated 3 years ago