scrapinghub/aile

Automatic Item List Extraction

☆85

Alternatives and similar repositories for aile

Users that are interested in aile are comparing it to the libraries listed below.

pydepta / pydepta
A python implementation of DEPTA
☆84 · Jan 14, 2017 · Updated 9 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
scrapinghub / page_clustering
A simple algorithm for clustering web pages, suitable for crawlers
☆33 · Mar 6, 2017 · Updated 9 years ago
raidikalu / raidikalu
Lists raids and such
☆16 · Nov 2, 2022 · Updated 3 years ago
scrapinghub / autopager
Detect and classify pagination links
☆15 · Sep 9, 2020 · Updated 5 years ago
scrapy / scrapely
A pure-python HTML screen-scraping library
☆1,884 · Apr 4, 2022 · Updated 4 years ago
scrapy-plugins / scrapy-monkeylearn
A Scrapy pipeline to categorize items using MonkeyLearn
☆38 · Apr 28, 2017 · Updated 9 years ago
scrapinghub / aduana
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆54 · May 21, 2024 · Updated 2 years ago
scrapinghub / kafka-scanner
High Level Kafka Scanner
☆19 · Sep 29, 2017 · Updated 8 years ago
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
rmax / databrewer
The missing datasets manager. Like Homebrew but for datasets. CLI tool to search and discover datasets!
☆41 · May 29, 2017 · Updated 9 years ago
aGHz / structominer
Data scraping for a more civilized age
☆17 · Jun 12, 2014 · Updated 12 years ago
seagatesoft / webdext
Intelligent Web Data Extractor
☆74 · Dec 5, 2022 · Updated 3 years ago
TeamHG-Memex / extract-html-diff
extract difference between two html pages
☆33 · Apr 8, 2026 · Updated 3 months ago
commonsearch / gumbocy
Python binding for gumbo-parser using Cython
☆14 · Aug 16, 2016 · Updated 9 years ago
bkj / wit
Algorithms for "schema matching"
☆26 · Jul 6, 2016 · Updated 10 years ago
ArturGaspar / scrapy-qtwebkit
☆13 · Dec 4, 2019 · Updated 6 years ago
stummjr / scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
☆18 · Apr 11, 2020 · Updated 6 years ago
TeamHG-Memex / html-text
Extract text from HTML
☆135 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / skinfer
Skinfer is a tool for inferring and merging JSON schemas
☆141 · Apr 24, 2024 · Updated 2 years ago
rmax / scrapydo
Crochet-based blocking API for Scrapy.
☆47 · Feb 24, 2017 · Updated 9 years ago
TeamHG-Memex / undercrawler
A generic crawler
☆81 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 · Oct 26, 2017 · Updated 8 years ago
scrapinghub / python-cld2
Python bindings for CLD2.
☆17 · Aug 9, 2018 · Updated 7 years ago
commonsearch / urlparse4
Faster replacement for Python's urlparse module
☆46 · Apr 13, 2026 · Updated 3 months ago
TeamHG-Memex / autopager
Detect and classify pagination links
☆107 · Apr 8, 2026 · Updated 3 months ago
seagatesoft / sde
Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm…
☆50 · Jun 9, 2012 · Updated 14 years ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆65 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆966 · Apr 1, 2026 · Updated 3 months ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48 · Dec 17, 2021 · Updated 4 years ago
scrapy-plugins / scrapy-streaming
☆19 · Oct 12, 2016 · Updated 9 years ago
scrapinghub / autologin
A project to attempt to automatically login to a website given a single seed
☆11 · Jun 17, 2024 · Updated 2 years ago
EducationalTestingService / match
Match tokenized words and phrases within the original, untokenized, often messy, text.
☆19 · Apr 11, 2023 · Updated 3 years ago
nudge / schema
A Python implementation of SCHEMA - An Algorithm for Automated Product Taxonomy Mapping in E-commerce.
☆16 · Feb 3, 2015 · Updated 11 years ago
scrapinghub / testspiders
Useful test spiders for Scrapy
☆184 · Jan 20, 2020 · Updated 6 years ago
fukamachi / yapool
A Common Lisp command-line tool for executing shell commands via SSH.
☆12 · Jul 14, 2015 · Updated 11 years ago