scrapinghub / aduana

Frontera backend that guides a crawl using PageRank, HITS, or other ranking algorithms based on the link structure of the web graph, even for very large crawls (one billion pages).

☆55

Alternatives and similar repositories for aduana

Users who are interested in aduana compare it to the libraries listed below.

scrapinghub / aile
Automatic Item List Extraction
☆87 Updated 9 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 Updated 8 years ago
Parsely / schemato
Modularly extensible semantic metadata validator
☆84 Updated 9 years ago
TeamHG-Memex / scrapy-dockerhub
[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.
☆11 Updated 10 years ago
pydepta / pydepta
A Python implementation of DEPTA
☆83 Updated 8 years ago
piskvorky / gensim-simserver
[NO LONGER MAINTAINED AS OPEN SOURCE - USE SCALETEXT.COM INSTEAD]
☆107 Updated 12 years ago
scrapinghub / mdr
A Python library to detect and extract listing data from HTML pages.
☆108 Updated 8 years ago
scrapinghub / webpager
Paginating the web
☆37 Updated 11 years ago
scrapinghub / kafka-scanner
High Level Kafka Scanner
☆19 Updated 8 years ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆23 Updated 4 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 Updated last year
rmax / scrapy-boilerplate
Small set of utilities to simplify writing Scrapy spiders.
☆49 Updated 10 years ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48 Updated 3 years ago
TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 Updated 7 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆27 Updated 7 years ago
Parsely / serpextract
Easy extraction of keywords and engines from search engine results pages (SERPs).
☆92 Updated last month
gtoonstra / remap
MapReduce platform in Python
☆34 Updated 10 years ago
Parsely / probably
Probabilistic Data Structures in Python (originally presented at PyData 2013)
☆55 Updated 3 years ago
commonsearch / gumbocy
Python binding for gumbo-parser using Cython
☆14 Updated 9 years ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 Updated 8 years ago
scrapinghub / flatson
Tool to flatten stream of JSON-like objects, configured via schema
☆33 Updated 6 years ago
scrapinghub / page_clustering
A simple algorithm for clustering web pages, suitable for crawlers
☆34 Updated 8 years ago
TeamHG-Memex / scrapy-crawl-once
Scrapy middleware that allows crawling only new content
☆79 Updated 3 years ago
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆191 Updated 3 years ago
vu3jej / scrapy-corenlp
☆59 Updated 4 years ago
TeamHG-Memex / extract-html-diff
Extract the difference between two HTML pages
☆32 Updated 7 years ago
TeamHG-Memex / autopager
Detect and classify pagination links
☆103 Updated last month
nik0spapp / sdalg
Web page segmentation and noise removal
☆55 Updated last year
ContinuumIO / topik
A Topic Modeling toolbox
☆92 Updated 9 years ago
scrapy-plugins / scrapy-streaming
☆18 Updated 9 years ago