TeamHG-Memex / sitehound-frontend

Site Hound (previously THH) is a Domain Discovery Tool

☆23

Alternatives and similar repositories for sitehound-frontend

Users who are interested in sitehound-frontend are comparing it to the repositories listed below.

TeamHG-Memex / extract-html-diff
Extract the difference between two HTML pages
☆32 Updated 7 years ago
istresearch / traptor
Traptor -- A distributed Twitter feed
☆26 Updated 3 years ago
scrapinghub / exporters
Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
☆40 Updated last year
scrapinghub / aduana
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆55 Updated last year
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆58 Updated last year
TeamHG-Memex / scrapy-dockerhub
[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.
☆11 Updated 10 years ago
TeamHG-Memex / Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆119 Updated last year
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆189 Updated 3 years ago
guillermo-carrasco / social_ids
Get user IDs from social network handles
☆12 Updated 8 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆27 Updated 7 years ago
TeamHG-Memex / undercrawler
A generic crawler
☆78 Updated 7 years ago
scrapinghub / mdr
A Python library to detect and extract listing data from HTML pages.
☆108 Updated 8 years ago
nik0spapp / sdalg
Web page segmentation and noise removal
☆55 Updated last year
Parsely / serpextract
Easy extraction of keywords and engines from search engine results pages (SERPs).
☆92 Updated 2 weeks ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆17 Updated 4 years ago
TeamHG-Memex / autologin-middleware
Scrapy middleware for autologin
☆36 Updated 7 years ago
nasa-jpl-memex / memex-gate
General Architecture for Text Engineering
☆49 Updated 9 years ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆56 Updated last week
openpreserve / pagelyzer
Suite of tools for detecting changes in web pages and their rendering
☆55 Updated last year
asanoja / segmentations
Tools for web page segmentation. In development
☆17 Updated 6 years ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48 Updated 3 years ago
tasdikrahman / spammy
Spam filtering made easy for you
☆144 Updated 6 years ago
tonywangcn / scaleable-crawler-with-docker-cluster
A scalable and efficient crawler with a Docker cluster; crawls a million pages in 2 hours with a single machine
☆97 Updated last year
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 Updated 8 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 Updated 10 years ago
TeamHG-Memex / autologin
A project to attempt to automatically log in to a website given a single seed
☆127 Updated last week
saymedia / seosuite
Automated Search Engine Optimization Testing Tool
☆81 Updated 6 years ago
psolbach / metadoc
Aviation grade news article metadata extraction
☆36 Updated 2 years ago
rmax / scrapy-boilerplate
Small set of utilities to simplify writing Scrapy spiders.
☆49 Updated 10 years ago
mitll / topic-clustering
☆44 Updated 9 years ago