gfjreg / CommonCrawlLinks

A distributed system for mining common crawl using SQS, AWS-EC2 and S3

☆21

Alternatives and similar repositories for CommonCrawl

Users that are interested in CommonCrawl are comparing it to the libraries listed below

Sorting:

TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36Updated 7 years ago
scrapinghub / mdr
A python library detect and extract listing data from HTML page.
☆108Updated 8 years ago
TeamHG-Memex / extract-html-diff
extract difference between two html pages
☆32Updated 7 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29Updated 8 years ago
commoncrawl / gzipstream
gzipstream allows Python to process multi-part gzip files from a streaming source
☆23Updated 8 years ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆23Updated 4 years ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48Updated 3 years ago
scrapinghub / exporters
Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
☆40Updated last year
scrapinghub / webpager
Paginating the web
☆37Updated 11 years ago
nik0spapp / sdalg
Web page segmentation and noise removal
☆55Updated last year
istresearch / traptor
Traptor -- A distributed Twitter feed
☆26Updated 3 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46Updated 7 years ago
matthewruttley / mozclassify
Algorithms for URL Classification
☆19Updated 10 years ago
TeamHG-Memex / Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆119Updated last year
Parsely / serpextract
Easy extraction of keywords and engines from search engine results pages (SERPs).
☆92Updated 2 weeks ago
larsmans / wordnet-domains-sentiwords
WordNet Domains, WordNet Affect and SentiWords
☆48Updated 9 years ago
mahmoud / wapiti
Wikipedia API wrapper for humans and elk. (en.wikipedia.org/w/api.php, get it?)
☆37Updated 11 years ago
openeventdata / scraper
Scrapes sites. Gets news. Eventually events.
☆85Updated 9 years ago
ClimbsRocks / empythy
Automated NLP sentiment predictions- batteries included, or use your own data
☆18Updated 7 years ago
wordnik / serapis
Serapis is a sentence identifier and modeling pipeline / built for Wordnik
☆24Updated 9 years ago
lethain / extraction
A Python library for extracting titles, images, descriptions and canonical urls from HTML.
☆151Updated 5 years ago
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15Updated 11 years ago
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆189Updated 3 years ago
E3-JSI / newsfeed
A pipeline for crawling of RSS feeds and the associated content. Demo at newsfeed.ijs.si.
☆21Updated 12 years ago
trec-kba / streamcorpus
common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text
☆35Updated 9 years ago
pydepta / pydepta
A python implementation of DEPTA
☆83Updated 8 years ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆17Updated 4 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆27Updated 7 years ago
TeamHG-Memex / scrapy-crawl-once
Scrapy middleware which allows to crawl only new content
☆79Updated 2 years ago
adamfabish / Reduction
Reduction is a python script which automatically summarizes a text by extracting the sentences which are deemed to be most important.
☆54Updated 10 years ago