devvid / python-common-crawl-amazon-example

Exploring Common-Crawl using Python and DynamoDB

☆33

Alternatives and similar repositories for python-common-crawl-amazon-example

Users that are interested in python-common-crawl-amazon-example are comparing it to the libraries listed below

CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆56 Updated last year
dcondrey / scrapy-spiders
Collection of python scripts I have created to crawl various websites, mostly for lead generation projects to match keywords and collect …
☆131 Updated last year
hellpanderrr / linkedin-pdf-parsing
Parsing resumes in PDF format from LinkedIn
☆68 Updated 8 years ago
TeamHG-Memex / deep-deep
Adaptive crawler which uses Reinforcement Learning methods
☆169 Updated 7 years ago
TeamHG-Memex / autologin
A project to attempt to automatically log in to a website given a single seed
☆124 Updated 2 years ago
HyperionGray / starbelly
Streaming web crawler with WebSocket API
☆44 Updated last year
jxltom / scrapymon
Simple Web UI for Scrapy spider management via Scrapyd
☆51 Updated 6 years ago
gfjreg / CommonCrawl
A distributed system for mining common crawl using SQS, AWS-EC2 and S3
☆21 Updated 10 years ago
cldellow / real-estate-prices-cc
Source real estate prices from the Common Crawl.
☆27 Updated 6 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 Updated last year
DusanMadar / ScrapeMeAgain
Yet another Python web scraping application
☆30 Updated 5 years ago
TeamHG-Memex / scrapy-crawl-once
Scrapy middleware that allows crawling only new content
☆79 Updated 2 years ago
NikolaiT / scrapeulous
Cloud crawler functions for scrapeulous
☆45 Updated 4 years ago
scrapinghub / scrapy-autoextract
Zyte Automatic Extraction integration for Scrapy
☆56 Updated 3 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆27 Updated 7 years ago
iamtodor / angel.co-companies-list-scraping
☆62 Updated last year
TeamHG-Memex / domain-discovery-crawler
Broad crawler for domain discovery
☆19 Updated 7 years ago
scrapinghub / scaws
Extensions for using Scrapy on Amazon AWS
☆32 Updated 12 years ago
Parsely / serpextract
Easy extraction of keywords and engines from search engine results pages (SERPs).
☆90 Updated 3 years ago
kororo / excelcy
Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG.
☆105 Updated 2 years ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆23 Updated 4 years ago
scrapinghub / page_clustering
A simple algorithm for clustering web pages, suitable for crawlers
☆34 Updated 8 years ago
anuragrana / scraping_tweets_celery_rabbitmq_docker_cluster
Scraping tweets quickly using celery, RabbitMQ and Docker cluster
☆48 Updated 2 years ago
invana / crawlerflow
Web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.
☆34 Updated last year
commoncrawl / cc-mrjob
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
☆165 Updated 3 years ago
tpeng / googlesearch
Scrape Google search results with Scrapy.
☆98 Updated 5 years ago
scrapinghub / scrapy-frontera
More flexible and featured Frontera scheduler for Scrapy
☆37 Updated 6 months ago
apilayer / scrapestack
Real-Time Proxy & Web Scraping API
☆24 Updated 5 years ago
openeventdata / scraper
Scrapes sites. Gets news. Eventually events.
☆87 Updated 9 years ago
BernhardWenzel / google-taxonomy-matcher
Matches a category of Google's Taxonomy to a product that is described in any kind of text data
☆62 Updated 6 years ago