commoncrawl/commoncrawl-crawler

commoncrawl / commoncrawl-crawler

The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)

☆226

Alternatives and similar repositories for commoncrawl-crawler

Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.

commoncrawl / commoncrawl-examples
A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)
☆66 · Aug 5, 2016 · Updated 9 years ago
commoncrawl / commoncrawl
Common Crawl support library to access 2008-2012 crawl archives (ARC files)
☆508 · Nov 29, 2017 · Updated 8 years ago
Aloisius / nutch
CommonCrawl Test version of Nutch
☆16 · Jul 10, 2014 · Updated 12 years ago
trivio / common_crawl_index
Index URLs in Common Crawl
☆197 · Sep 19, 2017 · Updated 8 years ago
DigitalPebble / behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
☆282 · Apr 25, 2018 · Updated 8 years ago
matpalm / common-crawl
playing around with the common crawl dataset
☆70 · Aug 18, 2012 · Updated 13 years ago
nathanmarz / trident-kafka
NOTE: This project has been moved into storm-kafka in storm-contrib
☆15 · Nov 2, 2012 · Updated 13 years ago
quintona / storm-pattern
A fork of cascading patterns, but implemented for trident
☆71 · Dec 16, 2023 · Updated 2 years ago
imotov / elasticsearch-facet-script
Fully Scriptable Facets for ElasticSearch
☆50 · Aug 8, 2013 · Updated 12 years ago
OpenDDRdotORG / OpenDDR-Java
Java Implementation of OpenDDR-Simple-API
☆43 · Mar 28, 2014 · Updated 12 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 · Feb 18, 2015 · Updated 11 years ago
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆259 · Updated this week
YahooArchive / samoa
SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams.
☆427 · Mar 28, 2016 · Updated 10 years ago
hdkmraf / JSONtoNeo4j
Create neo4j graphs from JSON files
☆15 · Aug 19, 2012 · Updated 13 years ago
matpalm / collocations
bigram / trigram analysis of wikipedia; mainly mutual info
☆22 · Mar 6, 2012 · Updated 14 years ago
behas / lucene-skos
SKOS Support for Apache Lucene and Solr
☆55 · May 12, 2021 · Updated 5 years ago
play-co / hermes
Clojure wrapper for Titan
☆38 · Feb 17, 2013 · Updated 13 years ago
allegro / camus-compressor
Camus Compressor merges files created by Camus and saves them in a compressed format.
☆13 · Mar 20, 2023 · Updated 3 years ago
Cascading / CoPA
Cascading plus City of Palo Alto open data
☆29 · Mar 3, 2013 · Updated 13 years ago
theduderog / hello-samza-confluent
Simple Samza Job Using Confluent Platform
☆14 · Apr 14, 2016 · Updated 10 years ago
datasalt / pangool
Tuple MapReduce for Hadoop: Hadoop API made easy
☆57 · Jun 27, 2022 · Updated 4 years ago
nathanmarz / storm-deploy
One click deploy for Storm clusters on AWS
☆513 · Jul 21, 2015 · Updated 11 years ago
scrapy / slybot
☆224 · Apr 27, 2015 · Updated 11 years ago
nathanmarz / storm
Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
☆8,770 · Aug 16, 2017 · Updated 8 years ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
twitter / elephant-bird
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
☆1,134 · Apr 10, 2023 · Updated 3 years ago
bixo / bixo
Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p…
☆143 · Jul 7, 2022 · Updated 4 years ago
bhavishya235 / Web-Classification
This project deals with hierarchical classification of web pages based on dmoz dataset.
☆14 · Apr 10, 2014 · Updated 12 years ago
lintool / Cloud9
Cloud9 is a Hadoop toolkit for working with big data
☆237 · Dec 15, 2015 · Updated 10 years ago
gsh199449 / DistributedCrawler
Maven version of DistributeCrawler
☆10 · Jun 20, 2022 · Updated 4 years ago
datawrangling / trendingtopics
Rails app for tracking trends in server logs - powered by the Cloudera Hadoop Distribution on EC2
☆359 · Aug 1, 2011 · Updated 14 years ago
ssalevan / cc-helloworld
CommonCrawl Hello World example
☆33 · Jun 25, 2014 · Updated 12 years ago
rossf7 / wikireverse
Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.
☆38 · Aug 12, 2018 · Updated 7 years ago
larsmans / lucene-stanford-lemmatizer
A library that adds some NLP capabilities to the Lucene search engine
☆50 · Jul 16, 2013 · Updated 13 years ago
medcl / elasticsearch-carrot2
An Elasticsearch plugin integrated with Carrot2 that clusters your search results into topics
☆47 · Jun 3, 2013 · Updated 13 years ago
LinkedInAttic / camus
LinkedIn's previous generation Kafka to HDFS pipeline.
☆879 · Aug 27, 2020 · Updated 5 years ago
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated last month
h2oai / h2o-2
Please visit https://github.com/h2oai/h2o-3 for latest H2O
☆2,254 · Oct 24, 2024 · Updated last year
ning / meteo
Realtime Analytics
☆41 · Mar 27, 2012 · Updated 14 years ago