The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆223Dec 22, 2022Updated 3 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below
- Common Crawl fork of Apache Nutch☆40Updated this week
- FoGFaaS: Add serverless computing (FaaS) to iFogSim☆22Mar 30, 2025Updated 11 months ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Graph algorithms implemented in GraphX and Spark styles☆15Apr 26, 2015Updated 10 years ago
- Simple Samza Job Using Confluent Platform☆14Apr 14, 2016Updated 9 years ago
- Run Cassandra inside a Java project without bringing server deps into the client classpath☆32Feb 26, 2019Updated 7 years ago
- bigram / trigram analysis of wikipedia; mainly mutual info☆22Mar 6, 2012Updated 13 years ago
- Maven version of DistributeCrawler☆10Jun 20, 2022Updated 3 years ago
- Blog crawler for the blogforever project.☆23Jan 31, 2014Updated 12 years ago
- Distributed Realtime Search with Lucene and MongoDB☆60May 14, 2018Updated 7 years ago
- A distributed key-value storage system built over RocksDB☆15Dec 30, 2016Updated 9 years ago
- Source for Reactive Architecture: Beyond the Basics online training course☆16Jul 26, 2017Updated 8 years ago
- Camus Compressor merges files created by Camus and saves them in a compressed format.☆13Mar 20, 2023Updated 2 years ago
- ☆14Mar 29, 2016Updated 9 years ago
- A set of reusable Java components that implement functionality common to any web crawler☆254Updated this week
- Java distributed crawler with a master/slave control mechanism☆14May 21, 2015Updated 10 years ago
- A fork of cascading patterns, but implemented for trident☆71Dec 16, 2023Updated 2 years ago
- Collection of generic Apache Flink operators☆17May 15, 2017Updated 8 years ago
- MILBoost and other boosting algorithms, compatible with scikit-learn☆14Nov 15, 2024Updated last year
- Java EE Cache Filter☆36Mar 15, 2019Updated 6 years ago
- Json Wikipedia, contains code to convert the Wikipedia xml dump into a json dump. Questions? https://gitter.im/idio-opensource/Lobby☆17May 20, 2022Updated 3 years ago
- ☆20Apr 29, 2016Updated 9 years ago
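- Distributed crawler project of the Gao Ying lab at South China University of Technology; not to be distributed outside the lab without permission.☆21Jul 13, 2014Updated 11 years ago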
- Combine Flume, Spark Streaming, and Redis for real-time computing☆22Oct 20, 2014Updated 11 years ago
- Experiments in Streaming☆60Aug 27, 2016Updated 9 years ago
- ☆25Updated this week
- All solutions that we have for competitive Programming websites.☆21Feb 20, 2017Updated 9 years ago
- Code snippets accumulated at work and in various Scala training sessions☆55May 4, 2019Updated 6 years ago
- Project accompanying Akka Notes - Part 1 (Fire and forget Messaging)☆26Sep 30, 2020Updated 5 years ago
- Real-time aggregation of metrics from large distributed systems.☆107Nov 6, 2018Updated 7 years ago
- Import Kafka data into HBase using Spark Streaming☆25Apr 14, 2016Updated 9 years ago
- network visualization of Reddit discussions☆104Aug 8, 2013Updated 12 years ago
- An implementation of Facebook's BigPipe for the Java web platform☆24Jul 2, 2015Updated 10 years ago
- Apache Spark Scala utility to track data records during application execution☆11Jun 12, 2023Updated 2 years ago
- Code for COVID19 CT labeling. Submillimetric CT dataset provided as well.☆13Feb 16, 2021Updated 5 years ago
- Yet another Node.js MVC web framework, based on koa.js (Koa2)☆11May 12, 2017Updated 8 years ago
- A scalable, mature and versatile web crawler based on Apache Storm☆966Updated this week
- ☆29May 15, 2015Updated 10 years ago
- A library for financial and time series calculations on Apache Spark☆28Feb 2, 2016Updated 10 years ago