The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆225Dec 22, 2022Updated 3 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users who are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- Simple Samza Job Using Confluent Platform☆14Apr 14, 2016Updated 10 years ago
- Run Cassandra inside a Java project without bringing server deps into the client classpath☆32Feb 26, 2019Updated 7 years ago
- Maven version of DistributeCrawler☆10Jun 20, 2022Updated 3 years ago
- Blog crawler for the blogforever project.☆23Jan 31, 2014Updated 12 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated last month
- A set of reusable Java components that implement functionality common to any web crawler☆256Updated this week
- News crawling with StormCrawler - stores content as WARC☆366Apr 21, 2026Updated last week
- SKOS Support for Apache Lucene and Solr☆56May 12, 2021Updated 4 years ago
- Graph algorithms implemented in GraphX and Spark styles☆15Apr 26, 2015Updated 11 years ago
- Java distributed crawler with a master/slave control mechanism☆14May 21, 2015Updated 10 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Distributed Realtime Search with Lucene and MongoDB☆61May 14, 2018Updated 7 years ago
- ☆12Dec 3, 2015Updated 10 years ago
- A fork of cascading patterns, but implemented for trident☆71Dec 16, 2023Updated 2 years ago
- Camus Compressor merges files created by Camus and saves them in a compressed format.☆13Mar 20, 2023Updated 3 years ago
- A JMM Cookbook for Java Developers (as opposed to a cookbook for Compiler Writers)☆12Jun 13, 2014Updated 11 years ago
- WikiPBX is an open source PBX web interface for FreeSWITCH. WikiPBX is written in python and uses the Django web application framework. C…☆10May 30, 2014Updated 11 years ago
- Web archiving utility library☆11Mar 11, 2026Updated last month
- Example Node.js application demonstrating Cucumber.js usages☆42Jun 3, 2013Updated 12 years ago
- bigram / trigram analysis of wikipedia; mainly mutual info☆22Mar 6, 2012Updated 14 years ago
- Source for Reactive Architecture: Beyond the Basics online training course☆16Jul 26, 2017Updated 8 years ago
- Collects multimedia content shared through social networks.☆19Feb 18, 2015Updated 11 years ago
- 🎙️ The easiest way to explore and manipulate your CI Pipelines in all of your FluentCI projects.☆20Aug 30, 2024Updated last year
- Static site generator for Linked Data☆12Apr 12, 2026Updated 2 weeks ago
- A library for financial and time series calculations on Apache Spark☆28Feb 2, 2016Updated 10 years ago
- Tool for visualizing Apache Oozie pipelines☆13Feb 15, 2016Updated 10 years ago
- PredictionIO word2vec engine template (Scala-based parallelized engine)☆12Apr 22, 2015Updated 11 years ago
- Repository containing scripts for importing OpenAlex snapshots into BigQuery☆15Mar 6, 2026Updated last month
- Collect and disseminate information on fee-based Open Access publishing in Sweden☆11Apr 17, 2026Updated 2 weeks ago
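- MSAM is an API documentation manager: interface-management software that generates Swagger.json-compatible interface files. This project is no longer maintained; please use the upgraded version.☆21May 23, 2023Updated 2 years ago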
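- A distributed crawler project carried out by the Gao Ying lab at South China University of Technology. Must not be redistributed by anyone other than lab members.☆21Jul 13, 2014Updated 11 years ago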
- DC/OS community content☆11May 16, 2018Updated 7 years ago
- Java audio routing and unit generator library.☆13Oct 13, 2020Updated 5 years ago
- The Scholix metadata schema is a set of properties describing a Link Information Package, which carries information about a link between …☆17Mar 14, 2022Updated 4 years ago
- Collection of generic Apache Flink operators☆17May 15, 2017Updated 8 years ago
- A java library for creating standalone, portable, schema-full object databases supporting pagination and faceted search, and offering str…☆17Mar 17, 2017Updated 9 years ago
- DATS JSON schemas☆13Dec 21, 2022Updated 3 years ago
- Notes and cheat sheets on various topics☆25Dec 22, 2022Updated 3 years ago
- A distributed generic query layer for Apache Kafka Interactive Queries☆26Nov 8, 2017Updated 8 years ago