The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆225Dec 22, 2022Updated 3 years ago
Alternatives and similar repositories for commoncrawl-crawler
Users that are interested in commoncrawl-crawler are comparing it to the libraries listed below.
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Common Crawl fork of Apache Nutch☆41Updated this week
- Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p…☆142Jul 7, 2022Updated 3 years ago
- Simple Samza Job Using Confluent Platform☆14Apr 14, 2016Updated 9 years ago
- Run Cassandra inside a Java project without bringing server deps into the client classpath☆32Feb 26, 2019Updated 7 years ago
- Maven version of DistributeCrawler☆10Jun 20, 2022Updated 3 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated 3 weeks ago
- News crawling with StormCrawler - stores content as WARC☆365Mar 31, 2026Updated last week
- SKOS Support for Apache Lucene and Solr☆56May 12, 2021Updated 4 years ago
- Examples and Slides for "Introduction to Spring for Apache Hadoop" at SpringOne2GX 2014☆16Jan 7, 2019Updated 7 years ago
- Role based access control☆14Sep 24, 2021Updated 4 years ago
- Distributed Java crawler with a master/slave control mechanism☆14May 21, 2015Updated 10 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Distributed Realtime Search with Lucene and MongoDB☆60May 14, 2018Updated 7 years ago
- ☆12Dec 3, 2015Updated 10 years ago
- Read-only mirror. Please submit merge requests / issues to https://gitlab.com/libvirt/libvirt-sandbox☆13Aug 22, 2023Updated 2 years ago
- A fork of cascading patterns, but implemented for trident☆71Dec 16, 2023Updated 2 years ago
- Parse OCR result files for pagenos, tables of contents, etc.☆14Nov 30, 2011Updated 14 years ago
- A JMM Cookbook for Java Developers (as opposed to a cookbook for Compiler Writers)☆12Jun 13, 2014Updated 11 years ago
- WikiPBX is an open source PBX web interface for FreeSWITCH. WikiPBX is written in Python and uses the Django web application framework. C…☆10May 30, 2014Updated 11 years ago
- ☆12Jan 4, 2023Updated 3 years ago
- Source for Reactive Architecture: Beyond the Basics online training course☆16Jul 26, 2017Updated 8 years ago
- A project meant to be a reference to get started building indexes for Solr with Hadoop's map-reduce.☆50Nov 9, 2016Updated 9 years ago
- Collects multimedia content shared through social networks.☆19Feb 18, 2015Updated 11 years ago
- Automatic, zero-config web scraping -- written in Java, has no dependency on Java EE or app servers, and the web scraper has a restful/JS…☆156Jun 25, 2017Updated 8 years ago
- Pépin is a web image & video player with features like zoom, pan, comparisons, fullscreen, gapless videos playback, frame-by-frame scrubb…☆12Feb 3, 2017Updated 9 years ago
- Static site generator for Linked Data☆12Apr 4, 2026Updated last week
- A library for financial and time series calculations on Apache Spark☆28Feb 2, 2016Updated 10 years ago
- PredictionIO word2vec engine template (Scala-based parallelized engine)☆12Apr 22, 2015Updated 10 years ago
- Collect and disseminate information on fee-based Open Access publishing in Sweden☆11Updated this week
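- Distributed crawler project carried out by the Gao Ying lab at South China University of Technology; not to be redistributed without permission by anyone outside the lab.☆21Jul 13, 2014Updated 11 years ago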
- Java audio routing and unit generator library.☆13Oct 13, 2020Updated 5 years ago
- The Scholix metadata schema is a set of properties describing a Link Information Package, which carries information about a link between …☆17Mar 14, 2022Updated 4 years ago
- The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.☆17Sep 8, 2019Updated 6 years ago
- Learning Spring 5.0, published by Packt☆10Oct 31, 2022Updated 3 years ago
- CSCI572: Information Retrieval and Web Search Engines☆10Jul 3, 2020Updated 5 years ago
- A distributed generic query layer for Apache Kafka Interactive Queries☆26Nov 8, 2017Updated 8 years ago
- Home of RDF2Go and RDFReactor☆13Jun 9, 2016Updated 9 years ago
- Official list of user agents that are regarded as robots/spiders by COUNTER☆72Apr 22, 2024Updated last year