Smerity / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56Updated 3 years ago
Alternatives and similar repositories for cc-warc-examples:
Users who are interested in cc-warc-examples are comparing it to the libraries listed below:
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆37Updated last month
- Mirror of Apache Stanbol (incubating)☆112Updated 11 months ago
- Common web archive utility code.☆52Updated last month
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.☆281Updated 6 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)☆212Updated 2 years ago
- ☆48Updated 7 years ago
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Updated 7 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop☆35Updated 10 years ago
- RDF store on a cloud-based architecture (previously on https://code.google.com/p/cumulusrdf)☆31Updated 8 years ago
- A toolkit that wraps various natural language processing implementations behind a common interface.☆101Updated 7 years ago
- Warcbase is an open-source platform for managing and analyzing web archives☆162Updated 7 years ago
- General Architecture for Text Engineering☆46Updated 8 years ago
- Fusion demo app searching open-source project data from the Apache Software Foundation☆42Updated 6 years ago
- A text tagger based on Lucene / Solr, using FST technology☆176Updated last year
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition☆136Updated 8 years ago
- Solr Dictionary Annotator (Microservice for Spark)☆71Updated 4 years ago
- ☆184Updated 6 years ago
- SKOS Support for Apache Lucene and Solr☆56Updated 3 years ago
- The WikiBrain Java library enables researchers and developers to incorporate state-of-the-art Wikipedia-based algorithms and technologies…☆91Updated 6 years ago
- SIREn - Semi-Structured Information Retrieval Engine☆107Updated 3 years ago
- ☆28Updated 8 years ago
- An open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)☆54Updated 7 years ago
- A set of reusable Java components that implement functionality common to any web crawler☆240Updated last month
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.☆33Updated last year
- SKOS analysis for Elasticsearch☆54Updated 8 years ago
- Dice.com tutorial on using black box optimization algorithms to do relevancy tuning on your Solr Search Engine Configuration from Simon H…☆28Updated 5 years ago
- A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata☆28Updated 6 years ago
- An RDF plugin for Solr☆114Updated this week
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Updated 6 years ago
- Elasticsearch Latent Semantic Indexing experimentation☆33Updated 5 years ago