Smerity / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56 · Updated 3 years ago
Related projects
Alternatives and complementary repositories for cc-warc-examples
- Mirror of Apache Stanbol (incubating) ☆112 · Updated 8 months ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 · Updated 9 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 8 years ago
- Solr Dictionary Annotator (Microservice for Spark) ☆70 · Updated 4 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 · Updated 6 years ago
- Common web archive utility code. ☆50 · Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆212 · Updated last year
- ☆184 · Updated 6 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆37 · Updated 4 months ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 · Updated 11 months ago
- Elasticsearch Latent Semantic Indexing experimentation ☆33 · Updated 5 years ago
- SKOS Support for Apache Lucene and Solr ☆56 · Updated 3 years ago
- Warcbase is an open-source platform for managing and analyzing web archives ☆162 · Updated 6 years ago
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆32 · Updated last year
- A toolkit that wraps various natural language processing implementations behind a common interface. ☆101 · Updated 7 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆237 · Updated this week
- Dice Solr Plugins from Simon Hughes Dice.com ☆87 · Updated 3 years ago
- Mirror of Apache Lucene + Solr ☆48 · Updated 4 years ago
- Building recommenders with Elastic Graph! ☆37 · Updated 4 years ago
- The WikiBrain Java library enables researchers and developers to incorporate state-of-the-art Wikipedia-based algorithms and technologies… ☆91 · Updated 6 years ago
- Analytic UIMA pipelines using Spark ☆23 · Updated 8 years ago
- Parse wikipedia dumps and index (some) page data to elasticsearch ☆49 · Updated 9 years ago
- SIREn - Semi-Structured Information Retrieval Engine ☆107 · Updated 3 years ago
- A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata ☆28 · Updated 6 years ago
- Integration between Stanford NLP and Apache Stanbol ☆33 · Updated 8 years ago
- ☆47 · Updated 7 years ago
- Using latent Dirichlet allocation (LDA) in Apache Lucene ☆58 · Updated 12 years ago
- Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser ☆13 · Updated 2 years ago
- Fusion demo app searching open-source project data from the Apache Software Foundation ☆42 · Updated 6 years ago
- Search relevance evaluation toolkit ☆73 · Updated 2 years ago