Smerity / cc-warc-examplesLinks

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

☆57

Alternatives and similar repositories for cc-warc-examples

Users that are interested in cc-warc-examples are comparing it to the libraries listed below

Sorting:

iipc / webarchive-commons
Common web archive utility code.
☆56Updated 2 months ago
DigitalPebble / behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
☆283Updated 7 years ago
apache / stanbol
Mirror of Apache Stanbol (incubating)
☆114Updated last year
elsevierlabs-os / soda
Solr Dictionary Annotator (Microservice for Spark)
☆71Updated 5 years ago
lintool / warcbase
Warcbase is an open-source platform for managing analyzing web archives
☆162Updated 7 years ago
fergiemcdowall / solrstrap
Solrstrap is a Query-Result interface for Solr written in JavaScript, HTML and CSS
☆87Updated 8 years ago
OpenSextant / SolrTextTagger
A text tagger based on Lucene / Solr, using FST technology
☆177Updated last year
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆37Updated 10 months ago
diegoceccarelli / json-wikipedia
Json Wikipedia, contains code to convert the Wikipedia xml dump into a json/avro dump
☆254Updated last year
paulhoule / infovore
RDF-Centric Map/Reduce Framework and Freebase data conversion tool
☆148Updated 3 years ago
Wikidata / primarysources
Approve or reject statements from third-party datasets
☆146Updated 7 years ago
USCDataScience / SentimentAnalysisParser
Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.
☆34Updated 2 years ago
dswarm / dswarm
an open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki)
☆54Updated 7 years ago
behas / lucene-skos
SKOS Support for Apache Lucene and Solr
☆56Updated 4 years ago
opensemanticsearch / solr-ontology-tagger
Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri
☆47Updated 3 years ago
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆246Updated 3 weeks ago
cumulusrdf / cumulusrdf
RDF store on a cloud-based architecture (previously on https://code.google.com/p/cumulusrdf)
☆31Updated 9 years ago
sematext / solr-autocomplete
Solr AutoComplete implementation
☆59Updated 8 years ago
dkpro / dkpro-tc
UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
☆35Updated 2 years ago
ziqizhang / jate
NEWS: JATE2.0 Beta.11 Released, see details below.
☆82Updated 2 years ago
shilad / wikibrain
The WikiBrain Java library enables researchers and developers to incorporate state-of-the-art Wikipedia-based algorithms and technologies…
☆95Updated 7 years ago
jrvosse / wordnet-3.0-rdf
The linked open dataset described at http://datahub.io/dataset/vu-wordnet, and the tools used to create it
☆25Updated 5 years ago
o19s / match-query-parser
Search a single field with different query time analyzers in Solr
☆25Updated 5 years ago
o19s / Spyglass
Simple search results with Solr and EmberJS
☆58Updated 6 years ago
DiceTechJobs / SolrPlugins
Dice Solr Plugins from Simon Hughes Dice.com
☆88Updated 4 years ago
spinscale / elasticsearch-opennlp-plugin
Additional opennlp mapping type for elasticsearch in order to perform named entity recognition
☆136Updated 9 years ago
lucidworks / query-autofiltering-component
A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata
☆27Updated 7 years ago
spaziocodice / SolRDF
An RDF plugin for Solr
☆115Updated 8 months ago
zelandiya / maui
☆185Updated 6 years ago
opencog / relex
English Dependency Relationship Extractor
☆85Updated 10 months ago