RovoMe / JIRLbot
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion Pages and Beyond"
☆16 Updated 7 years ago
Alternatives and similar repositories for JIRLbot
Users that are interested in JIRLbot are comparing it to the libraries listed below
- API definition, resources and reference implementation of URL Frontiers ☆48 Updated 2 weeks ago
- Apache Commons RDF ☆47 Updated last week
- Apache OpenNLP Sandbox ☆43 Updated this week
- SKOS Support for Apache Lucene and Solr ☆56 Updated 4 years ago
- Solr Redis Extensions ☆52 Updated last year
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆215 Updated 2 years ago
- Highly performant, lightweight framework for linked data processing. Supports RDFa, JSON-LD, RDF/XML and plain text formats, runs on Andr… ☆52 Updated 2 years ago
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆103 Updated last week
- Ingest processor doing language detection for fields ☆72 Updated 2 years ago
- Elasticsearch plugin for b-bit minhash algorithm ☆63 Updated 11 months ago
- Apache Anything To Triples (Any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a… ☆96 Updated last year
- Let's try batching some cypher queries ☆11 Updated 9 years ago
- The LAW next generation crawler. ☆87 Updated 3 years ago
- Mirror of Apache Marmotta ☆54 Updated 5 years ago
- SOLR bulk indexing utility for the command line. ☆45 Updated last month
- Solr AutoComplete implementation ☆59 Updated 7 years ago
- Common web archive utility code. ☆55 Updated 2 months ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- Java parsers for different RDF serialisations + API + tools + JAX-RS integration ☆20 Updated 3 years ago
- Migrate Redis data from source to destination ☆9 Updated 4 years ago
- Java OWL Persistence API ☆36 Updated 2 weeks ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 3 weeks ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 2 years ago
- Java library for reading and writing WARC files with a typed API ☆48 Updated 4 months ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆184 Updated last week
- Write JDBC ResultSet to Parquet File ☆11 Updated last month
- Asynchronous search makes it possible for users to run queries in the background, allowing users to track the progress, and retrieve par… ☆23 Updated 4 years ago
- Distributed processing framework for search solutions ☆81 Updated 2 years ago
- An RDF plugin for Solr ☆114 Updated 3 months ago
- Extension of the rdf3x engine and the translatesparql tool. ☆45 Updated 11 years ago