RovoMe / JIRLbot
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion Pages and Beyond".
☆16 · Updated 7 years ago
Alternatives and similar repositories for JIRLbot:
Users who are interested in JIRLbot are comparing it to the libraries listed below.
- API definition, resources and reference implementation of URL Frontiers ☆48 · Updated 2 weeks ago
- The LAW next generation crawler. ☆87 · Updated 3 years ago
- Lucene Directory implementation for AWS S3 ☆41 · Updated last month
- Ingest processor doing language detection for fields ☆72 · Updated 2 years ago
- Production-ready Java implementation of the Xor Filter. ☆17 · Updated 5 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 · Updated last year
- Sux4J is an effort to bring succinct data structures to Java. ☆161 · Updated last year
- Github mirror of "search/extra" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for c… ☆53 · Updated last month
- Solr Dictionary Annotator (Microservice for Spark) ☆71 · Updated 5 years ago
- Mirror of Apache OpenNLP Add-ons ☆17 · Updated this week
- Apache NLPCraft - API to convert natural language into actions. ☆79 · Updated last month
- Zulia Search Engine ☆32 · Updated last week
- Apache OpenNLP Sandbox ☆42 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 · Updated 2 years ago
- Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary. ☆189 · Updated last year
- Pure Java implementations of Murmur hash algorithms ☆73 · Updated last year
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 · Updated 3 years ago
- The GATE Embedded core API and GATE Developer application ☆82 · Updated 4 months ago
- Elasticsearch plugin for b-bit minhash algorithm ☆62 · Updated 9 months ago
- Browser-driven explorer for lucene indexes ☆74 · Updated 3 years ago
- SOLR bulk indexing utility for the command line. ☆45 · Updated 3 weeks ago
- A Java library capable of constructing character-sequence-storing, directed acyclic graphs of minimal size ☆43 · Updated 11 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 3 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆71 · Updated 11 months ago
- Benchmarks for the RediSearch module ☆44 · Updated 2 years ago
- Implementation of Vision Based Page Segmentation algorithm in Java ☆101 · Updated 5 years ago
- An elasticsearch plugin to create hierarchical aggregations ☆51 · Updated last week
- A high performance "thin wrapper" HTTP REST server on top of Apache Lucene ☆143 · Updated 10 months ago
- Analyzes chronological patterns present in time-series data and provides human-readable descriptions ☆24 · Updated 2 years ago
- Various utilities regarding Levenshtein transducers. (Java) ☆57 · Updated 3 years ago