RovoMe / JIRLbot
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion Pages and Beyond"
☆16 Updated 7 years ago
Related projects:
- The LAW next generation crawler. ☆85 Updated 2 years ago
- API definition, resources and reference implementation of URL Frontiers ☆44 Updated last week
- A distributed in-memory key-value storage for billions of small objects. ☆23 Updated 5 years ago
- SOLR bulk indexing utility for the command line. ☆45 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆214 Updated last year
- A text tagger based on Lucene / Solr, using FST technology ☆173 Updated 9 months ago
- Solr Dictionary Annotator (Microservice for Spark) ☆70 Updated 4 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆80 Updated 6 years ago
- A language detection Web Service ☆52 Updated 7 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆183 Updated this week
- The JSON database for REST and Websocket storage ☆42 Updated 9 years ago
- Elasticsearch plugin for b-bit minhash algorithm ☆62 Updated 3 months ago
- Search Management UI ☆52 Updated last week
- A set of reusable Java components that implement functionality common to any web crawler ☆233 Updated last month
- Browser-driven explorer for Lucene indexes ☆72 Updated 3 years ago
- An Elasticsearch plugin to create hierarchical aggregations ☆51 Updated 6 months ago
- Common web archive utility code. ☆50 Updated last week
- A new Solr multilingual index and search architecture, it can support index and search across multiple languages at the same time in the … ☆13 Updated 4 years ago
- A fast and comprehensive Java library capable of performing automaton and non-automaton based Levenshtein distance determination and neig… ☆41 Updated 11 years ago
- The next generation of open source search ☆90 Updated 7 years ago
- Lucene plugin for indexing and searching files stored in Baratine distributed filesystem ☆16 Updated 8 years ago
- A vector similarity database ☆231 Updated 10 years ago
- Common Crawl Index Server ☆65 Updated 8 months ago
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆100 Updated 4 months ago
- Lucene Directory implementation for AWS S3 ☆37 Updated 2 years ago
- Distributed processing framework for search solutions ☆81 Updated last year
- A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test… ☆61 Updated last month
- Extremely fast and compact in-memory embedded column oriented database ☆18 Updated 7 years ago
- This module contains an implementation of the Nilsimsa locality-sensitive hashing algorithm in Java. ☆18 Updated 5 years ago
- This plugin provides a useful feature for multi-language ☆13 Updated 2 years ago