RovoMe / JIRLbot
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion Pages and Beyond"
☆16 Updated 8 years ago
Alternatives and similar repositories for JIRLbot
Users who are interested in JIRLbot compare it to the libraries listed below.
- The LAW next generation crawler. ☆87 Updated 3 years ago
- API definition, resources and reference implementation of URL Frontiers ☆50 Updated 3 weeks ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 2 weeks ago
- Github mirror of "search/extra" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for c… ☆54 Updated 3 weeks ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 Updated last year
- command line tool for Apache Lucene ☆163 Updated last month
- Entity resolution for Elasticsearch. ☆161 Updated 6 months ago
- Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation) ☆184 Updated 2 weeks ago
- Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms. ☆386 Updated this week
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆193 Updated 2 weeks ago
- Collection of software components for natural language processing (NLP) based on the Apache UIMA framework. ☆199 Updated last month
- Demonstration of searching PDF document with Solr, Tika, and Tesseract ☆31 Updated 9 months ago
- An Elasticsearch ingest processor to do named entity extraction using Apache OpenNLP ☆273 Updated 2 years ago
- Browser-driven explorer for lucene indexes ☆74 Updated 3 years ago
- Benchmark of open source, embedded, memory-mapped, key-value stores available from Java (JMH) ☆142 Updated 2 years ago
- Apache NLPCraft - API to convert natural language into actions. ☆82 Updated 2 months ago
- Ingest processor doing language detection for fields ☆72 Updated 2 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 3 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆57 Updated 4 years ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆924 Updated this week
- Towards an open source stack for e-commerce search ☆149 Updated 5 months ago
- Java library for reading and writing WARC files with a typed API ☆49 Updated 3 weeks ago
- A high performance "thin wrapper" HTTP REST server on top of Apache Lucene ☆143 Updated last year
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆102 Updated 3 weeks ago
- ACHE is a web crawler for domain-specific search. ☆469 Updated last year
- XML/Document DB on top of distributed cache ☆41 Updated 6 years ago
- Distributed processing framework for search solutions ☆81 Updated 2 years ago
- Various utility scripts for running Lucene performance tests ☆217 Updated last week
- Migrate Redis data from source to destination ☆9 Updated 5 years ago
- Hardened Fork of Ranklib learning to rank library ☆44 Updated 2 years ago