LAW-Unimi / BUbiNG
The LAW next generation crawler.
☆87 Updated 3 years ago
Alternatives and similar repositories for BUbiNG:
Users who are interested in BUbiNG are comparing it to the libraries listed below.
- Common web archive utility code. ☆55 Updated 2 weeks ago
- Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and D… ☆16 Updated 7 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 Updated 3 months ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 3 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆243 Updated 2 weeks ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆187 Updated this week
- Various utilities regarding Levenshtein transducers. (Java) ☆57 Updated 3 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 Updated 7 years ago
- API definition, resources and reference implementation of URL Frontiers ☆48 Updated 2 weeks ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- A framework to benchmark different graph databases, based on generated data from customizable schema, distribution, and size. ☆26 Updated 6 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services ☆26 Updated last month
- A cookiecutter template for an elasticsearch ingest processor plugin ☆47 Updated 2 years ago
- An HTTP proxy for Elasticsearch, Solr (etc.) to prevent a 100% full disk situation. ☆11 Updated 6 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 Updated last year
- A simple package allowing to use WebGraph data in Python (via the Jython interpreter). ☆19 Updated 4 years ago
- Tools and other things for people who work on search relevance & information retrieval ☆83 Updated last year
- Search relevance evaluation toolkit ☆31 Updated 2 years ago
- Query preprocessor for Java-based search engines (Querqy Core and Solr implementation) ☆184 Updated 2 weeks ago
- General Architecture for Text Engineering ☆48 Updated 9 years ago
- A curated list of Awesome Apache Solr links and resources. ☆107 Updated 3 years ago
- A fast and simple JavaScript library specifically targeted at collecting search and search related browser events. ☆40 Updated 7 months ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆904 Updated last week
- A language detection Web Service ☆53 Updated 7 years ago
- Java text categorization system ☆55 Updated 7 years ago
- This module contains an implementation of the Nilsimsa locality-sensitive hashing algorithm in Java. ☆18 Updated 5 years ago
- Text similarity based on Word2Vec vectors. ☆11 Updated 8 years ago
- A java library for stored queries ☆16 Updated last year