LAW-Unimi / BUbiNG
The LAW next-generation crawler.
☆87 · Updated 3 years ago
Alternatives and similar repositories for BUbiNG:
Users who are interested in BUbiNG compare it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆240 · Updated last month
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 3 years ago
- A language detection Web Service ☆52 · Updated 7 years ago
- Algorithms that build k-nearest neighbors graph (k-nn graph): Brute-force, NN-Descent,... ☆34 · Updated 5 years ago
- Various utilities regarding Levenshtein transducers. (Java) ☆57 · Updated 3 years ago
- Traptor -- A distributed Twitter feed ☆26 · Updated 2 years ago
- Disk-backed hashmaps for Java ☆30 · Updated 8 years ago
- This module contains an implementation of the Nilsimsa locality-sensitive hashing algorithm in Java. ☆18 · Updated 5 years ago
- Java port of TLSH (Trend Micro Locality Sensitive Hash) ☆20 · Updated 3 years ago
- Probabilistic data structures server. The data model is key-value, where values are: Bloomfilters, LinearCounters, HyperLogLogs, CountMin… ☆24 · Updated 8 years ago
- A distributed in-memory key-value storage for billions of small objects. ☆23 · Updated 5 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 · Updated last year
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆33 · Updated last year
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆212 · Updated 2 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆81 · Updated 6 years ago
- Java implementation of the Sparkey key value store ☆120 · Updated 11 months ago
- High-performance pattern matching algorithms in Java ☆80 · Updated 4 years ago
- Common Crawl fork of Apache Nutch ☆29 · Updated last week
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 · Updated 3 years ago
- A curated list of Awesome Apache Solr links and resources. ☆107 · Updated 3 years ago
- Dump TheMovieDB ☆24 · Updated 3 years ago
- Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and D… ☆16 · Updated 7 years ago
- A Mixed Trie and Levenshtein distance implementation in Java for extremely fast prefix string searching and string similarity. ☆43 · Updated 2 years ago
- Java text categorization system ☆55 · Updated 7 years ago
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services ☆25 · Updated 6 months ago
- Index Common Crawl archives in tabular format ☆109 · Updated 2 months ago
- A java library for stored queries ☆16 · Updated last year
- JSuffixArrays (Suffix Arrays in Java) ☆59 · Updated 7 years ago
- Java Matrix Benchmark is a tool for evaluating Java linear algebra libraries for speed, stability, and memory usage. ☆59 · Updated last year