LAW-Unimi / BUbiNG
The LAW next generation crawler.
☆87 Updated 3 years ago
Alternatives and similar repositories for BUbiNG:
Users who are interested in BUbiNG are comparing it to the libraries listed below.
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆188 Updated this week
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 4 years ago
- A language detection Web Service ☆53 Updated 7 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆81 Updated 6 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 Updated 4 months ago
- A simple package for using WebGraph data in Python (via the Jython interpreter). ☆19 Updated 4 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆24 Updated 7 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 3 weeks ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆189 Updated 6 years ago
- AngularJS diagnostic search services for Solr, Elasticsearch, and OpenSearch ☆26 Updated last month
- Index Common Crawl archives in tabular format ☆118 Updated last month
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 Updated 2 years ago
- Common Crawl Index Server ☆68 Updated last month
- API definition, resources and reference implementation of URL Frontiers ☆48 Updated this week
- A curated list of Awesome Apache Solr links and resources. ☆107 Updated 3 years ago
- Common Crawl fork of Apache Nutch ☆33 Updated 3 weeks ago
- Extract statistics from Wikipedia Dump files. ☆26 Updated 3 years ago
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆67 Updated 4 years ago
- This module contains an implementation of the Nilsimsa locality-sensitive hashing algorithm in Java. ☆18 Updated 5 years ago
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆34 Updated last year
- A cookiecutter template for an elasticsearch ingest processor plugin ☆47 Updated 2 years ago
- A curated list of Awesome Apache Lucene links and resources. ☆26 Updated 6 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 Updated last year
- Lucene Auto Phrase TokenFilter implementation ☆59 Updated 6 years ago
- Visualization of results returned by Solr 6 graph queries ☆10 Updated 8 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 3 years ago
- Machine-readable Taxonomies with ID mappings ☆65 Updated 7 years ago