LAW-Unimi / BUbiNG
The LAW next generation crawler.
☆87 Updated 3 years ago
Alternatives and similar repositories for BUbiNG
Users who are interested in BUbiNG are comparing it to the libraries listed below:
- A framework to benchmark different graph databases, based on generated data from customizable schema, distribution, and size. ☆25 Updated 6 years ago
- Common web archive utility code. ☆55 Updated 2 months ago
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services ☆26 Updated 2 months ago
- A language detection Web Service ☆53 Updated 8 years ago
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆34 Updated 2 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆188 Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆215 Updated 2 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 Updated 5 months ago
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated 3 weeks ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- ☆49 Updated 8 years ago
- Text similarity based on Word2Vec vectors. ☆11 Updated 8 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆24 Updated 7 years ago
- A cookiecutter template for an elasticsearch ingest processor plugin ☆47 Updated 2 years ago
- Bullet is a streaming query engine that can be plugged into any singular data stream using a Stream Processing framework like Apache Stor… ☆41 Updated 2 years ago
- ☆12 Updated 4 years ago
- A Python library to detect and extract listing data from HTML pages. ☆108 Updated 8 years ago
- How to spot first stories on Twitter using Storm. ☆125 Updated last year
- A comparative benchmark between relational database systems and their graph based counterpart. ☆37 Updated 7 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 3 years ago
- PageRank in Spark ☆74 Updated 2 years ago
- Extract statistics from Wikipedia Dump files. ☆26 Updated 3 years ago
- A curated list of Awesome Apache Solr links and resources. ☆107 Updated 3 years ago
- A curated list of Awesome Apache Lucene links and resources. ☆27 Updated 6 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- Aiohttp web server API, which scrapes Google and returns scrape results as response. Supports proxies, multiple geos and number of result… ☆56 Updated last year
- A framework for scalable graph computing. ☆147 Updated 6 years ago
- A distributed database with a built in streaming data platform ☆58 Updated 3 months ago
- An HTTP proxy for Elasticsearch, Solr (etc.) to prevent a 100% full disk situation. ☆11 Updated 6 years ago