LAW-Unimi / BUbiNG
The LAW next generation crawler.
☆86 · Updated 3 years ago
Related projects
Alternatives and complementary repositories for BUbiNG
- Index Common Crawl archives in tabular format ☆106 · Updated this week
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 · Updated 9 months ago
- Common web archive utility code. ☆50 · Updated last month
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆42 · Updated 6 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆183 · Updated this week
- A set of reusable Java components that implement functionality common to any web crawler ☆237 · Updated this week
- A language detection Web Service ☆53 · Updated 7 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆181 · Updated 6 years ago
- API definition, resources and reference implementation of URL Frontiers ☆46 · Updated this week
- Common Crawl fork of Apache Nutch ☆28 · Updated this week
- Scrapy middleware which allows to crawl only new content ☆79 · Updated 2 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 3 years ago
- A framework to benchmark different graph databases, based on generated data from customizable schema, distribution, and size. ☆26 · Updated 5 years ago
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services ☆25 · Updated 5 months ago
- Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and D… ☆17 · Updated 7 years ago
- Common Crawl Index Server ☆65 · Updated 10 months ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆80 · Updated 6 years ago
- Natural language detection, Java bindings for CLD2 ☆14 · Updated last week
- A list of memex-related tools and their repository URLs ☆144 · Updated 6 years ago
- Formasaurus tells you the type of an HTML form and its fields using machine learning ☆117 · Updated 5 months ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Updated 7 years ago
- Disk-backed hashmaps for Java ☆30 · Updated 8 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆37 · Updated 4 months ago
- A Java library for working with Table Schema. ☆25 · Updated 10 months ago
- Index URLs in Common Crawl ☆193 · Updated 7 years ago
- A curated list of Awesome Apache Solr links and resources. ☆106 · Updated 3 years ago
- Scraping Tweet data for Russian Troll Twitter accounts into Neo4j ☆57 · Updated 6 years ago
- A high performance "thin wrapper" HTTP REST server on top of Apache Lucene ☆137 · Updated 6 months ago
- Sux4J is an effort to bring succinct data structures to Java. ☆154 · Updated last year
- The Chronos versioning project aims to provide easy-to-use and reliable versioned data storage. ☆52 · Updated 4 years ago