CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56Apr 26, 2021Updated 5 years ago
Alternatives and similar repositories for cc-warc-examples
Users that are interested in cc-warc-examples are comparing it to the libraries listed below.
- ☆14Jan 3, 2024Updated 2 years ago
- Import entities from another Wikibase instance (e.g. Wikidata)☆13May 21, 2023Updated 3 years ago
- Getting started with Redis Streams & Java☆10Dec 2, 2024Updated last year
- Metadata is lost when copying files around. It happens with cp, tar, rsync, Finder, Transmit, PathFinder, etc.☆27Jan 17, 2015Updated 11 years ago
- Crawlera tools☆26Feb 9, 2016Updated 10 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Simple example on how to use Naive Bayes on Spark using the popular Reuters 21578 dataset☆23Jul 20, 2014Updated 11 years ago
- CDXJ Indexing of WARC/ARCs☆34May 11, 2026Updated last month
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Updated this week
- Tools for tracking stories on news homepages☆47Oct 22, 2019Updated 6 years ago
- API implementation, User Interface, and more modules of the IPTC EXTRA project☆13Feb 14, 2022Updated 4 years ago
- Java command line tool to convert PAGE XML files with layout and text content to PDF☆10Apr 27, 2020Updated 6 years ago
- Blog Helper is an Alexa skill that provides a voice interface for WordPress.com blogs☆13Mar 3, 2025Updated last year
- Converts WARC files to static HTML☆58Sep 18, 2025Updated 9 months ago
- Extraction code used to create the Dresden Web Table Corpus☆14Feb 25, 2015Updated 11 years ago
- Declarative parser for remapping object schemas and data☆14Dec 19, 2022Updated 3 years ago
- A fork of the disktype disk and disk image format detection tool☆11Nov 16, 2016Updated 9 years ago
- Tools for Creating Universal Numeric Fingerprints for Data☆22Apr 12, 2022Updated 4 years ago
- my dissertation!☆12Sep 6, 2022Updated 3 years ago
- Rails application to support the Sloan Dash grant project for self-deposit submission of scholarly works.☆17Aug 13, 2019Updated 6 years ago
- Add IIP layering support to the Leaflet library☆14Jul 28, 2016Updated 9 years ago
- publish a simple website from a public google drive folder☆20Aug 22, 2016Updated 9 years ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler☆15Jul 2, 2021Updated 5 years ago
- Layouts designed in the Angular Material. The Angular Material project is an implementation of Material Design in Angular.js.☆12Jan 18, 2016Updated 10 years ago
- ☆21Oct 13, 2016Updated 9 years ago
- ☆10Aug 15, 2017Updated 8 years ago
- This is a Fact based Question Answering System using Apache Solr as backend search engine, Wikipedia dumps as information source, Apache …☆26Jan 21, 2026Updated 5 months ago
- Research Object manager command line tool(s) and web service☆16Nov 30, 2017Updated 8 years ago
- Application which supports the UNC Libraries' Digital Collections Repository☆12Updated this week
- Keystone Password Reset Plugin for Keystone 4.0☆13May 16, 2019Updated 7 years ago
- Global Names Index☆22Jul 26, 2021Updated 4 years ago
- A tool for detecting viruses and NSFW material in WARC files☆18Jun 9, 2026Updated 3 weeks ago
- Advanced Reconnaissance Framework — Google Chrome extension☆31Mar 4, 2016Updated 10 years ago
- A Java library to generate random data for all sorts of things. Java random data faker☆28Mar 31, 2021Updated 5 years ago
- ☆15Nov 3, 2020Updated 5 years ago
- SIARD (Software Independent Archiving of Relational Databases) - an open file format for the long-term archiving of relational databases☆12Nov 14, 2024Updated last year
- WordWanderer – take your text for a walk☆12May 14, 2019Updated 7 years ago
- ☆20Jun 4, 2021Updated 5 years ago
- Instantly take encrypted notes inside Chrome or Firefox☆17Sep 30, 2019Updated 6 years ago