CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56 Apr 26, 2021 Updated 5 years ago
Alternatives and similar repositories for cc-warc-examples
Users who are interested in cc-warc-examples are comparing it to the libraries listed below.
- ☆14 Jan 3, 2024 Updated 2 years ago
- Import entities from another Wikibase instance (e.g. Wikidata) ☆13 May 21, 2023 Updated 3 years ago
- Metadata is lost when copying files around. It happens with cp, tar, rsync, Finder, Transmit, PathFinder. etc. ☆27 Jan 17, 2015 Updated 11 years ago
- Crawlera tools ☆26 Feb 9, 2016 Updated 10 years ago
- Hadoop tools for manipulating ClueWeb collections ☆26 Jul 15, 2016 Updated 9 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆37 Aug 12, 2018 Updated 7 years ago
- Transaction scoring demo with RedisAI ☆18 May 15, 2024 Updated 2 years ago
- Quick starter guide for java based Natural Language Processing training, saving model, loading model and inference. ☆12 Jul 9, 2018 Updated 7 years ago
- 🔬Experimental Minio (S3) Gateway for iRODS 💾 ☆12 Aug 13, 2019 Updated 6 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 Mar 12, 2026 Updated 3 months ago
- List of people, organisations, groups, … doing datavis in Berlin ☆11 Apr 13, 2026 Updated last month
- Code for preservation simulation/modeling project ☆10 Aug 24, 2021 Updated 4 years ago
- Tools for tracking stories on news homepages ☆48 Oct 22, 2019 Updated 6 years ago
- API implementation, User Interface, and more modules of the IPTC EXTRA project ☆13 Feb 14, 2022 Updated 4 years ago
- Java command line tool to convert PAGE XML files with layout and text content to PDF ☆10 Apr 27, 2020 Updated 6 years ago
- Node wrapper for Ark-TweetNLP. ☆16 Nov 8, 2015 Updated 10 years ago
- Converts WARC files to static HTML ☆56 Sep 18, 2025 Updated 8 months ago
- utility to fetch provenance information from Internet Archive's Wayback Machine ☆15 Feb 5, 2026 Updated 4 months ago
- Browser based post correction tool for Alto XML files ☆14 Sep 20, 2013 Updated 12 years ago
- DoSeR with entity disambiguation components only ☆16 Jan 29, 2019 Updated 7 years ago
- A fork of the disktype disk and disk image format detection tool ☆11 Nov 16, 2016 Updated 9 years ago
- This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading … ☆18 Jan 27, 2024 Updated 2 years ago
- Rails application to support the Sloan Dash grant project for self-deposit submission of scholarly works. ☆17 Aug 13, 2019 Updated 6 years ago
- COVID-19(2019-nCoV) traces data Knowledge Graph ☆17 Apr 6, 2020 Updated 6 years ago
- A set of reusable Java components that implement functionality common to any web crawler ☆258 Jun 3, 2026 Updated last week
- Add IIP layering support to the Leaflet library ☆14 Jul 28, 2016 Updated 9 years ago
- Workshop material on Rust iterators, pattern matching and creative coding ☆12 Sep 8, 2022 Updated 3 years ago
- publish a simple website from a public google drive folder ☆20 Aug 22, 2016 Updated 9 years ago
- ☆11 Jul 18, 2016 Updated 9 years ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler ☆15 Jul 2, 2021 Updated 4 years ago
- Prototype wikidata portal project. ☆10 May 3, 2024 Updated 2 years ago
- This is a Fact based Question Answering System using Apache Solr as backend search engine, Wikipedia dumps as information source, Apache … ☆26 Jan 21, 2026 Updated 4 months ago
- Research Object manager command line tool(s) and web service ☆16 Nov 30, 2017 Updated 8 years ago
- Repository for revision of PREMIS OWL ontology group ☆13 May 12, 2022 Updated 4 years ago
- Application which supports the UNC Libraries' Digital Collections Repository ☆12 Updated this week
- Global Names Index ☆22 Jul 26, 2021 Updated 4 years ago
- A tool for detecting viruses and NSFW material in WARC files ☆18 Updated this week
- NG Tool is Bash Script to allow create, delete, enable all, disable all, enable single or disable single vhosts for nginx virtual hosts … ☆13 Dec 15, 2023 Updated 2 years ago
- Test files for conformance testing and benchmarking Jpylyzer. ☆18 Apr 2, 2024 Updated 2 years ago