CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56 · Apr 26, 2021 · Updated 4 years ago
Alternatives and similar repositories for cc-warc-examples
Users that are interested in cc-warc-examples are comparing it to the libraries listed below.
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- Metadata is lost when copying files around. It happens with cp, tar, rsync, Finder, Transmit, PathFinder, etc. ☆27 · Jan 17, 2015 · Updated 11 years ago
- This is an NLP pipeline based on RedisGears ☆14 · Mar 26, 2023 · Updated 3 years ago
- Tools to Work with the Web Archive Ecosystem in R ☆21 · Aug 20, 2017 · Updated 8 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 · Aug 12, 2018 · Updated 7 years ago
- Simple example on how to use Naive Bayes on Spark using the popular Reuters 21578 dataset ☆23 · Jul 20, 2014 · Updated 11 years ago
- 🔬 Experimental Minio (S3) Gateway for iRODS 💾 ☆12 · Aug 13, 2019 · Updated 6 years ago
- CDXJ Indexing of WARC/ARCs ☆33 · Dec 10, 2024 · Updated last year
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 · Mar 12, 2026 · Updated 3 weeks ago
- A memory-based morphological parser for Python ☆16 · Oct 12, 2012 · Updated 13 years ago
- Code for preservation simulation/modeling project ☆10 · Aug 24, 2021 · Updated 4 years ago
- List of people, organisations, groups, … doing datavis in Berlin ☆11 · Mar 17, 2026 · Updated 3 weeks ago
- Tools for tracking stories on news homepages ☆48 · Oct 22, 2019 · Updated 6 years ago
- Web scraping of the case law of the Tribunal de Justiça do Distrito Federal (Court of Justice of the Federal District) ☆11 · Dec 8, 2018 · Updated 7 years ago
- API implementation, User Interface, and more modules of the IPTC EXTRA project ☆13 · Feb 14, 2022 · Updated 4 years ago
- Converts WARC files to static HTML ☆52 · Sep 18, 2025 · Updated 6 months ago
- Java command line tool to convert PAGE XML files with layout and text content to PDF ☆10 · Apr 27, 2020 · Updated 5 years ago
- Node wrapper for Ark-TweetNLP. ☆16 · Nov 8, 2015 · Updated 10 years ago
- utility to fetch provenance information from Internet Archive's Wayback Machine ☆15 · Feb 5, 2026 · Updated 2 months ago
- Extraction code used to create the Dresden Web Table Corpus ☆14 · Feb 25, 2015 · Updated 11 years ago
- Repository for the paper: "Using deep learning to predict outcomes of legal appeals better than human experts" ☆10 · Aug 1, 2022 · Updated 3 years ago
- Browser based post correction tool for Alto XML files ☆14 · Sep 20, 2013 · Updated 12 years ago
- Lint CLI for languagetool. ☆14 · Apr 5, 2023 · Updated 3 years ago
- A fork of the disktype disk and disk image format detection tool ☆11 · Nov 16, 2016 · Updated 9 years ago
- DoSeR with entity disambiguation components only ☆16 · Jan 29, 2019 · Updated 7 years ago
- Rails application to support the Sloan Dash grant project for self-deposit submission of scholarly works. ☆17 · Aug 13, 2019 · Updated 6 years ago
- COVID-19 (2019-nCoV) traces data Knowledge Graph ☆17 · Apr 6, 2020 · Updated 6 years ago
- Add IIP layering support to the Leaflet library ☆14 · Jul 28, 2016 · Updated 9 years ago
- Read deCSSed DVD image ☆33 · Jul 5, 2016 · Updated 9 years ago
- Workshop material on Rust iterators, pattern matching and creative coding ☆12 · Sep 8, 2022 · Updated 3 years ago
- publish a simple website from a public google drive folder ☆20 · Aug 22, 2016 · Updated 9 years ago
- Node.js bindings to Tantivy Search ☆13 · Dec 8, 2022 · Updated 3 years ago
- ☆10 · Aug 15, 2017 · Updated 8 years ago
- Research Object manager command line tool(s) and web service ☆16 · Nov 30, 2017 · Updated 8 years ago
- Repository for revision of PREMIS OWL ontology group ☆13 · May 12, 2022 · Updated 3 years ago
- A semantic analysis tool to generate synonym.txt files for Solr. [RETIRED] ☆25 · Sep 14, 2016 · Updated 9 years ago
- Global Names Index ☆22 · Jul 26, 2021 · Updated 4 years ago
- A tool for detecting viruses and NSFW material in WARC files ☆18 · Dec 16, 2025 · Updated 3 months ago
- Test files for conformance testing and benchmarking Jpylyzer. ☆18 · Apr 2, 2024 · Updated 2 years ago