CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Mar 12, 2026 · Updated last week
Alternatives and similar repositories for cc-warc-examples
Users that are interested in cc-warc-examples are comparing it to the libraries listed below
- Launch AWS Elastic MapReduce jobs that process Common Crawl data. ☆49 · Feb 15, 2017 · Updated 9 years ago
- Common web archive utility code. ☆63 · Mar 2, 2026 · Updated 2 weeks ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 · Jun 12, 2020 · Updated 5 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org ☆12 · May 17, 2014 · Updated 11 years ago
- A Python library to simplify batch requests to AWS Services ☆12 · Apr 25, 2020 · Updated 5 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 · Aug 12, 2018 · Updated 7 years ago
- ☆15 · Dec 1, 2021 · Updated 4 years ago
- Discover how you can migrate from traditional deployments to serverless architectures with AWS ☆12 · Feb 1, 2019 · Updated 7 years ago
- Code for the paper Faster Phrase-Based Decoding by Refining Feature State ☆14 · Jan 9, 2023 · Updated 3 years ago
- Index Common Crawl archives in tabular format ☆126 · Mar 4, 2026 · Updated 2 weeks ago
- PredictionIO word2vec engine template (Scala-based parallelized engine) ☆12 · Apr 22, 2015 · Updated 10 years ago
- XPath extension for extraction from interactive web sites. NOTE: This code is currently out of sync. A more recent, but precompiled versi… ☆27 · Feb 27, 2013 · Updated 13 years ago
- AWS Lambda layer containing latest version of Apache Tika ☆14 · Jul 10, 2025 · Updated 8 months ago
- ☆15 · Updated this week
- Content and Instructions for completing the "Making Things Right with AWS Lambda and AWS Config Rules" Workshop. ☆22 · Nov 27, 2017 · Updated 8 years ago
- My attempt to learn more than one Deep Learning framework ☆15 · Apr 7, 2019 · Updated 6 years ago
- Natural language detection, Java bindings for CLD2 ☆17 · Feb 26, 2026 · Updated 3 weeks ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format) ☆65 · Aug 5, 2016 · Updated 9 years ago
- Set of scripts to aid in the download of the GDELT data files from gdelt.utdallas.edu ☆18 · May 14, 2014 · Updated 11 years ago
- A bidirectional LSTM example for sequence labeling. ☆13 · May 23, 2018 · Updated 7 years ago
- A SPARQL client for Amazon Neptune that includes AWS Signature Version 4 signing. Implemented as an RDF4J repository. ☆23 · Mar 2, 2026 · Updated 2 weeks ago
- A simple CDK app written in Kotlin using Gradle DSL ☆12 · Dec 28, 2018 · Updated 7 years ago
- The Dynamic Rules Engine is a serverless application that enables real-time evaluation of rules against sensor data, leveraging AWS Kines… ☆11 · Sep 25, 2024 · Updated last year
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Jan 28, 2024 · Updated 2 years ago
- ☆20 · Jan 19, 2019 · Updated 7 years ago
- Ready-to-use examples of dkpro-core components and pipelines. ☆35 · Dec 16, 2023 · Updated 2 years ago
- PETRARCH actor, agent and verb dictionaries ☆22 · Aug 3, 2018 · Updated 7 years ago
- CIKM Cup 2016 (1st Place) - Track 1 - Cross Device Entity Linking ☆18 · Sep 19, 2017 · Updated 8 years ago
- JUnitBenchmarks (git clone of the head SVN) ☆58 · Apr 13, 2015 · Updated 10 years ago
- ☆14 · Jul 1, 2025 · Updated 8 months ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆15 · Sep 13, 2017 · Updated 8 years ago
- Pure JAX-RS 2.0 ClientRequestFilter/WriterInterceptor used to sign AWS REST requests. Also has presign capabilities. ☆15 · Jan 4, 2022 · Updated 4 years ago
- Kylo integration with PDND (previously DAF). ☆19 · Nov 16, 2022 · Updated 3 years ago
- [FFCV-PL] manage fast data loading with ffcv and pytorch lightning ☆16 · Jul 17, 2023 · Updated 2 years ago
- ☆14 · Mar 19, 2025 · Updated last year
- ☆15 · Aug 15, 2012 · Updated 13 years ago
- Process Common Crawl data with Python and Spark ☆453 · Jan 20, 2026 · Updated 2 months ago
- ☆14 · Jun 13, 2024 · Updated last year
- Triple Pattern Fragment server that uses Blazegraph as backend ☆14 · May 20, 2023 · Updated 2 years ago