CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56 · Apr 26, 2021 · Updated 4 years ago
Alternatives and similar repositories for cc-warc-examples
Users who are interested in cc-warc-examples are comparing it to the repositories listed below
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆18 · Jun 9, 2022 · Updated 3 years ago
- Import entities from another Wikibase instance (e.g. Wikidata) ☆13 · May 21, 2023 · Updated 2 years ago
- Metadata is lost when copying files around. It happens with cp, tar, rsync, Finder, Transmit, PathFinder, etc. ☆27 · Jan 17, 2015 · Updated 11 years ago
- Tools to Work with the Web Archive Ecosystem in R ☆21 · Aug 20, 2017 · Updated 8 years ago
- Hadoop tools for manipulating ClueWeb collections ☆26 · Jul 15, 2016 · Updated 9 years ago
- Library for Object Linking and Embedding (OLE) data types ☆12 · Nov 27, 2025 · Updated 3 months ago
- Transaction scoring demo with RedisAI ☆17 · May 15, 2024 · Updated last year
- Generic framework for information extraction tasks, including recognition of named entities, temporal expressions, spatial expressions an… ☆13 · Jun 5, 2023 · Updated 2 years ago
- 🔬Experimental Minio (S3) Gateway for iRODS 💾 ☆12 · Aug 13, 2019 · Updated 6 years ago
- CDXJ Indexing of WARC/ARCs ☆33 · Dec 10, 2024 · Updated last year
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆38 · Mar 12, 2026 · Updated last week
- A memory-based morphological parser for Python ☆16 · Oct 12, 2012 · Updated 13 years ago
- Code for preservation simulation/modeling project ☆10 · Aug 24, 2021 · Updated 4 years ago
- List of people, organisations, groups, … doing datavis in Berlin ☆11 · Feb 20, 2026 · Updated 3 weeks ago
- Node wrapper for Ark-TweetNLP. ☆16 · Nov 8, 2015 · Updated 10 years ago
- utility to fetch provenance information from Internet Archive's Wayback Machine ☆14 · Feb 5, 2026 · Updated last month
- Browser based post correction tool for Alto XML files ☆14 · Sep 20, 2013 · Updated 12 years ago
- DoSeR with entity disambiguation components only ☆16 · Jan 29, 2019 · Updated 7 years ago
- Exercises I've done for learning the Drools Rules Language ☆12 · Jun 16, 2013 · Updated 12 years ago
- a rails engine to create Microsoft Word documents from your rails application ☆20 · Jan 9, 2026 · Updated 2 months ago
- This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading … ☆18 · Jan 27, 2024 · Updated 2 years ago
- This is a Fact based Question Answering System using Apache Solr as backend search engine, Wikipedia dumps as information source, Apache … ☆26 · Jan 21, 2026 · Updated last month
- COVID-19(2019-nCoV) traces data Knowledge Graph ☆17 · Apr 6, 2020 · Updated 5 years ago
- Add IIP layering support to the Leaflet library ☆14 · Jul 28, 2016 · Updated 9 years ago
- Read deCSSed DVD image ☆33 · Jul 5, 2016 · Updated 9 years ago
- Workshop material on Rust iterators, pattern matching and creative coding ☆12 · Sep 8, 2022 · Updated 3 years ago
- publish a simple website from a public google drive folder ☆20 · Aug 22, 2016 · Updated 9 years ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler ☆15 · Jul 2, 2021 · Updated 4 years ago
- Description des formats de fichier ☆11 · Feb 4, 2022 · Updated 4 years ago
- ☆20 · Oct 13, 2016 · Updated 9 years ago
- Prototype wikidata portal project. ☆10 · May 3, 2024 · Updated last year
- Application which supports the UNC Libraries' Digital Collections Repository ☆12 · Updated this week
- A tool for detecting viruses and NSFW material in WARC files ☆18 · Dec 16, 2025 · Updated 3 months ago
- Test files for conformance testing and benchmarking Jpylyzer. ☆18 · Apr 2, 2024 · Updated last year
- ☆15 · Nov 3, 2020 · Updated 5 years ago
- ☆11 · May 1, 2022 · Updated 3 years ago
- ActivityPub, Activitystreams on Solid POD platform ☆15 · Jan 6, 2023 · Updated 3 years ago
- Django app for managing PREMIS Events ☆14 · Mar 9, 2026 · Updated last week