DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
☆52Jun 12, 2020Updated 5 years ago
Alternatives and similar repositories for dkpro-c4corpus
Users that are interested in dkpro-c4corpus are comparing it to the libraries listed below.
- code to remove "noise" from hOCR output of Tesseract OCR.☆14Oct 24, 2016Updated 9 years ago
- UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.☆35Dec 16, 2022Updated 3 years ago
- Ready-to-use examples of dkpro-core components and pipelines.☆35Dec 16, 2023Updated 2 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated 2 months ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Mar 12, 2026Updated 3 weeks ago
- Zero-Shot Translation implemented by Transformer☆14Mar 24, 2023Updated 3 years ago
- Heuristic based boilerplate removal tool☆814Feb 25, 2025Updated last year
- Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection☆16Jan 2, 2019Updated 7 years ago
- official diybookscanner repository☆39May 11, 2014Updated 11 years ago
- Enrycher API☆13Apr 19, 2012Updated 13 years ago
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition☆23Nov 22, 2023Updated 2 years ago
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/☆17Apr 11, 2018Updated 8 years ago
- A cluster implementation of simhash near-duplicate detection☆32Mar 11, 2015Updated 11 years ago
- Background materials for the article "Productivity Assessment of Neural Code Completion"☆15Jul 11, 2023Updated 2 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Configuration Space Exploration Framework☆17Oct 13, 2020Updated 5 years ago
- Python wrapper for UMLS REST API☆10Dec 17, 2018Updated 7 years ago
- Master thesis: Exploring bias in German NLG (GPT-3 & GerPT-2). Applies regard classification and bias mitigation triggers.☆16Sep 25, 2024Updated last year
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- Korean Nested Named Entity Corpus☆20May 13, 2023Updated 2 years ago
- Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe…☆171Dec 15, 2021Updated 4 years ago
- Events and Situations Ontology☆14Apr 20, 2018Updated 7 years ago
- Federated Knowledge Extraction Framework☆194Oct 25, 2023Updated 2 years ago
- Data mapping framework for rust stuff☆51Mar 25, 2026Updated 2 weeks ago
- evaluation suite for testing automatic grammatical error corrections☆39Jun 12, 2017Updated 8 years ago
- Contains documentation on suggested API design and best practices.☆14Mar 16, 2017Updated 9 years ago
- Lehigh University Benchmark (LUBM).☆10Apr 22, 2020Updated 5 years ago
- Islandora Solr Search module☆24Jul 28, 2025Updated 8 months ago
- OCRopus model for Gothic print (Fraktur)☆19Feb 16, 2020Updated 6 years ago
- Implementation of QA Networks☆10Jul 14, 2016Updated 9 years ago
- TRAPPIST: Totally Rad Analysis Pipelines Python Informatics Super Tool (actually, a python-based genomics analysis toolbox)☆14Jan 25, 2018Updated 8 years ago
- Code for SaGe subword tokenizer (EACL 2023)☆28Nov 30, 2024Updated last year
- 🌸 Train floret vectors☆18May 4, 2023Updated 2 years ago
- rustupolis - Tuple Space for Rust.☆11Mar 31, 2026Updated last week
- Semantic File Inspector ‒ RDF-based metadata extraction and semantic search☆19Mar 19, 2025Updated last year
- Spelling corrector in python☆29May 26, 2020Updated 5 years ago
- A small HTTP API for SyntaxNet☆19Apr 7, 2019Updated 7 years ago