DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
☆52 · Jun 12, 2020 · Updated 5 years ago
Alternatives and similar repositories for dkpro-c4corpus
Users who are interested in dkpro-c4corpus are comparing it to the libraries listed below.
- Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection ☆16 · Jan 2, 2019 · Updated 7 years ago
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition ☆23 · Nov 22, 2023 · Updated 2 years ago
- Ready-to-use examples of dkpro-core components and pipelines. ☆35 · Dec 16, 2023 · Updated 2 years ago
- Background materials for the article "Productivity Assessment of Neural Code Completion" ☆13 · Jul 11, 2023 · Updated 2 years ago
- Linked Data to Natural Language ☆11 · Jan 6, 2024 · Updated 2 years ago
- A platform for collecting, analyzing, and visualizing social media data. ☆13 · Dec 27, 2020 · Updated 5 years ago
- 🕸 YALC: Yet Another LOD Cloud (registry of Linked Open Datasets). ☆15 · Aug 21, 2023 · Updated 2 years ago
- Crude server returning data in turtle from analog, digital, and temperature sensors of an arduino ☆10 · Feb 24, 2021 · Updated 5 years ago
- Korean large emotion labeled dataset (EmoNSMC) ☆14 · Mar 5, 2020 · Updated 5 years ago
- Tower Parse: Low-Resource Dependency Parsing via Hierarchical Source Selection ☆15 · Aug 20, 2021 · Updated 4 years ago
- Common web archive utility code. ☆61 · Feb 6, 2026 · Updated 3 weeks ago
- SPARQL-LD: A SPARQL Extension for Fetching and Querying Linked Data ☆17 · Jul 3, 2023 · Updated 2 years ago
- Zero-Shot Translation implemented by Transformer ☆14 · Mar 24, 2023 · Updated 2 years ago
- Semantic File Inspector ‒ RDF-based metadata extraction and semantic search ☆19 · Mar 19, 2025 · Updated 11 months ago
- Collection of software components for natural language processing (NLP) based on the Apache UIMA framework. ☆202 · Jan 4, 2026 · Updated last month
- Generate nice CLI from a function signature. ☆18 · Apr 25, 2023 · Updated 2 years ago
- Applying Reinforcement Learning from Human Feedback to language models to teach them to write short story responses to writing prompts. ☆14 · May 5, 2022 · Updated 3 years ago
- Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR. ☆35 · May 25, 2023 · Updated 2 years ago
- A small HTTP API for SyntaxNet ☆19 · Apr 7, 2019 · Updated 6 years ago
- ML models often mispredict, and it is hard to tell when and why. We present a data mining based approach to discover whether there is a c… ☆17 · Jun 6, 2022 · Updated 3 years ago
- evaluation suite for testing automatic grammatical error corrections ☆39 · Jun 12, 2017 · Updated 8 years ago
- A PMTA-based sink application that does opens, clicks, bounces, OOBs and FBLs ☆18 · Sep 3, 2025 · Updated 5 months ago
- A web interface to understand language-specific BERT-models ☆18 · Apr 16, 2024 · Updated last year
- A cluster implementation of simhash near-duplicate detection ☆32 · Mar 11, 2015 · Updated 10 years ago
- 🌸 Train floret vectors ☆18 · May 4, 2023 · Updated 2 years ago
- ☆20 · Jun 29, 2017 · Updated 8 years ago
- OCRopus model for Gothic print (Fraktur) ☆19 · Feb 16, 2020 · Updated 6 years ago
- Convert RDF data to relational databases ☆18 · Feb 26, 2018 · Updated 8 years ago
- A pythonic wrapper for Stanford CoreNLP. ☆107 · Oct 19, 2025 · Updated 4 months ago
- ☆25 · Feb 20, 2026 · Updated last week
- Namuwiki dataset segmented into sentences. Download it from Releases or via tfds-korean. ☆19 · Jun 16, 2021 · Updated 4 years ago
- Real-time query spark and visualise it as graph. ☆24 · Oct 4, 2017 · Updated 8 years ago
- Astrea is a software that generates SHACL shapes for one or more OWL ontologies using a set of SPARQL queries that hold the equivalence b… ☆17 · Jan 11, 2023 · Updated 3 years ago
- ☆87 · Jun 2, 2022 · Updated 3 years ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives. ☆153 · Dec 5, 2025 · Updated 2 months ago
- Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images. ☆24 · Sep 24, 2015 · Updated 10 years ago
- Repo originally for a talk at Normconf ☆21 · Jan 12, 2023 · Updated 3 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Oct 6, 2017 · Updated 8 years ago
- MeCab model trained with OpenKorPos. ☆23 · Jun 19, 2022 · Updated 3 years ago