DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
☆52Jun 12, 2020Updated 5 years ago
Alternatives and similar repositories for dkpro-c4corpus
Users who are interested in dkpro-c4corpus are comparing it to the libraries listed below
- UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.☆35Dec 16, 2022Updated 3 years ago
- Ready-to-use examples of dkpro-core components and pipelines.☆35Dec 16, 2023Updated 2 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated last month
- Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.☆202Mar 1, 2026Updated 2 weeks ago
- A platform for collecting, analyzing, and visualizing social media data.☆13Dec 27, 2020Updated 5 years ago
- Common web archive utility code.☆63Mar 2, 2026Updated 2 weeks ago
- Zero-Shot Translation implemented by Transformer☆14Mar 24, 2023Updated 2 years ago
- Heuristic based boilerplate removal tool☆814Feb 25, 2025Updated last year
- Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection☆16Jan 2, 2019Updated 7 years ago
- Enrycher API☆13Apr 19, 2012Updated 13 years ago
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition☆23Nov 22, 2023Updated 2 years ago
- Search back-end for dependency tree search. See the docs at https://fginter.github.io/dep_search/☆17Apr 11, 2018Updated 7 years ago
- Korean large emotion labeled dataset (EmoNSMC)☆14Mar 5, 2020Updated 6 years ago
- ☆15Oct 4, 2024Updated last year
- 🕸 YALC: Yet Another LOD Cloud (registry of Linked Open Datasets).☆15Aug 21, 2023Updated 2 years ago
- Python code to automatically produce a summary of a piece of text.☆12Sep 8, 2016Updated 9 years ago
- Configuration Space Exploration Framework☆17Oct 13, 2020Updated 5 years ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org☆12May 17, 2014Updated 11 years ago
- Namuwiki dataset split into sentences. Download it from Releases or via tfds-korean.☆19Jun 16, 2021Updated 4 years ago
- Python wrapper for UMLS REST API☆10Dec 17, 2018Updated 7 years ago
- Master thesis: Exploring bias in German NLG (GPT-3 & GerPT-2). Applies regard classification and bias mitigation triggers.☆16Sep 25, 2024Updated last year
- Web archiving utility library☆11Mar 11, 2026Updated last week
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- Code for the CIKM 2013 paper "Discovering Coherent Topics Using General Knowledge"☆11Jul 14, 2014Updated 11 years ago
- Generate nice CLI from a function signature.☆18Apr 25, 2023Updated 2 years ago
- Events and Situations Ontology☆14Apr 20, 2018Updated 7 years ago
- Korean text sentiment analysis using Naver movie review data☆12Aug 22, 2018Updated 7 years ago
- Simple word-to-frequency mappings for the German language based on text corpora and using the CISTEM stemmer.☆14Apr 3, 2021Updated 4 years ago
- Use spaCy for NLP and output to the FoLiA XML format.☆12Feb 27, 2024Updated 2 years ago
- Federated Knowledge Extraction Framework☆193Oct 25, 2023Updated 2 years ago
- evaluation suite for testing automatic grammatical error corrections☆39Jun 12, 2017Updated 8 years ago
- Data mapping framework for rust stuff☆49Updated this week
- Lehigh University Benchmark (LUBM).☆10Apr 22, 2020Updated 5 years ago
- The Danish Gigaword project☆16Jan 25, 2021Updated 5 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi…☆35May 24, 2024Updated last year
- OCRopus model for Gothic print (Fraktur)☆19Feb 16, 2020Updated 6 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)☆25Oct 9, 2017Updated 8 years ago
- The pipeline for the OSCAR corpus☆176Nov 9, 2025Updated 4 months ago