DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
☆53Jun 12, 2020Updated 6 years ago
Alternatives and similar repositories for dkpro-c4corpus
Users that are interested in dkpro-c4corpus are comparing it to the libraries listed below.
- code to remove "noise" from hOCR output of Tesseract OCR.☆14Oct 24, 2016Updated 9 years ago
- UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.☆38May 9, 2026Updated last month
- Ready-to-use examples of dkpro-core components and pipelines.☆34Dec 16, 2023Updated 2 years ago
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Feb 15, 2017Updated 9 years ago
- Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.☆204Jun 16, 2026Updated 2 weeks ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Updated this week
- Applying Reinforcement Learning from Human Feedback to language models to teach them to write short story responses to writing prompts.☆13May 5, 2022Updated 4 years ago
- Common web archive utility code.☆65Updated this week
- Hands-on Training for Recommender Systems☆11Jul 27, 2021Updated 4 years ago
- Heuristic based boilerplate removal tool☆819Feb 25, 2025Updated last year
- Jupyter notebook tutorials about fundamental machine learning algorithms☆10Aug 10, 2022Updated 3 years ago
- Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection☆16Jan 2, 2019Updated 7 years ago
- official diybookscanner repository☆39May 11, 2014Updated 12 years ago
- Code for the article "Shortcutted Commonsense: Data Spuriousness in Deep Learning of Commonsense Reasoning", Outstanding Paper at EMNLP20…☆10Nov 7, 2021Updated 4 years ago
- Enrycher API☆13Apr 19, 2012Updated 14 years ago
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition☆23Nov 22, 2023Updated 2 years ago
- Design algorithms for cross document coreference resolution☆17Dec 27, 2013Updated 12 years ago
- Korean large emotion labeled dataset (EmoNSMC)☆14Mar 5, 2020Updated 6 years ago
- 🕸 YALC: Yet Another LOD Cloud (registry of Linked Open Datasets).☆15Aug 21, 2023Updated 2 years ago
- ☆15Oct 4, 2024Updated last year
- Python code to automatically produce a summary of a piece of text.☆11Sep 8, 2016Updated 9 years ago
- A cluster implementation of simhash near-duplicate detection☆32Mar 11, 2015Updated 11 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- Configuration Space Exploration Framework☆16Oct 13, 2020Updated 5 years ago
- The weights for the embedding layer of Scandinavian ULMFiT language models☆32Dec 5, 2019Updated 6 years ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org☆12May 17, 2014Updated 12 years ago
- Python wrapper for UMLS REST API☆10Dec 17, 2018Updated 7 years ago
- Namuwiki dataset segmented into sentences. Download it from Releases or via tfds-korean.☆19Jun 16, 2021Updated 5 years ago
- Code for co-training large language models (e.g. T0) with smaller ones (e.g. BERT) to boost few-shot performance☆16Sep 23, 2022Updated 3 years ago
- trovilo collects and prepares files from Kubernetes ConfigMaps for Prometheus & friends☆15May 21, 2019Updated 7 years ago
- Ukrainian ELECTRA model☆12Mar 11, 2023Updated 3 years ago
- Korean Nested Named Entity Corpus☆20May 13, 2023Updated 3 years ago
- Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe…☆170Dec 15, 2021Updated 4 years ago
- Code for the CIKM 2013 paper "Discovering Coherent Topics Using General Knowledge"☆11Jul 14, 2014Updated 11 years ago
- ☆10Jun 10, 2016Updated 10 years ago
- Generate nice CLI from a function signature.☆19Apr 25, 2023Updated 3 years ago
- Events and Situations Ontology☆14Apr 20, 2018Updated 8 years ago
- Korean text sentiment analysis using Naver movie review data☆12Aug 22, 2018Updated 7 years ago