A tokenizer and sentence splitter for German and English web and social media texts.
☆153 · Dec 9, 2024 · Updated last year
Alternatives and similar repositories for SoMaJo
Users who are interested in SoMaJo compare it to the libraries listed below.
- A part-of-speech tagger with support for domain adaptation and external resources. · ☆24 · Oct 26, 2022 · Updated 3 years ago
- A lemmatizer for German language text · ☆94 · Feb 7, 2023 · Updated 3 years ago
- GC4LM: A Colossal (Biased) language model for German · ☆13 · May 2, 2021 · Updated 4 years ago
- A Dataset of German Legal Documents for Named Entity Recognition · ☆174 · Oct 19, 2022 · Updated 3 years ago
- Combining encoder-based language models · ☆11 · Nov 11, 2021 · Updated 4 years ago
- Use spaCy for NLP and output to the FoLiA XML format. · ☆12 · Feb 27, 2024 · Updated 2 years ago
- BERT and ELECTRA models trained on Europeana Newspapers · ☆38 · Dec 14, 2021 · Updated 4 years ago
- Wikipedia text corpus for self-supervised NLP model training · ☆46 · Jul 17, 2022 · Updated 3 years ago
- Ten Thousand German News Articles Dataset for Topic Classification · ☆87 · Nov 7, 2022 · Updated 3 years ago
- Dataset and code for directed sentiment analysis in news text. · ☆16 · Jun 2, 2021 · Updated 4 years ago
- A minimal, pure Python library to interface with CoNLL-U format files. · ☆153 · Dec 5, 2025 · Updated 3 months ago
- ☆88 · Dec 5, 2021 · Updated 4 years ago
- Compound splitter for German language ("Komposita-Zerlegung") based on large dictionary combined with highly efficient multi-pattern stri… · ☆35 · Jul 7, 2022 · Updated 3 years ago
- This repository contains all manually labeled data from the GermEval-2018 shared task. · ☆29 · Sep 28, 2018 · Updated 7 years ago
- Compound splitter for German · ☆112 · Apr 5, 2020 · Updated 5 years ago
- GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format) · ☆37 · Jun 1, 2023 · Updated 2 years ago
- Named Entity Recognition data for Europeana Newspapers · ☆173 · Apr 5, 2023 · Updated 2 years ago
- Text tokenization and sentence segmentation (segtok v2) · ☆209 · Mar 12, 2022 · Updated 3 years ago
- ☆17 · Jul 15, 2016 · Updated 9 years ago
- Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at… · ☆21 · Aug 1, 2024 · Updated last year
- ☆17 · Feb 1, 2023 · Updated 3 years ago
- OCRopus model for Gothic print (Fraktur) · ☆19 · Feb 16, 2020 · Updated 6 years ago
- ☆20 · Jan 9, 2026 · Updated last month
- Python code to automatically produce a summary of a piece of text. · ☆12 · Sep 8, 2016 · Updated 9 years ago
- Tools for Optuna, MLflow and the integration of both. · ☆17 · May 28, 2023 · Updated 2 years ago
- GermaNER: Free Open German Named Entity Recognition Tool · ☆36 · Dec 16, 2023 · Updated 2 years ago
- 📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF · ☆39 · Mar 8, 2022 · Updated 3 years ago
- Automatic Detection of Potentially Idiomatic Expressions · ☆12 · Feb 19, 2021 · Updated 5 years ago
- ☆13 · Jan 25, 2026 · Updated last month
- Format conversion and graphical representation of [Universal Dependencies](http://universaldependencies.org) trees. · ☆12 · Sep 3, 2024 · Updated last year
- Simple CORPORA list crawler · ☆10 · Dec 2, 2016 · Updated 9 years ago
- Alignment and annotation for comparable documents. · ☆22 · Oct 16, 2018 · Updated 7 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies · ☆29 · Nov 18, 2025 · Updated 3 months ago
- A simple model for classifying papers by academic venue (AI/ML/ACL), given a title and abstract. Bare-metal PyTorch port of https://gith… · ☆12 · Mar 22, 2018 · Updated 7 years ago
- The UKWA Heritrix3 custom modules and Docker builder. · ☆11 · Dec 2, 2024 · Updated last year
- A Python implementation of a graph-based parser for Abstract Meaning Representation (AMR) · ☆11 · Feb 2, 2018 · Updated 8 years ago
- R package to interact with the Pushshift.io API · ☆10 · Aug 4, 2025 · Updated 7 months ago
- A database of climate change newspaper articles · ☆16 · Jan 31, 2026 · Updated last month
- Promoss Topic Modelling Toolbox · ☆11 · Jan 21, 2019 · Updated 7 years ago