tsproisl/SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.

☆153

Alternatives and similar repositories for SoMaJo

Users who are interested in SoMaJo compare it with the libraries listed below.

tsproisl / SoMeWeTa
A part-of-speech tagger with support for domain adaptation and external resources.
☆24 · Oct 26, 2022 · Updated 3 years ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
dbmdz / berts
DBMDZ BERT, DistilBERT, ELECTRA, GPT-2 and ConvBERT models
☆158 · Dec 6, 2022 · Updated 3 years ago
LEL-A / GerAlpacaDataCleaned
German Alpaca Dataset (Cleaned + Translated)
☆26 · Apr 6, 2023 · Updated 3 years ago
stefan-it / europeana-bert
BERT and ELECTRA models trained on Europeana Newspapers
☆39 · Dec 14, 2021 · Updated 4 years ago
adbar / German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
☆527 · Oct 30, 2024 · Updated last year
mocdaniel / docker-cqpweb
A containerized all-in-one solution for CQPWeb
☆18 · Jan 22, 2023 · Updated 3 years ago
LoicGrobol / ginger
Format conversion and graphical representation of [Universal Dependencies](http://universaldependencies.org) trees.
☆12 · Sep 3, 2024 · Updated last year
GarfieldLyu / OCR_POST_DE
OCR post correction for old German corpus
☆20 · Aug 29, 2022 · Updated 3 years ago
stefan-it / german-gpt2
German GPT-2 model
☆32 · Aug 17, 2021 · Updated 4 years ago
elenanereiss / Legal-Entity-Recognition
A Dataset of German Legal Documents for Named Entity Recognition
☆179 · Oct 19, 2022 · Updated 3 years ago
repodiac / german_compound_splitter
Compound splitter for German language ("Komposita-Zerlegung") based on large dictionary combined with highly efficient multi-pattern stri…
☆36 · Jul 7, 2022 · Updated 4 years ago
ausgerechnet / cwb-ccc
Python wrapper for the CWB to extract concordances and score frequency lists
☆22 · May 11, 2026 · Updated 2 months ago
PolMine / GermaParlTEI
GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format)
☆39 · Jun 1, 2023 · Updated 3 years ago
German-NLP-Group / german-transformer-training
Plan and train German transformer models.
☆23 · Feb 22, 2021 · Updated 5 years ago
tblock / 10kGNAD
Ten Thousand German News Articles Dataset for Topic Classification
☆88 · Nov 7, 2022 · Updated 3 years ago
GermanT5 / wikipedia2corpus
Wikipedia text corpus for self-supervised NLP model training
☆47 · Jul 17, 2022 · Updated 4 years ago
dtuggener / CharSplit
Compound splitter for German
☆114 · Apr 5, 2020 · Updated 6 years ago
EdCo95 / text-summarization
Python code to automatically produce a summary of a piece of text.
☆11 · Sep 8, 2016 · Updated 9 years ago
telekom / HPOflow
Tools for Optuna, MLflow and the integration of both.
☆17 · May 28, 2023 · Updated 3 years ago
pyconll / pyconll
A minimal, pure Python library to interface with CoNLL-U format files.
☆155 · Jul 6, 2026 · Updated 3 weeks ago
LSX-UniWue / SuperGLEBer
German Language Understanding Evaluation Benchmark @NAACL24
☆22 · Dec 11, 2025 · Updated 7 months ago
pd3f / dehyphen
📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF
☆39 · Mar 8, 2022 · Updated 4 years ago
dataforgoodfr / bechdelai
Automating the Bechdel test and its variants for feminine representation in movies with AI
☆37 · Nov 22, 2023 · Updated 2 years ago
proycon / spacy2folia
Use spaCy for NLP and output to the FoLiA XML format.
☆12 · Feb 27, 2024 · Updated 2 years ago
t-systems-on-site-services-gmbh / german-elmo-model
This is a German ELMo deep contextualized word representation. It is trained on a special German Wikipedia Text Corpus.
☆28 · Dec 15, 2019 · Updated 6 years ago
t-systems-on-site-services-gmbh / german-wikipedia-text-corpus
This is a German text corpus from Wikipedia. It is cleaned, preprocessed and sentence-split. Its purpose is to train NLP embeddings l…
☆23 · Feb 22, 2022 · Updated 4 years ago
oliverguhr / german-sentiment
A data set and model for German sentiment classification.
☆70 · Jul 17, 2026 · Updated last week
UniversalDependencies / UD_German-GSD
☆20 · May 6, 2026 · Updated 2 months ago
tudarmstadt-lt / GermaNER
GermaNER: Free Open German Named Entity Recognition Tool
☆38 · Dec 16, 2023 · Updated 2 years ago
valentinhofmann / flota
☆18 · Feb 1, 2023 · Updated 3 years ago
groceryheist / misclassificationmodels
☆12 · Jun 5, 2025 · Updated last year
EuropeanaNewspapers / ner-corpora
Named Entity Recognition data for Europeana Newspapers
☆173 · Apr 5, 2023 · Updated 3 years ago
uds-lsv / GermEval-2018-Data
This repository contains all manually labeled data from the GermEval-2018 shared task.
☆29 · Sep 28, 2018 · Updated 7 years ago
ccs-amsterdam / annotinder-r
R package for working with the CCS Annotator
☆13 · Mar 14, 2024 · Updated 2 years ago
hucsmn / suffix_array
Suffix array construction and searching algorithms for in-memory binary data.
☆13 · Sep 10, 2022 · Updated 3 years ago
uhh-lt / targer
A web application for tagging and retrieval of arguments in text
☆30 · May 1, 2023 · Updated 3 years ago
ghpaetzold / massalign
Alignment and annotation for comparable documents.
☆22 · Oct 16, 2018 · Updated 7 years ago
LeonieWeissweiler / CISTEM
Stemmer for German
☆45 · Apr 29, 2022 · Updated 4 years ago