Brand24-AI / mms_benchmark
The most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 datasets, manually selected from over 350 reported in the scientific literature on the basis of strict quality criteria, and covers 27 languages.
☆15 · Updated 10 months ago
Related projects:
- Polish RoBERTa model trained on Polish literature, Wikipedia, and Oscar. The major assumption is that quality text will give a good mode… ☆33 · Updated 3 years ago
- ITALIC: An ITALian Intent Classification Dataset ☆11 · Updated 9 months ago
- Repository for SLURP paper ☆96 · Updated 2 years ago
- A merged version of multiple open-source German speech datasets. ☆30 · Updated 4 months ago
- Bicleaner fork that uses neural networks ☆37 · Updated last month
- Various speech datasets made available to the public ☆88 · Updated 2 weeks ago
- Generating artificial disfluencies from fluent text easily and promptly ☆10 · Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆65 · Updated last year
- Python module to clean and transliterate (i.e. normalize) German text including abbreviations, numbers, timestamps etc. It can be used to… ☆29 · Updated 3 years ago
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish ☆13 · Updated 9 months ago
- Linguistic processing for Common Voice ☆50 · Updated 8 months ago
- Materials for "IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation" 🇮🇹 ☆30 · Updated 3 months ago
- Small repo describing how to use Hugging Face's Wav2Vec2 with PyCTCDecode ☆109 · Updated 2 years ago
- Embeddings: State-of-the-art Text Representations for Natural Language Processing tasks, an initial version of library focus on the Polis… ☆36 · Updated 9 months ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆45 · Updated last week
- GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆85 · Updated 2 months ago
- A repository containing the code for speech translation papers. ☆21 · Updated 2 years ago
- An open-source Python package for Danish speech recognition ☆28 · Updated last year
- Python Finite-State Toolkit ☆39 · Updated last month
- A French sequence-to-sequence pretrained model ☆57 · Updated 2 years ago
- A guide to building language technology in new languages. ☆57 · Updated 2 years ago
- Parse and convert numbers written in French, English or Spanish into their digit representation. ☆100 · Updated last month
- This repository contains a demonstrative implementation for pooling-based models, e.g., DeepPyramidion complementing our paper "Sparsifyi… ☆14 · Updated 2 years ago
- SHAS: Approaching optimal Segmentation for End-to-End Speech Translation ☆37 · Updated last year
- Open-source AI benchmarking toolkit for speech-to-text services ☆54 · Updated 5 months ago
- Text utilities, including beam search decoding, tokenizing, and more, built for use in Flashlight. ☆64 · Updated 4 months ago
- Evaluation of Sentence Representations in Polish ☆21 · Updated last year
- Universal Romanizer that can convert any Unicode script to Roman (Latin) script ☆145 · Updated last month