mayhewsw/multilingual-data-stats

Statistics on multilingual datasets

☆17

Alternatives and similar repositories for multilingual-data-stats

Users who are interested in multilingual-data-stats are comparing it to the libraries listed below.

murali1996 / CodemixedNLP
CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Switching
☆18 • Mar 29, 2021 • Updated 5 years ago
pentagonalize / Transformer-Cookbook
☆18 • Feb 4, 2025 • Updated last year
warnikchow / prosem
Prosody-semantics Interface in Seoul Korean
☆12 • Oct 9, 2020 • Updated 5 years ago
carina-kauf / better-mlm-scoring
[Kauf & Ivanova, ACL 2023] A Better Way to Do Masked Language Model Scoring
☆12 • Dec 1, 2023 • Updated 2 years ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 • Apr 6, 2018 • Updated 8 years ago
mbollmann / sonnet-finder
Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.
☆13 • Jan 5, 2023 • Updated 3 years ago
kanishkamisra / wugs-and-daxes
Collection of academic works in natural language processing, computational linguistics, and computational cognitive science that study th…
☆22 • Mar 20, 2024 • Updated 2 years ago
MeLeLBGU / tokenizers_intrinsic_benchmark
Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"
☆13 • Nov 26, 2024 • Updated last year
ShaojieJiang / tldr
Source code repo for paper "TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation"
☆10 • Aug 11, 2023 • Updated 2 years ago
taucompling / mdlrnn
Minimum Description Length Recurrent Neural Networks
☆19 • Jun 9, 2023 • Updated 3 years ago
nmrksic / attract-repel
The Attract-Repel algorithm presented in (Mrkšić et al., TACL 2017), with accompanying resources.
☆63 • Sep 22, 2017 • Updated 8 years ago
magizbox / scraper
Scraper
☆13 • Dec 21, 2018 • Updated 7 years ago
gautierdag / tokenizer-bench
Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation"
☆22 • Feb 14, 2024 • Updated 2 years ago
UIUCLearningLanguageLab / AOCHILDES
Python API for loading language data from American-English CHILDES database
☆18 • Aug 14, 2022 • Updated 3 years ago
google-research-datasets / TF-IDF-IIF-top100-wordlists
These are lists for a variety of languages containing words that are distinctive to each language.
☆42 • Apr 5, 2022 • Updated 4 years ago
pchizhov / picky_bpe
BPE modification that implements removing of the intermediate tokens during tokenizer training.
☆27 • Nov 25, 2024 • Updated last year
MicrosoftTranslator / NTREX
NTREX -- News Test References for MT Evaluation
☆87 • Jun 5, 2024 • Updated 2 years ago
zouharvi / pearmut
Platform for Evaluating and Reviewing of Multilingual Tasks
☆32 • Updated this week
JunjieHu / amber
Explicit Alignment Objectives for Multilingual Bidirectional Encoders
☆14 • Apr 14, 2021 • Updated 5 years ago
rctatman / data-prep-minitoolkit
Three little Python scripts for data preparation: remove commas, add commas, concatenate files
☆16 • Jul 26, 2017 • Updated 8 years ago
wellecks / mgs
MLE-Guided Parameter Search (AAAI 2021)
☆12 • Sep 16, 2021 • Updated 4 years ago
SAP / software-documentation-data-set-for-machine-translation
A parallel evaluation data set of SAP software documentation with document structure annotation
☆15 • Jun 12, 2026 • Updated last month
rguthrie3 / MorphologicalPriorsForWordEmbeddings
Code for EMNLP 2016 paper: Morphological Priors for Probabilistic Word Embeddings
☆54 • Dec 6, 2016 • Updated 9 years ago
baoy-nlp / DSS-VAE-pytorch
Generating Sentences from Disentangled Syntactic and Semantic Spaces
☆11 • Jun 24, 2019 • Updated 7 years ago
ARBML / dar
A simple semi-supervised approach for creating huggingface data script loaders and upload to the hub.
☆11 • Jun 23, 2024 • Updated 2 years ago
dustinvtran / blog
All code and content for my blog.
☆15 • Sep 23, 2018 • Updated 7 years ago
UBC-NLP / afrolid
AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages.
☆39 • Feb 5, 2026 • Updated 5 months ago
bvanaken / clinical-assertion-data
Dataset for the NLPMC @ NAACL 2021 Paper: Assertion Detection in Clinical Notes: Medical Language Models to the Rescue?
☆16 • Sep 28, 2021 • Updated 4 years ago
jungokasai / beam_with_patience
☆46 • Apr 13, 2022 • Updated 4 years ago
yuvalpinter / nytwit
New York Times Word Innovation Types dataset
☆21 • Dec 1, 2020 • Updated 5 years ago
cindyxinyiwang / SDE
Source code for the paper "Multilingual Neural Machine Translation with Soft Decoupled Encoding"
☆29 • Jun 2, 2021 • Updated 5 years ago
zlsh80826 / MSMARCO
Machine Comprehension Train on MSMARCO with S-NET Extraction Modification
☆31 • Feb 10, 2023 • Updated 3 years ago
riiswa / covid19-attestation-deplacement-form
Online form that generates a travel exemption certificate (attestation de déplacement dérogatoire)
☆10 • Mar 18, 2020 • Updated 6 years ago
cpllab / syntactic-generalization
Code and data for "A Systematic Assessment of Syntactic Generalization in Neural Language Models"
☆31 • Jun 18, 2021 • Updated 5 years ago
SapienzaNLP / conception
Code and experiments for the COLING2020 paper "Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations".
☆11 • Dec 9, 2020 • Updated 5 years ago
jungwhank / transformer-pl
Transformer Implementation for NMT using PyTorch Lightning (Korean to English)
☆10 • Oct 19, 2020 • Updated 5 years ago
salgado / music-search
Code from blog 'Searching by Music: Leveraging Vector Search for Music Information Retrieval'
☆16 • Nov 16, 2023 • Updated 2 years ago
anirudh9119 / walkback_nips17
Variational Walkback, NIPS'17
☆28 • Oct 18, 2017 • Updated 8 years ago
Riccorl / sense-embedding
BabelNet (and WordNet) sense embedding trained with Word2Vec and FastText
☆10 • Sep 3, 2019 • Updated 6 years ago