cisnlp / GlotWeb
GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
★15 · Updated last month
Alternatives and similar repositories for GlotWeb
Users who are interested in GlotWeb are comparing it to the repositories listed below.
- Resource and Tool for Writing System Identification -- LREC 2024 ★19 · Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★74 · Updated 5 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ★32 · Updated 6 months ago
- NTREX -- News Test References for MT Evaluation ★85 · Updated last year
- A survey of corpora for Germanic low-resource languages and dialects ★25 · Updated 9 months ago
- These are lists for a variety of languages containing words that are distinctive to each language. ★38 · Updated 3 years ago
- OpusFilter - Parallel corpus processing toolkit ★109 · Updated last month
- ParaNames: A multilingual resource for parallel names ★36 · Updated last year
- GC4LM: A Colossal (Biased) language model for German ★13 · Updated 4 years ago
- A tiny BERT for low-resource monolingual models ★31 · Updated 11 months ago
- A python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer). ★14 · Updated last year
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ★51 · Updated 2 months ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ★104 · Updated last year
- Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ★11 · Updated 5 months ago
- Curriculum training ★18 · Updated 2 months ago
- Zero-shot Transfer Learning from English to Arabic ★30 · Updated 3 years ago
- Code and data for the IWSLT 2022 shared task on Formality Control for SLT ★21 · Updated 2 years ago
- Targeted language identifier, based on FastText and Hunspell. ★37 · Updated 2 weeks ago
- A library of translation-based text similarity measures ★25 · Updated last year
- GLADIS: A General and Large Acronym Disambiguation Benchmark (EACL 23) ★18 · Updated last year
- Implementation of "SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages" paper, accepted to E… ★24 · Updated 2 years ago
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio… ★44 · Updated last year
- Seed Machine Translation Data ★33 · Updated 10 months ago
- Statistics on multilingual datasets ★17 · Updated 3 years ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ★68 · Updated 2 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ★30 · Updated 3 years ago
- A simple neural truecaser written in pytorch and allennlp. ★33 · Updated last year
- Easier Automatic Sentence Simplification Evaluation ★161 · Updated last year
- Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation" ★20 · Updated last year
- Extracts plain text, language identification and more metadata from WARC records ★23 · Updated 2 weeks ago