commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆42 · Updated last month
Alternatives and similar repositories for web-languages
Users who are interested in web-languages are comparing it to the repositories listed below.
- Repo to hold code and track issues for the collection of permissively licensed data ☆26 · Updated this week
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆135 · Updated 6 months ago
- Libraries, Archives and Museums (LAM) ☆84 · Updated 2 years ago
- ParaNames: A multilingual resource for parallel names ☆32 · Updated last year
- The robust European language model benchmark. ☆104 · Updated this week
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆101 · Updated last year
- Crosslingual Question Answering for African Languages ☆30 · Updated 8 months ago
- ☆67 · Updated last year
- Small python package to measure OCR quality and other related metrics. ☆22 · Updated last year
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆31 · Updated 2 months ago
- Work with static vector models ☆28 · Updated last month
- ☆94 · Updated 5 months ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆67 · Updated 2 years ago
- Code for the MTEB Arena ☆19 · Updated 8 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ☆59 · Updated 10 months ago
- The pipeline for the OSCAR corpus ☆167 · Updated last year
- A BERT-based application for reusable text classification at scale ☆38 · Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆74 · Updated 2 months ago
- Pre-train Static Word Embeddings ☆70 · Updated this week
- Targeted language identifier, based on FastText and Hunspell. ☆34 · Updated 3 months ago
- The Open Parallel Corpus ☆71 · Updated 2 months ago
- MAFAND-MT ☆55 · Updated 10 months ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆72 · Updated last year
- A survey of corpora for Germanic low-resource languages and dialects ☆25 · Updated 5 months ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆81 · Updated 8 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆54 · Updated last month
- Library for fast text representation and classification. ☆28 · Updated last year
- Code for SaGe subword tokenizer (EACL 2023) ☆25 · Updated 6 months ago
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ☆79 · Updated last year
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆11 · Updated 2 years ago