commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆38Updated this week
Alternatives and similar repositories for web-languages:
Users who are interested in web-languages are comparing it to the repositories listed below.
- Repo to hold code and track issues for the collection of permissively licensed data☆23Updated this week
- ParaNames: A multilingual resource for parallel names☆31Updated 10 months ago
- Libraries, Archives and Museums (LAM)☆82Updated 2 years ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆56Updated 8 months ago
- Small python package to measure OCR quality and other related metrics.☆21Updated last year
- Library for fast text representation and classification.☆28Updated last year
- Code for SaGe subword tokenizer (EACL 2023)☆24Updated 4 months ago
- The robust European language model benchmark.☆94Updated this week
- A polite and user-friendly downloader for Common Crawl data☆36Updated last week
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)☆70Updated 11 months ago
- A survey of corpora for Germanic low-resource languages and dialects☆25Updated 3 months ago
- Targeted language identifier, based on FastText and Hunspell.☆34Updated last month
- The pipeline for the OSCAR corpus☆167Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023☆124Updated 4 months ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023☆100Updated 11 months ago
- AfroLID, a powerful neural toolkit for African language identification which covers 517 African languages.☆31Updated 3 weeks ago
- A BERT-based application for reusable text classification at scale☆38Updated last year
- LTG-Bert☆31Updated last year
- 🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024☆18Updated 5 months ago
- Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴)☆15Updated 3 years ago
- A list of awesome open source projects in the machine learning field, whose developers are mainly based in Germany☆42Updated 6 months ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03.☆23Updated 9 months ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2…☆67Updated 2 years ago
- ☆43Updated last month
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy.☆78Updated last year
- Tools for managing datasets for governance and training.☆83Updated 2 months ago
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"☆30Updated 5 months ago
- My NER Experiments with ModernBERT☆18Updated 2 months ago
- 🔢 Work with static vector models☆23Updated 2 months ago
- ☆21Updated 2 months ago