commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆28 Updated this week
Alternatives and similar repositories for web-languages:
Users who are interested in web-languages compare it to the repositories listed below.
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆69 Updated 8 months ago
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆110 Updated last month
- Libraries, Archives and Museums (LAM) ☆82 Updated 2 years ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆50 Updated this week
- Targeted language identifier, based on FastText and Hunspell. ☆33 Updated 2 months ago
- Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴) ☆15 Updated 3 years ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 Updated 10 months ago
- ParaNames: A multilingual resource for parallel names ☆30 Updated 8 months ago
- Resource and Tool for Writing System Identification -- LREC 2024 ☆13 Updated 7 months ago
- A BERT-based application for reusable text classification at scale ☆37 Updated last year
- Compass-aligned Distributional Embeddings. Align embeddings from different corpora ☆39 Updated 2 years ago
- Repo to hold code and track issues for the collection of permissively licensed data ☆22 Updated last month
- Python Finite-State Toolkit ☆47 Updated last week
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆30 Updated last year
- A list of awesome open source projects in the machine learning field, whose developers are mainly based in Germany ☆42 Updated 4 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax. ☆58 Updated 8 months ago
- A Python package to run inference with HuggingFace language and vision-language checkpoints wrapping many convenient features. ☆25 Updated 4 months ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆156 Updated 2 years ago
- A Directory of Online Newspaper Sources for 70+ Languages ☆32 Updated 3 years ago
- Aksharamukha Python Library ☆44 Updated 3 months ago
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata ☆93 Updated last year
- A Python package to compute HONEST, a score to measure hurtful sentence completions in language models. Published at NAACL 2021. ☆21 Updated 2 years ago
- Evaluation of language models on mono- or multilingual tasks. ☆76 Updated this week
- spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing you to i… ☆46 Updated 9 months ago
- ☆54 Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆66 Updated last year
- Tools for compiling corpora from Common Crawl ☆13 Updated last month
- PassivePy: A Tool to Automatically Identify Passive Voice in Big Text Data ☆18 Updated 10 months ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03. ☆22 Updated 6 months ago