commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆35 · Updated last week
Alternatives and similar repositories for web-languages:
Users who are interested in web-languages are comparing it to the repositories listed below:
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆118 · Updated 3 months ago
- Repo to hold code and track issues for the collection of permissively licensed data ☆23 · Updated 2 months ago
- Libraries, Archives and Museums (LAM) ☆82 · Updated 2 years ago
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. ☆56 · Updated 7 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆31 · Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆70 · Updated 10 months ago
- A python package to run inference with HuggingFace language and vision-language checkpoints wrapping many convenient features. ☆26 · Updated 5 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx like syntax. ☆59 · Updated 10 months ago
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ☆76 · Updated last year
- ☆83 · Updated 2 months ago
- ☆67 · Updated 11 months ago
- Pre-train Static Word Embeddings ☆47 · Updated last month
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03. ☆23 · Updated 8 months ago
- ParaNames: A multilingual resource for parallel names ☆30 · Updated 9 months ago
- A BERT-based application for reusable text classification at scale ☆38 · Updated last year
- spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing you to i… ☆46 · Updated 10 months ago
- The pipeline for the OSCAR corpus ☆166 · Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆34 · Updated 2 weeks ago
- Python Finite-State Toolkit ☆51 · Updated this week
- Generalist and Lightweight Model for Text Classification ☆87 · Updated this week
- A spaCy custom component that extracts and normalizes temporal expressions ☆54 · Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆24 · Updated 3 months ago
- Library for fast text representation and classification. ☆28 · Updated last year
- One-stop shop for running and fine-tuning transformer-based language models for retrieval ☆47 · Updated this week
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆77 · Updated 5 months ago
- 🧪 Cutting-edge experimental spaCy components and features ☆96 · Updated 10 months ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆99 · Updated 10 months ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆49 · Updated last month
- The robust European language model benchmark. ☆81 · Updated this week