commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆39Updated last week
Alternatives and similar repositories for web-languages:
Users who are interested in web-languages are comparing it to the repositories listed below
- Repo to hold code and track issues for the collection of permissively licensed data☆23Updated 2 weeks ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023☆127Updated 5 months ago
- Libraries, Archives and Museums (LAM)☆82Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023)☆24Updated 4 months ago
- LTG-Bert☆32Updated last year
- Pre-train Static Word Embeddings☆56Updated 2 weeks ago
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆45Updated 2 weeks ago
- ☆67Updated last year
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆58Updated 8 months ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023☆100Updated last year
- A BERT-based application for reusable text classification at scale☆38Updated last year
- ☆46Updated 2 months ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset.☆93Updated 2 years ago
- BPE modification that removes intermediate tokens during tokenizer training.☆25Updated 5 months ago
- Library for fast text representation and classification.☆28Updated last year
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax.☆59Updated 11 months ago
- An introduction to LLM Sampling☆77Updated 4 months ago
- Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.☆31Updated last year
- The pipeline for the OSCAR corpus☆168Updated last year
- minimal pytorch implementation of bm25 (with sparse tensors)☆101Updated last year
- ParaNames: A multilingual resource for parallel names☆31Updated 11 months ago
- 💫 SpaCy wrapper for ConceptNet 💫☆92Updated last year
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy☆63Updated last year
- Small python package to measure OCR quality and other related metrics.☆21Updated last year
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial)☆17Updated last year
- Generalist and Lightweight Model for Text Classification☆123Updated 2 weeks ago
- ☆41Updated 2 months ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2…☆67Updated 2 years ago
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"☆30Updated 6 months ago
- ☆89Updated 4 months ago