commoncrawl / web-languages
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
☆45 Updated this week
Alternatives and similar repositories for web-languages
Users who are interested in web-languages are comparing it to the libraries listed below.
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆138 Updated 2 weeks ago
- Pre-train Static Word Embeddings ☆79 Updated 3 weeks ago
- ☆96 Updated 6 months ago
- 🔢 Work with static vector models ☆28 Updated 2 months ago
- Small python package to measure OCR quality and other related metrics. ☆23 Updated last year
- ☆67 Updated last year
- A library for working with prompt templates locally or on the Hugging Face Hub. ☆46 Updated 3 months ago
- Code for SaGe subword tokenizer (EACL 2023) ☆25 Updated 6 months ago
- Extracts plain text, language identification and more metadata from WARC records ☆22 Updated 3 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ☆59 Updated 10 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆58 Updated last month
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆64 Updated last year
- ☆56 Updated 3 weeks ago
- Generalist and Lightweight Model for Text Classification ☆133 Updated last week
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆74 Updated 2 months ago
- Open information and community for machine translation ☆78 Updated last week
- ParaNames: A multilingual resource for parallel names ☆34 Updated last year
- Efficiently find the best-suited language model (LM) for your NLP task ☆124 Updated 2 weeks ago
- Libraries, Archives and Museums (LAM) ☆84 Updated 2 years ago
- The robust European language model benchmark. ☆104 Updated last week
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆109 Updated last year
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆11 Updated 2 years ago
- ☆22 Updated 4 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax. ☆59 Updated last year
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆57 Updated 2 years ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 Updated last year
- Fact checking baseline combining dense retrieval and textual entailment ☆29 Updated 5 months ago
- NLP with Rust for Python 🦀🐍 ☆62 Updated last month
- 💫 SpaCy wrapper for ConceptNet 💫 ☆94 Updated last year
- Crosslingual Question Answering for African Languages ☆30 Updated 8 months ago