[NeurIPS 2024] GlotCC Dataset and Pipeline
★20 · Apr 6, 2025 · Updated last year
Alternatives and similar repositories for GlotCC
Users that are interested in GlotCC are comparing it to the libraries listed below.
- [WWW 2026] GlotWeb: Web Indexing for Minority Languages ★17 · Apr 14, 2026 · Updated 2 months ago
- (no description) ★10 · Oct 2, 2024 · Updated last year
- [NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining ★18 · Nov 26, 2023 · Updated 2 years ago
- Repository for "Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages" ★15 · Oct 4, 2024 · Updated last year
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks ★12 · Nov 9, 2021 · Updated 4 years ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ★18 · Nov 4, 2025 · Updated 7 months ago
- The SETimes.HR+ Croatian dependency treebank ★16 · Dec 27, 2016 · Updated 9 years ago
- Gather pagegraph data from all over the internet ★32 · Jun 3, 2026 · Updated 2 weeks ago
- A RAG that can scale ★11 · May 28, 2024 · Updated 2 years ago
- Klexikon: A German Dataset for Joint Summarization and Simplification ★16 · Oct 5, 2022 · Updated 3 years ago
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba ★38 · Oct 16, 2025 · Updated 8 months ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… ★27 · Feb 16, 2026 · Updated 4 months ago
- (no description) ★17 · Jan 5, 2023 · Updated 3 years ago
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval" ★40 · Sep 20, 2025 · Updated 8 months ago
- Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs ★13 · Feb 13, 2024 · Updated 2 years ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ★17 · Mar 20, 2024 · Updated 2 years ago
- (no description) ★45 · Feb 11, 2026 · Updated 4 months ago
- (no description) ★23 · Aug 13, 2018 · Updated 7 years ago
- Sparse Embedding Compression for Scalable Retrieval in Recommender Systems ★36 · Nov 21, 2025 · Updated 6 months ago
- Semantically Search Emojis From the Command Line! ★13 · Nov 26, 2023 · Updated 2 years ago
- A missing piece of the Python multitask (both threads and processes) API: An extension that supports stateful worker pools & size-aware i… ★29 · Mar 8, 2026 · Updated 3 months ago
- (no description) ★25 · Apr 28, 2020 · Updated 6 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ★23 · Jun 30, 2025 · Updated 11 months ago
- Fine-tune OpenAI models for text classification, question answering, and more ★17 · May 1, 2023 · Updated 3 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ★30 · Apr 2, 2022 · Updated 4 years ago
- Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and visi… ★32 · Feb 7, 2025 · Updated last year
- Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family. ★38 · Apr 1, 2025 · Updated last year
- Tool for sentiment analysis annotation ★13 · Mar 26, 2025 · Updated last year
- Featurize words into orthographic and phonological vectors. ★42 · May 20, 2023 · Updated 3 years ago
- Finite-state script normalization and processing utilities ★51 · Updated this week
- Code and data for the WSDM '19 paper "Crosslingual Document Embedding as Reduced-Rank Ridge Regression (Cr5)" ★30 · Aug 17, 2019 · Updated 6 years ago
- The Code and Script of "David's Slingshot: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis" ★34 · Jun 13, 2025 · Updated last year
- A neural network that jointly part-of-speech tags and lemmatizes sentences, boosting accuracy for morphologically-rich languages (Czech, … ★34 · Apr 5, 2019 · Updated 7 years ago
- Keyphrase Extraction Prototypes ★15 · Nov 24, 2016 · Updated 9 years ago
- Python library to use Pleias-RAG models ★72 · Jun 10, 2026 · Updated last week
- Rhythm analysis toolkit in Python ★13 · Sep 29, 2023 · Updated 2 years ago
- Proteus is an experimental platform that combines the power of Large Language Models with the Genesis physics engine ★25 · Dec 20, 2024 · Updated last year
- WorldModel is a MaskGIT model trained on 8x8x8 Minecraft voxel volumes. Beyond generating blocks from scratch, it excels in filling space… ★14 · Sep 12, 2023 · Updated 2 years ago
- John Langford's original release of Vowpal Wabbit -- a fast online learning algorithm ★16 · Jul 25, 2017 · Updated 8 years ago