[NeurIPS 2024] 🕸 GlotCC Dataset and Pipeline
☆20 · Apr 6, 2025 · Updated last year
Alternatives and similar repositories for GlotCC
Users that are interested in GlotCC are comparing it to the libraries listed below.
- [ACL 2025] Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ☆11 · Apr 6, 2025 · Updated last year
- [WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages ☆17 · Apr 14, 2026 · Updated last month
- ☆10 · Oct 2, 2024 · Updated last year
- [NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining ☆18 · Nov 26, 2023 · Updated 2 years ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ☆18 · Nov 4, 2025 · Updated 6 months ago
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs ☆47 · Sep 19, 2025 · Updated 8 months ago
- The SETimes.HR+ Croatian dependency treebank ☆16 · Dec 27, 2016 · Updated 9 years ago
- A RAG that can scale 🧑🏻‍💻 ☆11 · May 28, 2024 · Updated 2 years ago
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba ☆38 · Oct 16, 2025 · Updated 7 months ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… ☆27 · Feb 16, 2026 · Updated 3 months ago
- ☆17 · Jan 5, 2023 · Updated 3 years ago
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval" ☆35 · Sep 20, 2025 · Updated 8 months ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 · Mar 20, 2024 · Updated 2 years ago
- ☆45 · Feb 11, 2026 · Updated 3 months ago
- Sparse Embedding Compression for Scalable Retrieval in Recommender Systems ☆35 · Nov 21, 2025 · Updated 6 months ago
- Semantically Search Emojis From the Command Line! ☆13 · Nov 26, 2023 · Updated 2 years ago
- A missing piece of the Python multitask (both threads and processes) API: An extension that supports stateful worker pools & size-aware i… ☆29 · Mar 8, 2026 · Updated 2 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆23 · Jun 30, 2025 · Updated 10 months ago
- Fine-tune OpenAI models for text classification, question answering, and more ☆17 · May 1, 2023 · Updated 3 years ago
- A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals … ☆15 · Jul 19, 2024 · Updated last year
- Open-source Human Feedback Library ☆11 · Oct 25, 2023 · Updated 2 years ago
- A proposed standard `NOCK` for a Parquet format that supports efficient distributed serialization of multiple kinds of graph technologies ☆21 · Apr 27, 2026 · Updated last month
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ☆32 · Jun 20, 2023 · Updated 2 years ago
- Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and visi… ☆32 · Feb 7, 2025 · Updated last year
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ☆30 · Apr 2, 2022 · Updated 4 years ago
- TyDiP Multilingual Politeness dataset and code ☆12 · Oct 15, 2023 · Updated 2 years ago
- Code, results and other artifacts from the paper introducing the WildChat-50m dataset and the Re-Wild model family. ☆38 · Apr 1, 2025 · Updated last year
- ☆43 · May 27, 2025 · Updated last year
- Tool for sentiment analysis annotation ☆13 · Mar 26, 2025 · Updated last year
- Featurize words into orthographic and phonological vectors. ☆42 · May 20, 2023 · Updated 3 years ago
- LVAS-Agent Code Base ☆21 · Apr 15, 2025 · Updated last year
- Finite-state script normalization and processing utilities ☆49 · May 8, 2026 · Updated 2 weeks ago
- Code and data for the WSDM '19 paper "Crosslingual Document Embedding as Reduced-Rank Ridge Regression (Cr5)" ☆30 · Aug 17, 2019 · Updated 6 years ago
- The Code and Script of "David's Slingshot: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis" ☆34 · Jun 13, 2025 · Updated 11 months ago
- Keyphrase Extraction Prototypes ☆15 · Nov 24, 2016 · Updated 9 years ago
- Python library to use Pleias-RAG models ☆71 · May 8, 2026 · Updated 2 weeks ago
- Rhythm analysis toolkit in Python ☆13 · Sep 29, 2023 · Updated 2 years ago
- Proteus is an experimental platform that combines the power of Large Language Models with the Genesis physics engine ☆25 · Dec 20, 2024 · Updated last year
- WorldModel is a MaskGIT model trained on 8x8x8 Minecraft voxel volumes. Beyond generating blocks from scratch, it excels in filling space… ☆14 · Sep 12, 2023 · Updated 2 years ago