huggingface / roots-search-tool
Scripts supporting the development and serving of the Roots Search Tool - https://hf.co/spaces/bigscience-data/roots-search
☆10 Updated last year
Related projects
Alternatives and complementary repositories for roots-search-tool
- Hugging Face and Pyserini interoperability ☆19 Updated last year
- A Streamlit app to add structured tags to a dataset card ☆22 Updated 2 years ago
- Highly specialized crate to parse and use `google/sentencepiece`'s precompiled_charsmap in `tokenizers` ☆18 Updated 2 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 Updated 2 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 Updated last year
- Source code and data for Like a Good Nearest Neighbor ☆28 Updated 9 months ago
- A file utility for accessing both local and remote files through a unified interface. ☆35 Updated 3 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆17 Updated 3 weeks ago
- Efficiently computing & storing token n-grams from large corpora ☆15 Updated last month
- ☆19 Updated last year
- Tokenization across languages. Useful as preprocessing for subword tokenization. ☆22 Updated last year
- 🤗 Disaggregators: Curated data labelers for in-depth analysis. ☆65 Updated last year
- Code for the paper "Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots" (NAACL-HLT 2021) ☆10 Updated 2 years ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any Hugging Face text dataset. ☆92 Updated last year
- ☆22 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆31 Updated 3 years ago
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he… ☆31 Updated last year
- Code for SaGe subword tokenizer (EACL 2023) ☆22 Updated last month
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated last year
- A starter kit for evaluating benchmarks on the 🤗 Hub ☆13 Updated 10 months ago
- A utility for labeling clusters of text data. ☆28 Updated 3 years ago
- Code for "CyberWallE at SemEval-2020 Task 11: An Analysis of Feature Engineering for Ensemble Models for Propaganda Detection" (V. Blasch… ☆9 Updated 3 years ago
- Minimum Description Length probing for neural network representations ☆16 Updated last week
- Efficient BM25 with DuckDB 🦆 ☆29 Updated 3 weeks ago
- ☆15 Updated 3 months ago
- An extension package of 🤗 Datasets that provides support for executing arbitrary SQL queries on HF datasets ☆31 Updated 9 months ago
- Ranking of fine-tuned HF models as base models. ☆35 Updated last year
- Library for fast text representation and classification. ☆28 Updated 10 months ago
- ☆19 Updated 3 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆18 Updated last year