Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"
☆13Nov 26, 2024Updated last year
Alternatives and similar repositories for tokenizers_intrinsic_benchmark
Users that are interested in tokenizers_intrinsic_benchmark are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- PathPiece tokenizer☆14Nov 10, 2024Updated last year
- Official implementation of "Data Mixture Inference: What do BPE tokenizers reveal about their training data?"☆18May 15, 2025Updated 11 months ago
- 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment☆11Apr 6, 2025Updated last year
- Code for "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).☆11May 1, 2025Updated 11 months ago
- Complete set of English dialect transformation rules and evaluation code☆16Jun 7, 2024Updated last year
- Wordpress hosting with auto-scaling - Free Trial • AdFully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
- 🌍 A simple script for taking automated screenshots from a Leaflet map☆15Mar 29, 2018Updated 8 years ago
- Can Large Language Models Identify Authorship? (EMNLP 2024 Findings)☆13Feb 4, 2025Updated last year
- This repository includes pneumonia detection on Chest X-ray Images by using Deep Learning(Keras).☆21Nov 6, 2022Updated 3 years ago
- ☆10Nov 8, 2023Updated 2 years ago
- Code and models for the CVPR 2017 paper "DeepNav: Learning to Navigate Large Cities"☆13Feb 16, 2020Updated 6 years ago
- Split bib files for anthology bibliography for overleaf☆11Aug 25, 2024Updated last year
- PANiC - PAraphrasing Noun-Compounds☆15Apr 6, 2018Updated 8 years ago
- Demo server for TREC LiveQA competition☆11Dec 7, 2016Updated 9 years ago
- Detail-Sensitive Panoramic Annular Semantic Segmentation☆12May 19, 2022Updated 3 years ago
- Serverless GPU API endpoints on Runpod - Bonus Credits • AdSkip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
- ACL 2021 paper "Style is NOT a single variable: Case Studies for Cross-Style Language Understanding " by Dongyeop Kang and Eduard Hovy☆15Jul 19, 2021Updated 4 years ago
- Code for the ILNewsDiff Twitter account☆10May 23, 2023Updated 2 years ago
- Simple-to-use scoring function for arbitrarily tokenized texts.☆48Feb 19, 2025Updated last year
- Event based Sign-Language-Translation☆19Feb 27, 2026Updated last month
- TensorFlow implementation of "Generating Sentences from a Continuous Space"☆11Sep 16, 2019Updated 6 years ago
- ☆25Sep 26, 2025Updated 6 months ago
- Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.☆13Jan 5, 2023Updated 3 years ago
- Dialogue Act classification☆18Jan 15, 2024Updated 2 years ago
- 🔄 ASCII / IPA conversion for Typst☆22Jan 8, 2026Updated 3 months ago
- Deploy open-source AI quickly and easily - Bonus Offer • AdRunpod Hub is built for open source. One-click deployment and autoscaling endpoints without provisioning your own infrastructure.
- Collection of academic works in natural language processing, computational linguistics, and computational cognitive science that study th…☆22Mar 20, 2024Updated 2 years ago
- Code for "Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations" [NAACL Findings 2024]☆15Apr 3, 2026Updated last week
- Statistics on multilingual datasets☆17Jul 12, 2022Updated 3 years ago
- Official repository for "BLEUBERI: BLEU is a surprisingly effective reward for instruction following"☆32Jun 5, 2025Updated 10 months ago
- Materials for LOT School 2023, "Language Learning: A Data-Driven Approach"☆14Aug 14, 2024Updated last year
- EEG-MI signal classification DL model.☆14Apr 26, 2024Updated last year
- ☆18Feb 4, 2025Updated last year
- BPE modification that implements removing of the intermediate tokens during tokenizer training.☆27Nov 25, 2024Updated last year
- (BMVC2021, Oral) The repository offers the official implementation of our BMVC 2021 paper (oral) in PyTorch.☆18Apr 22, 2022Updated 3 years ago
- GPU virtual machines on DigitalOcean Gradient AI • AdGet to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
- Three little Python scripts for data preparation: remove commas, add commas, concatenate files☆16Jul 26, 2017Updated 8 years ago
- Code for SaGe subword tokenizer (EACL 2023)☆28Nov 30, 2024Updated last year
- This is a Utrecht University dissertation template for LaTeX☆22Jul 31, 2025Updated 8 months ago
- An updated version of the Parser-v1 repo, used for Stanford's submission in the CoNLL17 shared task.☆45Aug 15, 2018Updated 7 years ago
- Find informative examples to efficiently (human)-evaluate NLG models.☆18Feb 27, 2026Updated last month
- ☆13Apr 16, 2021Updated 5 years ago
- Code repository for the paper "MrT5: Dynamic Token Merging for Efficient Byte-level Language Models."☆56Sep 25, 2025Updated 6 months ago