zaemyung / sentsplit
A flexible sentence segmentation library using a CRF model and regex rules
☆29 · Updated last year
Alternatives and similar repositories for sentsplit
Users who are interested in sentsplit compare it to the libraries listed below.
- Megatron LM 11B on Huggingface Transformers ☆27 · Updated 4 years ago
- A tiny BERT for low-resource monolingual models ☆31 · Updated 11 months ago
- Pytorch Implementation of EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks ☆63 · Updated 3 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆73 · Updated last year
- KETOD Knowledge-Enriched Task-Oriented Dialogue ☆32 · Updated 2 years ago
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ☆58 · Updated 3 years ago
- This tool helps automatic generation of grammatically valid synthetic Code-mixed data by utilizing linguistic theories such as Equivalenc… ☆55 · Updated last year
- A PyTorch Implementation of the Luna: Linear Unified Nested Attention ☆41 · Updated 4 years ago
- Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE ☆18 · Updated 3 years ago
- Tool to fix bitexts and tag near-duplicates for removal ☆31 · Updated 6 months ago
- The Shmoop Corpus ☆17 · Updated 4 years ago
- NTREX -- News Test References for MT Evaluation ☆85 · Updated last year
- ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost ☆41 · Updated last year
- Large scale unannotated Korean corpus for unsupervised tasks. (e.g. Language modeling) ☆28 · Updated 6 years ago
- Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models" ☆27 · Updated 3 years ago
- ☆44 · Updated 4 years ago
- BERT models for many languages created from Wikipedia texts ☆33 · Updated 5 years ago
- ☆23 · Updated last year
- Personal information identification standard ☆21 · Updated last year
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆83 · Updated 11 months ago
- A library for data streaming and augmentation ☆20 · Updated 3 months ago
- As good as new. How to successfully recycle English GPT-2 to make models for other languages (ACL Findings 2021) ☆48 · Updated 4 years ago
- "Why do I feel offended?" - Korean Dataset for Offensive Language Identification (EACL2023 Findings) ☆15 · Updated 2 years ago
- FactSumm: Factual Consistency Scorer for Abstractive Summarization ☆112 · Updated last year
- An official implementation of "BPE-Dropout: Simple and Effective Subword Regularization" algorithm. ☆53 · Updated 4 years ago
- Pre-training BART in Flax on The Pile dataset ☆22 · Updated 4 years ago
- ☆29 · Updated 3 years ago
- A simple neural truecaser written in pytorch and allennlp. ☆33 · Updated last year
- ☆36 · Updated 3 years ago
- Convenient Text-to-Text Training for Transformers ☆19 · Updated 3 years ago