dell-research-harvard / NEWS-COPY
Noise-robust de-duplication at scale
☆19Updated 2 years ago
Alternatives and similar repositories for NEWS-COPY
Users who are interested in NEWS-COPY are comparing it to the libraries listed below.
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets.☆13Updated 2 years ago
- Data and code for the paper "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding"☆14Updated 3 years ago
- ☆23Updated 10 months ago
- My NER Experiments with ModernBERT and Ettin☆26Updated 4 months ago
- Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal …☆32Updated 4 years ago
- Data for the HIPE 2022 shared task.☆21Updated 2 years ago
- ☆53Updated last year
- Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.☆31Updated 2 years ago
- Code and data for "Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words"☆17Updated 4 years ago
- Implementation of "SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages" paper, accepted to E…☆25Updated 3 years ago
- ☆15Updated 2 months ago
- A survey of corpora for Germanic low-resource languages and dialects☆26Updated last year
- [NAACL 2022] GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers☆21Updated 2 years ago
- TimeLMs: Diachronic Language Models from Twitter☆111Updated last year
- PropSegmEnt is an annotated dataset for segmenting English text into propositions, and recognizing proposition-level entailment relations…☆21Updated 2 years ago
- Repository for Zheng and Guha et al., 2021, "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Data…☆93Updated 2 years ago
- KIND: an Italian Multi-Domain Dataset for Named Entity Recognition☆15Updated 2 years ago
- An implementation of GrASP (Shnarch et al., 2017)☆23Updated 3 years ago
- Semantically Structured Sentence Embeddings☆69Updated last year
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.☆85Updated last year
- Code for "Dynamic Contextualized Word Embeddings"☆32Updated 3 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation"☆30Updated 3 years ago
- This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' pu…☆41Updated 3 years ago
- ☆13Updated 3 years ago
- A Python package to run inference with HuggingFace language and vision-language checkpoints, wrapping many convenient features.☆28Updated last year
- Can LLMs generate code-mixed sentences through zero-shot prompting?☆11Updated 2 years ago
- ☆17Updated 10 months ago
- This repository provides the source code used to automatically generate the book summarization datasets described in the paper titled "Ec…☆11Updated 8 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings☆22Updated 5 months ago
- Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.☆54Updated 2 years ago