dell-research-harvard / NEWS-COPY
Noise-robust de-duplication at scale
☆19Updated 2 years ago
Alternatives and similar repositories for NEWS-COPY
Users who are interested in NEWS-COPY are comparing it to the libraries listed below
- Data and code for the paper "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding"☆14Updated 3 years ago
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets.☆13Updated 2 years ago
- ☆17Updated 3 months ago
- Code and data for "Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words"☆17Updated 4 years ago
- MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer☆40Updated 3 years ago
- ☆23Updated 11 months ago
- KIND: an Italian Multi-Domain Dataset for Named Entity Recognition☆15Updated 2 years ago
- ☆17Updated 11 months ago
- This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' pu…☆41Updated 4 years ago
- A survey of corpora for Germanic low-resource languages and dialects☆26Updated last year
- PropSegmEnt is an annotated dataset for segmenting English text into propositions, and recognizing proposition-level entailment relations…☆21Updated 3 years ago
- ☆26Updated 10 months ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation"☆30Updated 3 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings☆22Updated 6 months ago
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio…☆45Updated last year
- Repository with code for MaChAmp: https://aclanthology.org/2021.eacl-demos.22/☆90Updated 7 months ago
- ☆10Updated last year
- Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.☆31Updated 2 years ago
- Repository for Zheng and Guha et al., 2021, "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Data…☆94Updated 2 years ago
- This is official code for the NAACL 2021 paper: "MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identi…☆52Updated 2 years ago
- A module to compute textual lexical richness (aka lexical diversity).☆112Updated 2 years ago
- ☆31Updated 2 years ago
- Twitter dataset for 2022 Russian and Ukrainian crisis☆48Updated 3 years ago
- GisPy: A Tool for Measuring Gist Inference Score in Text https://aclanthology.org/2022.wnu-1.5/☆13Updated last year
- Can LLMs generate code-mixed sentences through zero-shot prompting?☆11Updated 2 years ago
- ☆23Updated 4 years ago
- Legal document similarity - Code, data, and models for the ICAIL 2021 paper "Evaluating Document Representations for Content-based Legal …☆32Updated 4 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language☆74Updated last year
- [LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweeban…☆105Updated last year
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2…☆70Updated 2 years ago