epfl-dlab / homepage2vec
Language-Agnostic Website Embedding and Classification
☆41 Updated last year
Alternatives and similar repositories for homepage2vec:
Users who are interested in homepage2vec are comparing it to the libraries listed below.
- Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21) ☆41 Updated 3 years ago
- The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization ☆157 Updated 2 years ago
- MultiCite code and data. Models are available on Huggingface. ☆31 Updated 2 years ago
- SciWING is a modern toolkit for scientific document processing from WING-NUS ☆63 Updated last year
- Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining. ☆31 Updated last year
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03. ☆23 Updated 8 months ago
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆29 Updated 2 years ago
- Code accompanying the submission "Structural Text Segmentation of Legal Documents" by Aumiller et al. ☆96 Updated last year
- Annotated corpus + evaluation metrics for text anonymisation ☆55 Updated last year
- 💫 SpaCy wrapper for ConceptNet 💫 ☆90 Updated last year
- StAtutory Reasoning Assessment ☆13 Updated 2 years ago
- Data and additional information regarding the paper: Contract Discovery. Dataset and a Few-Shot Semantic Retrieval Challenge with Competi… ☆30 Updated 4 years ago
- Legal document classification with EuroVoc descriptors on 22 languages. ☆25 Updated last year
- Incorporating VIsual LAyout Structures for Scientific Text Classification ☆175 Updated 2 years ago
- French Machine Reading for Question Answering ☆18 Updated 2 years ago
- Code for equipping pretrained language models (BART, GPT-2, XLNet) with commonsense knowledge for generating implicit knowledge statement… ☆16 Updated 3 years ago
- Reimplementation of a BERT based model (Shi et al, 2019), currently the state-of-the-art for English SRL. This model implements also pred… ☆70 Updated 3 years ago
- A dataset of atomic wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contai… ☆106 Updated 5 years ago
- Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models ☆65 Updated 2 years ago
- This repository contains the code for the paper 'PARM: Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval' pu… ☆40 Updated 3 years ago
- Repository for Zheng and Guha et al., 2021, "When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Data… ☆86 Updated last year
- Repository for the paper "Named Entity Recognition for Entity Linking: What Works and What's Next" (EMNLP 2021). ☆75 Updated 3 years ago
- The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model. ☆37 Updated 3 years ago
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. ☆74 Updated 3 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆44 Updated 10 months ago
- multimodal document analysis ☆164 Updated 9 months ago
- [LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweeban… ☆104 Updated last year