chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆126 Updated 2 months ago
Alternatives and similar repositories for chatnoir-resiliparse
Users who are interested in chatnoir-resiliparse are comparing it to the libraries listed below.
- Pretraining Efficiently on S2ORC! ☆178 Updated last year
- The pipeline for the OSCAR corpus ☆175 Updated last month
- ☆217 Updated 2 months ago
- This project studies the performance and robustness of language models and task-adaptation methods. ☆155 Updated last year
- Inspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese… ☆135 Updated 2 years ago
- Datasets collection and preprocessings framework for NLP extreme multitask learning ☆189 Updated 6 months ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆225 Updated last year
- Manage scalable open LLM inference endpoints in Slurm clusters ☆278 Updated last year
- ☆38 Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆182 Updated last month
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆85 Updated 2 years ago
- ☆92 Updated 3 years ago
- ☆16 Updated last year
- multimodal document analysis ☆166 Updated last month
- minimal pytorch implementation of bm25 (with sparse tensors) ☆104 Updated 2 months ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆203 Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web ☆179 Updated 2 years ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆206 Updated this week
- Organize the Web: Constructing Domains Enhances Pre-Training Data Curation ☆73 Updated 8 months ago
- Web archiving utility library ☆11 Updated last month
- ☆119 Updated last year
- CLIR version of ColBERT ☆74 Updated 6 months ago
- ☆82 Updated 2 months ago
- Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, imp… ☆201 Updated last year
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆49 Updated 2 years ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆111 Updated last year
- A large-scale information-rich web dataset, featuring millions of real clicked query-document labels ☆345 Updated last year
- ☆62 Updated last year
- A Python Search Engine for Humans 🥸 ☆243 Updated 3 weeks ago
- Model implementation for the contextual embeddings project ☆39 Updated 7 months ago