leogao2/commoncrawl_downloader

☆33

Alternatives and similar repositories for commoncrawl_downloader

Users who are interested in commoncrawl_downloader are comparing it to the repositories listed below.

lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
noanabeshima / wikipedia-downloader
Downloads 2020 English Wikipedia articles as plaintext
☆27 · Mar 25, 2023 · Updated 3 years ago
EleutherAI / pile-cc
☆16 · Mar 25, 2022 · Updated 4 years ago
timpal0l / sts-benchmark-swedish
☆11 · Dec 3, 2020 · Updated 5 years ago
EleutherAI / pyfra
Python Research Framework
☆107 · Nov 3, 2022 · Updated 3 years ago
leogao2 / lm_dataformat
☆79 · Dec 7, 2023 · Updated 2 years ago
microsoft / msmarco
Website for MS Marco
☆36 · Mar 26, 2025 · Updated last year
NEUIR / ConAE
[EMNLP 2022] This is the code repo for our EMNLP'22 paper "Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder"…
☆13 · Oct 20, 2022 · Updated 3 years ago
af-ai-center / bert
Code and Swedish pre-trained models for BERT
☆12 · Feb 5, 2020 · Updated 6 years ago
sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆47 · Sep 22, 2020 · Updated 5 years ago
xyltt / LPT
This repo contains the code for Late Prompt Tuning.
☆12 · Dec 22, 2025 · Updated 7 months ago
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
INK-USC / FiD-ICL
"FiD-ICL: A Fusion-in-Decoder Approach for Efficient In-Context Learning" (ACL 2023)
☆15 · Jul 24, 2023 · Updated 3 years ago
ryanhlewis / GPT-Auto-Scraper
A GPT-powered AI auto scraper for websites. AI Web Scraping made easy.
☆13 · Jun 26, 2023 · Updated 3 years ago
microsoft / ARXGEN
Scripts to parse arxiv documents for NLP tasks
☆19 · Jun 12, 2023 · Updated 3 years ago
MaXuSun / domainext
A PyTorch toolbox for domain adaptation, domain generalization, federated learning DA/DG, active learning DA/DG, ALDG and semi-supervised…
☆11 · Jan 10, 2022 · Updated 4 years ago
rynmurdock / CPPN-WGAN-GP
A WGAN-GP that utilizes a compositional pattern producing network as the generator
☆11 · Sep 9, 2021 · Updated 4 years ago
yuhongqian / ANCE-PRF
☆12 · May 17, 2022 · Updated 4 years ago
vicgalle / zero-shot-reward-models
ZYN: Zero-Shot Reward Models with Yes-No Questions
☆34 · Aug 15, 2023 · Updated 2 years ago
sanowl / CoRAG
Based on the paper "Chain-of-Retrieval Augmented Generation"
☆15 · Mar 29, 2025 · Updated last year
EleutherAI / github-downloader
Script for downloading GitHub.
☆99 · Jul 1, 2024 · Updated 2 years ago
jlira5418 / web_scraping_challenge
Step 1 - Scraping: Complete your initial scraping using Jupyter Notebook, BeautifulSoup, Pandas, and Requests/Splinter. Create a Ju…
☆11 · Dec 22, 2021 · Updated 4 years ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Jul 20, 2026 · Updated last week
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jul 21, 2026 · Updated last week
LeeSureman / MoT
Code for the arXiv preprint "MoT: Pre-thinking and Recalling Enable ChatGPT to Self-Improve with Memory-of-Thoughts"
☆24 · Nov 29, 2023 · Updated 2 years ago
HanSolo9682 / CounterCurate
This is the implementation of CounterCurate, the data curation pipeline of both physical and semantic counterfactual image-caption pairs.
☆19 · Jun 27, 2024 · Updated 2 years ago
edenartlab / flux-trainer
Eden Flux LoRA trainer and full-finetuning
☆23 · Mar 21, 2025 · Updated last year
tongzhou21 / Oasis
☆23 · Aug 7, 2023 · Updated 2 years ago
huggingface / Megatron-LM
Ongoing research training transformer models at scale
☆19 · Jul 27, 2023 · Updated 3 years ago
JonathanRaiman / epub_conversion
Python package for converting xml and epubs to text files
☆33 · Jun 9, 2020 · Updated 6 years ago
amirgamil / curius-search
Search engine of my Curius data
☆16 · Apr 10, 2022 · Updated 4 years ago
moirage / alignment-research-dataset
A dataset of alignment research and code to reproduce it
☆80 · Jun 22, 2023 · Updated 3 years ago
distdl / distdl
☆21 · Aug 18, 2022 · Updated 3 years ago
mollerhoj / Scandinavian-ULMFiT
The weights for the embedding layer of Scandinavian ULMFiT language models
☆32 · Dec 5, 2019 · Updated 6 years ago
swj0419 / detect-pretrain-code
This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji…
☆243 · Nov 3, 2023 · Updated 2 years ago
cliang1453 / SAGE
No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models (ICLR 2022)
☆29 · Feb 9, 2022 · Updated 4 years ago
SapienzaNLP / mcl-wic
SemEval-2021 Multilingual and Cross-lingual Word-in-Context Task
☆18 · May 27, 2021 · Updated 5 years ago
shammur / SemEval2022Task3
The PreTENS shared task hosted at SemEval 2022 aims at focusing on semantic competence with specific attention on the evaluation of langu…
☆12 · Feb 5, 2022 · Updated 4 years ago
TimotheeMickus / codwoe
The CODWOE shared task invites you to compare two types of semantic descriptions: dictionary glosses and word embedding representations. …
☆12 · Jul 13, 2022 · Updated 4 years ago