commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆50Updated this week
Alternatives and similar repositories for cc-notebooks:
Users who are interested in cc-notebooks compare it to the repositories listed below.
- spaCy entry points for Curated Transformers☆26Updated 4 months ago
- Index Common Crawl archives in tabular format☆110Updated 2 months ago
- Use sync mode Playwright interactively, inside a Jupyter notebook☆14Updated 2 months ago
- Documentation effort for the BookCorpus dataset☆33Updated 3 years ago
- Graph databases, Knowledge Graphs, SPARQL☆76Updated 3 years ago
- Streamlit demo app to demonstrate the features of transformers interpret with multiple models.☆25Updated 3 years ago
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph☆23Updated 11 months ago
- Deployment of pywb as a CommonCrawl Index Server☆21Updated 7 years ago
- Fastlaw's purpose is to replace generic word embeddings in supervised machine learning NLP tasks on legal texts.☆37Updated 5 years ago
- Tokenization across languages. Useful as preprocessing for subword tokenization.☆22Updated 2 years ago
- ☄️ Parallel and distributed training with spaCy and Ray☆53Updated last year
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial)☆17Updated 10 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction☆24Updated 2 years ago
- MirrorDataGenerator is a python tool that generates synthetic data based on user-specified causal relations among features in the data. I…☆21Updated 2 years ago
- 🌸 Train floret vectors☆18Updated last year
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values …☆15Updated 6 years ago
- arXiv plain text extraction☆41Updated 2 years ago
- Repository for deepdoctection tutorial notebooks☆42Updated 2 months ago
- Building a Job Dataset☆21Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy.☆41Updated 5 years ago
- A conda-smithy repository for spacy.☆14Updated 2 months ago
- Analyze trends in articles published on arXiv☆16Updated last year
- LLM plugin for clustering embeddings☆68Updated 11 months ago
- A utility for labeling clusters of text data.☆28Updated 3 years ago