tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆20 · Updated 2 years ago
Related projects
Alternatives and complementary repositories for crawling-framework
- Easily run HTTP GET requests against a list of URLs to check their HTTP status. ☆12 · Updated 5 years ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆9 · Updated 3 years ago
- Beagle helps you identify keywords, phrases, regexes, and complex search queries of interest in streams of text documents. ☆52 · Updated 3 years ago
- LegalCrawler: A tool for automated scraping of English legal corpora ☆48 · Updated 2 years ago
- A natural language search microservice ☆96 · Updated 3 years ago
- API definition, resources and reference implementation of URL Frontiers ☆46 · Updated this week
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆65 · Updated 3 years ago
- Reading legal authority for the last time ☆34 · Updated 6 months ago
- A simple web application for searching Word2Vec embeddings derived from approximately 2,000 law reports published by The Incorporated… ☆25 · Updated 2 years ago
- A fast and simple JavaScript library specifically targeted at collecting search and search related browser events. ☆41 · Updated 3 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆159 · Updated last month
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 · Updated 5 years ago
- API - extract a list of keywords from a text. ☆18 · Updated 7 years ago
- Index Common Crawl archives in tabular format ☆106 · Updated this week
- Advanced desktop search/corpus exploration prototype ☆21 · Updated 3 years ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆187 · Updated 2 years ago
- Find legal citations in any block of text ☆123 · Updated 4 months ago
- Trying to generate name synonyms from wikidata ☆33 · Updated 4 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆181 · Updated 6 years ago
- An opensource TAR framework for experiments and applications ☆16 · Updated 8 months ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆85 · Updated 3 years ago
- LexPredict Legal Dictionaries ☆111 · Updated 2 years ago
- Socrates is a thin wrapper around an early-stage [AllenNLP](https://allennlp.org/) model that enables machine reading comprehension (MRC)… ☆14 · Updated 3 years ago
- A machine learning tool for fishing entities ☆249 · Updated last week
- Judgment citation annotations for the National Archives Find Case Law service ☆22 · Updated this week
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆95 · Updated 2 years ago
- NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser ☆49 · Updated 11 months ago
- CLI for loading Wikidata subsets (or all of it) into Elasticsearch ☆67 · Updated 2 years ago
- A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any othe… ☆65 · Updated 2 years ago