tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆20 Updated last year
Related projects
Alternatives and complementary repositories for crawling-framework
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆9 Updated 3 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆181 Updated 6 years ago
- Index Common Crawl archives in tabular format ☆106 Updated 2 weeks ago
- Advanced desktop search/corpus exploration prototype ☆21 Updated 3 years ago
- AmbiverseNLU: A Natural Language Understanding suite by Max Planck Institute for Informatics ☆208 Updated 10 months ago
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆64 Updated 3 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆85 Updated 3 years ago
- Trying to generate name synonyms from wikidata ☆33 Updated 4 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆159 Updated last month
- Search relevance evaluation toolkit ☆30 Updated 2 years ago
- A machine learning tool for fishing entities ☆245 Updated last month
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18 ☆167 Updated 3 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆50 Updated 4 years ago
- A fast and simple JavaScript library specifically targeted at collecting search and search related browser events. ☆41 Updated 2 months ago
- Reading legal authority for the last time ☆34 Updated 5 months ago
- Graph databases, Knowledge Graphs, SPARQ ☆75 Updated 3 years ago
- This page is a companion for the paper titled Towards Automatic Structuring and Semantic Indexing of Legal Documents ☆28 Updated 6 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆80 Updated 6 years ago
- LexPredict Legal Dictionaries ☆111 Updated 2 years ago
- API definition, resources and reference implementation of URL Frontiers ☆45 Updated last week
- A text tagger based on Lucene / Solr, using FST technology ☆174 Updated 10 months ago
- Index URLs in Common Crawl ☆193 Updated 7 years ago
- Search relevance evaluation toolkit ☆73 Updated 2 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 2 years ago
- LexPredict ContraxSuite document samples ☆21 Updated 6 years ago
- WARC and ARC indexing and discovery tools. ☆116 Updated 3 months ago
- Improve your Elasticsearch, OpenSearch, Solr, Vectara, Algolia and Custom Search search quality. ☆284 Updated this week
- Disambiguation of Semantic Resources - Full framework ☆30 Updated 8 years ago
- Various utilities regarding Levenshtein transducers. (Java) ☆56 Updated 2 years ago