tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆21 Updated 2 years ago
Alternatives and similar repositories for crawling-framework:
Users who are interested in crawling-framework are comparing it to the libraries listed below.
- Easily run HTTP GET requests against a list of URLs to check their HTTP status. ☆12 Updated 5 years ago
- Beagle helps you identify keywords, phrases, regexes, and complex search queries of interest in streams of text documents. ☆52 Updated 3 years ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆12 Updated 4 years ago
- API definition, resources and reference implementation of URL Frontiers ☆48 Updated this week
- Search relevance evaluation toolkit ☆73 Updated 3 years ago
- A Named-Entity Recogniser based on Grobid. ☆52 Updated 7 months ago
- Solr Query Segmenter for structuring unstructured queries ☆21 Updated 3 years ago
- A natural language search microservice ☆95 Updated 4 years ago
- AmbiverseNLU: A Natural Language Understanding suite by Max Planck Institute for Informatics ☆210 Updated last year
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 2 years ago
- A machine learning tool for fishing entities ☆264 Updated 3 weeks ago
- Java library for reading and writing WARC files with a typed API ☆48 Updated 4 months ago
- Filter and format a newline-delimited JSON stream of Wikibase entities ☆97 Updated 6 months ago
- Search relevance evaluation toolkit ☆32 Updated 2 years ago
- Index Common Crawl archives in tabular format ☆117 Updated last month
- Multilingual library to easily parse date strings to java.util.Date objects. ☆30 Updated 5 years ago
- API - extract a list of keywords from a text. ☆18 Updated 7 years ago
- Disambiguation of Semantic Resources - Full framework ☆30 Updated 8 years ago
- For extracting measurements and related entities from text ☆57 Updated 4 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆97 Updated 2 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆81 Updated 6 years ago
- Natural Language Parsing and Feature Generation ☆38 Updated 4 months ago
- Multi Tier Annotation Search ☆26 Updated 3 years ago
- Record Linkage ToolKit (Find and link entities) ☆110 Updated last year
- GROBID extension for identifying and normalizing physical quantities. ☆80 Updated 7 months ago
- CLI for loading Wikidata subsets (or all of it) into Elasticsearch ☆70 Updated 3 years ago
- Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums. ☆219 Updated this week
- A Java UIMA-based toolbox for multilingual and efficient terminology extraction and multilingual term alignment ☆40 Updated 7 years ago
- SPARQL query DSL for Clojure ☆21 Updated 10 years ago
- WARC and ARC indexing and discovery tools. ☆123 Updated last month