tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆21 Updated 3 years ago
Alternatives and similar repositories for crawling-framework
Users who are interested in crawling-framework are comparing it to the libraries listed below.
- Trying to generate name synonyms from wikidata ☆34 Updated 5 years ago
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆67 Updated 4 months ago
- A natural language search microservice ☆96 Updated 4 years ago
- Advanced desktop search/corpus exploration prototype ☆21 Updated 4 years ago
- Watchman: An open-source social-media event-detection system ☆21 Updated 7 years ago
- A machine learning tool for fishing entities ☆263 Updated 5 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆187 Updated last week
- A fast and simple JavaScript library specifically targeted at collecting search and search related browser events. ☆43 Updated last year
- Index Common Crawl archives in tabular format ☆122 Updated 2 weeks ago
- CubeQA—Question Answering on Statistical Linked Data ☆21 Updated last month
- This page is a companion for the paper titled Towards Automatic Structuring and Semantic Indexing of Legal Documents ☆29 Updated last week
- Search relevance evaluation toolkit ☆74 Updated 3 years ago
- Search Quality Evaluation Tool for Apache Solr & Elasticsearch search-based infrastructures ☆192 Updated this week
- tool for collectively summarizing large discussions ☆145 Updated 2 years ago
- API definition, resources and reference implementation of URL Frontiers ☆53 Updated this week
- Beagle helps you identify keywords, phrases, regexes, and complex search queries of interest in streams of text documents. ☆54 Updated 4 years ago
- Record Linkage ToolKit (Find and link entities) ☆109 Updated 2 years ago
- Federated Knowledge Extraction Framework ☆193 Updated 2 years ago
- Open Source REST API for named entity extraction, named entity linking, named entity disambiguation, recommendation & reconciliation of e… ☆197 Updated 3 years ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆191 Updated 3 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 Updated 5 years ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 Updated 2 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆111 Updated 3 years ago
- Table Linker ☆22 Updated 3 years ago
- Improve your OpenSearch, Elasticsearch, Solr, Vectara, Algolia and Custom Search search quality. ☆326 Updated this week
- Solr Query Segmenter for structuring unstructured queries ☆22 Updated 4 years ago
- API - extract a list of keywords from a text. ☆18 Updated 8 years ago
- Java library for reading and writing WARC files with a typed API ☆50 Updated last month
- Towards an open source stack for e-commerce search ☆150 Updated last month
- A Named-Entity Recogniser based on Grobid. ☆54 Updated 6 months ago