trendsci / linkrun
LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship
☆38 · Updated 5 years ago
Alternatives and similar repositories for linkrun
Users who are interested in linkrun are comparing it to the repositories listed below.
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 · Updated last year
- Source real estate prices from the Common Crawl. ☆27 · Updated 6 years ago
- Cloud crawler functions for scrapeulous ☆45 · Updated 4 years ago
- Index Common Crawl archives in tabular format ☆120 · Updated 3 weeks ago
- Various Jupyter notebooks about Common Crawl data ☆53 · Updated 2 months ago
- Google Cloud Storage connector, pre-processor and model for predicting user search intent based on keywords ☆25 · Updated 5 years ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆58 · Updated last month
- ☆28 · Updated 4 years ago
- Scrape all the pages and links of a given domain and write the results to Google Cloud BigQuery. ☆39 · Updated 4 years ago
- Spin up Tor containers and then proxy HTTP requests via these Tor instances ☆43 · Updated 4 years ago
- Example Flask project to use Spacy on AWS Lambda and get the models from an S3 bucket ☆12 · Updated 2 years ago
- A curated list of promising Web Data Extractors resources ☆28 · Updated 5 years ago
- Content Extraction using the PageRank algorithm to find the element containing the best content. ☆12 · Updated 5 years ago
- Streamlit application to keep GPT3 Experimentation sane ☆23 · Updated 3 years ago
- a Hadoop Map Reduce application that retrieves data/articles related to sports from sources like NY Times, Commoncrawl, and Twitter and c… ☆12 · Updated 5 years ago
- Extract social media links and account names from websites. ☆38 · Updated 4 years ago
- 2015 CrunchBase Data Export as CSV ☆161 · Updated 9 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆191 · Updated 6 years ago
- Pre-built template for using newspaper3k on aws lambda ☆17 · Updated 2 years ago
- ☆11 · Updated 4 years ago
- A python utility for downloading Common Crawl data ☆240 · Updated last year
- Open source, privacy focused client side library for the creation and monetisation of online audiences. ☆55 · Updated last year
- Using ML to extract campaign finance data from messy forms for journalism ☆76 · Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- Matches a category of Google's Taxonomy to product that is described in any kind of text data ☆62 · Updated 6 years ago
- Now included in rigour ☆151 · Updated last month
- API for OpenSanctions with support for entity search and bulk matching of data collections. Supports Reconciliation API spec. ☆84 · Updated this week
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Updated 7 years ago
- Integrate Watson Studio and Watson Campaign Automation to tailor your target audience for effective campaigns ☆12 · Updated 3 years ago
- A search engine for Open Data ☆53 · Updated 2 years ago