commoncrawl / gzipstreamLinks

gzipstream allows Python to process multi-part gzip files from a streaming source

☆23

Alternatives and similar repositories for gzipstream

Users that are interested in gzipstream are comparing it to the libraries listed below

Sorting:

koursaros-ai / microservices
Neural Elastic Inference and Search
☆19Updated 5 years ago
pferrel / solr-recommender
☆16Updated 8 years ago
istresearch / traptor
Traptor -- A distributed Twitter feed
☆26Updated 2 years ago
anjishnu / Crackr
Keyword Extraction system using Brown Clustering - (This version is trained to extract keywords from job listings)
☆18Updated 10 years ago
seomoz / mozsci
Data science tools from Moz
☆23Updated 8 years ago
rossf7 / wikireverse
Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.
☆38Updated 6 years ago
bkj / wit
Algorithms for "schema matching"
☆26Updated 9 years ago
gfjreg / CommonCrawl
A distributed system for mining common crawl using SQS, AWS-EC2 and S3
☆21Updated 11 years ago
ShinNoNoir / twitterwebsearch
(BROKEN, help wanted)
☆15Updated 9 years ago
commonsense / divisi2
A Python library for learning from dimensionality reduction, supporting sparse and dense matrices.
☆78Updated 8 years ago
adamfabish / Reduction
Reduction is a python script which automatically summarizes a text by extracting the sentences which are deemed to be most important.
☆54Updated 10 years ago
proycon / python-ucto
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first step in almost any Natural Language Processing task, yet…
☆29Updated 7 months ago
nasa-jpl-memex / topic_space
Topic modeling web application
☆41Updated 10 years ago
piskvorky / gensim-simserver
[NO LONGER MAINTAINED AS OPEN SOURCE - USE SCALETEXT.COM INSTEAD]
☆108Updated 12 years ago
anthonygarvan / MinHash
☆24Updated 7 years ago
semanticize / st
Semanticizest: dump parser and client
☆20Updated 9 years ago
rodricios / crawl-to-the-future
An attempt at creating a silver/gold standard dataset for backtesting yesterday & today's content-extractors
☆35Updated 10 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19Updated 10 years ago
mattandahalfew / Levenshtein_search
Python search module for fast approximate string matching
☆54Updated 2 years ago
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15Updated 11 years ago
nlplab / stav
stav text annotation visualiser
☆34Updated 13 years ago
nik0spapp / sdalg
Web page segmentation and noise removal
☆55Updated last year
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48Updated 3 years ago
charliermarsh / semantic
A Python library for extracting semantic information from text, such as dates and numbers.
☆77Updated 3 years ago
ContinuumIO / topik
A Topic Modeling toolbox
☆92Updated 9 years ago
Lab41 / pythia
Supervised learning for novelty detection in text
☆78Updated 8 years ago
piskvorky / sim-shootout
Code for "Performance shootout between nearest-neighbour libraries": http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neig…
☆99Updated 10 years ago
honnibal / text_classification
Relatively simple text classification powered by spaCy
☆41Updated 9 years ago
dedupeio / pyhacrf
Hidden alignment conditional random field for classifying string pairs.
☆24Updated last week
dedupeio / affinegap
A Cython implementation of the affine gap string distance
☆57Updated 2 years ago