laymen / CrawlerLinks

Simulated Sina Weibo login, plus data cleaning that mainly covers sentence segmentation, punctuation cleaning, and stop-word removal

☆9

Alternatives and similar repositories for Crawler

Users that are interested in Crawler are comparing it to the libraries listed below

liuhuanyong / SougouWordsCollector
Word dictionary crawler and format converter for Sogou Input Method (搜狗输入法) lexicons
☆25 Updated 7 years ago
ownthink / Qimen
Qimen refers to the art of Qimen Dunjia (奇门遁甲); it is a tool for extracting entities of various kinds.
☆29 Updated 5 years ago
shipengtaov / weibo_sentiment
Sentiment analysis of Weibo followers
☆44 Updated 8 years ago
liuhuanyong / WeiboIndexSpyder
Self-implemented WeiboIndexSpyder based on Selenium: collects the Sina Weibo Index (微指数), including the overall index, the mobile index, and the PC index
☆31 Updated 7 years ago
lining0806 / TextClassify2
A text classification system combining multiple algorithms
☆24 Updated 8 years ago
yuanjie-ai / tql-Python
A thinking trap: using idealized models to reason about complex real-world problems
☆38 Updated 4 years ago
YijunRan / Opinion-leaders-mining
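This paper proposes a method for mining opinion leaders in QQ groups based on reply relationships. The method first builds a lexicon of response words, then uses the Aho-Corasick algorithm to match response words in the chat text and construct the network of user reply relationships, and finally applies important-node identification methods from social network analysis to find the opinion leaders. The method achieves fairly high accuracy in discovering opinion leaders in QQ groups…
☆21 Updated 9 years ago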
ppy2790 / weixin
WeChat friends crawler and image processing
☆49 Updated 8 years ago
liuhuanyong / BaiduIndexSpyder
Self-implemented BaiduIndexSpyder based on Selenium, with index image decoding and digit image conversion; automatically collects the historical Baidu search index for given keywords
☆42 Updated 7 years ago
Wooden-Robot / spider-practice
☆20 Updated 8 years ago
da2vin / Spider_index
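Crawls the Baidu Index and the Ali Index using Selenium and stores the results in HBase, with automatic CAPTCHA recognition and multithreading control
☆32 Updated 8 years ago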
hanxlinsist / jupyter_hub
Python code for machine learning algorithms, visualization, and data analysis
☆34 Updated 7 years ago
jackeyGao / jianshuHot
Uses Scrapy to crawl trending Jianshu articles, builds an e-book from them, and sends it to Kindle
☆31 Updated 7 years ago
frywang / DataMining
Cleans corpora collected from DBpedia and online encyclopedias (百科) to obtain suitable triples
☆14 Updated 8 years ago
liuhuanyong / LanguagePlatform
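A personally implemented experimental platform for language computing, built on Django and semantic-ui. Features include general natural language processing, word computing, social hot-topic computing, person computing, literary profiling, job profiling, and other social computing functions
☆29 Updated 7 years ago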
ECNUdase / Seminar-Deep-Learning
Reading and discussion seminar on the book Deep Learning
☆42 Updated 6 years ago
iHealth-ecnu / iHealth_crawler
Content crawler for the iHealth project (a medical consultation crawler based on Python and MongoDB)
☆26 Updated 5 years ago
lcdevelop / page-classify
A machine learning text classifier
☆46 Updated 9 years ago
luzhijun / weiboSA
Weibo topic search and analysis: Shanghai rentals
☆115 Updated 8 years ago
yesseecity / hanlp-python
Separates the hanLP part of the earlier hanLP-python-flask into its own project
☆59 Updated 7 years ago
MashiMaroLjc / dudulu
APIs of text mining
☆34 Updated 8 years ago
Dengqlbq / ZhiHuSpider
Crawler for Zhihu questions and answers
☆25 Updated 7 years ago
multiangle / Distributed_Microblog_Spider
Distributed Sina Weibo crawler
☆31 Updated 8 years ago
shibing624 / authorship-identification
[Toutiao] Text authorship identification competition
☆10 Updated 6 years ago
NightMarcher / zhihu-crawler
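A from-scratch implementation that crawls Zhihu on a schedule, mines valuable information from it, and visualizes the crawled data on a web page.
☆66 Updated 2 years ago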
liuhuanyong / BaikeInfoExtraction
Self-implemented infobox extraction for baike knowledge bases via online analysis: structured infobox extraction from Hudong Baike, Baidu Baike, and Sogou Baike entries, plus fusion of encyclopedia knowledge
☆35 Updated 7 years ago
bojone / n2n-ocr-for-qqcaptcha
An n2n (end-to-end) OCR for Tencent QQ CAPTCHA recognition
☆86 Updated 7 years ago
lihait / ExtractTopicSentence
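The title-classification-based method for extracting topic sentences can be described as follows: given a news report, compute the similarity between its title and the set of news topic words to decide whether the title is indicative. For an indicative title, extract the sentence in the report most similar to it as the topic sentence; otherwise, combine multiple features to score the importance of each sentence in the report and take the highest-scoring sentence as the topic sentence.
☆40 Updated 8 years ago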
fxsjy / jiebademo
a demo site for jieba
☆111 Updated 11 years ago
TongzheZhang / DF-competition-sogou
Sogou user profile mining for big-data precision marketing
☆36 Updated 8 years ago