out0fmemory / GuozhongCrawler

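GuozhongCrawler is an open-source crawler framework that requires no configuration and is easy to extend. It provides a simple, flexible API, so a crawler can be implemented with only a small amount of code. Its design draws on lessons from a number of crawler frameworks in China and abroad. It is fully modular and covers the entire crawler lifecycle (link extraction, page downloading, content extraction, persistence). It supports multithreaded and distributed crawling, along with automatic retries, custom JavaScript execution, custom cookies, and similar features. To deal with IPs being banned after a site is crawled repeatedly, GuozhongCrawler uses a dynamic IP rotation mechanism to keep IPs from being banned. In addition, all source comments and log output are written in plain, easy-to-understand Chinese, to help beginners gain a deeper understanding.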

☆96

Alternatives and similar repositories for GuozhongCrawler

Users that are interested in GuozhongCrawler are comparing it to the libraries listed below

wo4li2wang / MSpider
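A multithreaded crawler that filters by term-frequency density and uses four search engines (Baidu, Google, Soso, and 360 Search) as seed sources; results are stored in MySQL.
☆97 Updated 11 years ago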
javagaorui5944 / ProxyIpPool
The Crawler Proxy IP Pool Component
☆64 Updated 2 years ago
QiuMing / zhihuWebSpider
A Zhihu crawler written in Java, based on the webmagic framework.
☆69 Updated 9 years ago
badaozhai / wechat_webdriver_spider
A Java crawler that uses Selenium to scrape WeChat official account articles from Sogou.
☆50 Updated 9 years ago
zyongjava / spider
A Java crawler system built with Spring Boot and webmagic.
☆61 Updated 8 years ago
shenbaise / goodcrawler
A web crawler.
☆52 Updated 11 years ago
denghuichao / proxy-pool
A proxy IP pool service for crawlers; other crawler programs can fetch proxies through its REST API.
☆113 Updated 2 years ago
hxyfj / LagouSpider
A data crawler for Lagou.
☆32 Updated 7 years ago
gsh199449 / DistributeCrawler
A Map/Reduce-based crawler that extracts article text from major news sites and performs classification and clustering.
☆74 Updated 11 years ago
ysc / rank
rank is an SEO tool for analyzing a website's indexing and ranking in search engines.
☆67 Updated 8 years ago
JFanZhao / spider
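A multithreaded, distributed crawler built with java + httpclient + httpcleaner that scrapes product information from e-commerce sites. Data is stored in HBase and products are indexed with Solr; a Redis queue holds a shared URL repository; ZooKeeper monitors the lifecycle of crawler nodes, and so on.
☆232 Updated 4 years ago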
liyifeng1994 / xfshxzs
Xiaofeng Life Assistant: a query-style WeChat official account developed in Java, built on crawlers and APIs.
☆31 Updated 7 years ago
thegodofwar / Spider
A web novel crawler implemented with HttpClient 4+; popular novel sites can be added dynamically.
☆31 Updated 12 years ago
letcheng / ProxyPool
An automatic proxy pool component for dealing with anti-crawler measures.
☆78 Updated 8 years ago
xautlx / nutch-htmlunit
An extension built on Apache Nutch and Htmlunit: a plugin for crawling and parsing AJAX pages.
☆124 Updated 10 years ago
justinscript / shopping.plat
A social shopping-guide platform.
☆37 Updated 6 years ago
tigerxue / ghost-login
☆27 Updated 8 years ago
liyifeng1994 / webmagic-csdnblog
A small CSDN blog crawler written with WebMagic.
☆91 Updated 7 years ago
yangchenjava / com.yangc.utils
Utility classes accumulated over the course of work.
☆85 Updated 7 years ago
hotstu / javaCaptcha
Java CAPTCHA recognition using SVM.
☆34 Updated 10 years ago
pumadong / cl-member
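A membership management system: includes the website's member center, back-office member management, a member API for other systems, and member-related automated tasks.
☆95 Updated 10 years ago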
CoolAcsi / baidu-chain-dog
A crawler for Baidu Laici Dog (莱茨狗).
☆51 Updated 7 years ago
yinchuandong / JavaVerify
A Java CAPTCHA recognition library for sticky characters
☆207 Updated 10 years ago
Javen205 / jfinal_qyweixin
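jfinal_qyweixin is built on top of jfinal-weixin; browsing the Demo code is enough to get started with rapid development. It supports both WeChat enterprise accounts (微信企业号) and Enterprise WeChat (企业微信).
☆81 Updated 8 years ago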
wgybzbrobot / sentiment-search
A public-opinion search service framework; it uses version 4.8 of Lucene and Solr.
☆61 Updated 9 years ago
daikaixian / JobHunter
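Scrapes job postings from Lagou, Neitui, Zhilian Zhaopin, 51job (前程无忧), and other sites, stores them in a structured format, and presents them as charts.
☆68 Updated 5 years ago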
qianlicao51 / douyin
A Douyin video scraper.
☆79 Updated 7 years ago
luowei / lamps-sell
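A sales portal similar in function to Taobao Mall. The front end covers categorized product display and queries, search, user registration, reviews and comments, order placement, and order lookup; the back end covers management of users, roles and permissions, products, manufacturers, comments and reviews, news, ads, and orders...
☆97 Updated 12 years ago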
xiaoyang611 / crawler-denfender
An anti-web-crawler system.
☆39 Updated 10 years ago
dhengyi / ip-proxy-pools-regularly
Implements scheduled crawling and an IP proxy pool.
☆148 Updated 7 years ago