gsh199449/DistributeCrawler

gsh199449 / DistributeCrawler

A Map/Reduce-based crawler that extracts the article body from major news websites and performs classification and clustering.

☆73

Alternatives and similar repositories for DistributeCrawler

Users that are interested in DistributeCrawler are comparing it to the libraries listed below.

FrankXiong / cqunews-web
View on GitHub
A news website built from data parsed out of the Chongqing University news site, which is crawled with a Java web crawler.
☆11 · Mar 7, 2016 · Updated 10 years ago
yangguang2014 / distributedCrawler
View on GitHub
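A distributed crawler project from the Gao Ying lab at South China University of Technology. It must not be redistributed privately by anyone outside the lab.
☆21 · Jul 13, 2014 · Updated 12 years ago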
gsh199449 / DistributedCrawler
View on GitHub
The Maven version of DistributeCrawler.
☆10 · Jun 20, 2022 · Updated 4 years ago
Harhao / toutiao
View on GitHub
A crawler for the Toutiao (今日头条) technology news API.
☆17 · Sep 26, 2017 · Updated 8 years ago
CrawlScript / WeiboLoginTool
View on GitHub
A Sina Weibo crawler based on WebCollector, plus related login tools such as Sina Weibo cookie retrieval.
☆14 · Nov 21, 2018 · Updated 7 years ago
out0fmemory / GuozhongCrawler
View on GitHub
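GuozhongCrawler is an open-source crawler framework that needs no configuration and is easy to extend. It offers a simple, flexible API, so a crawler can be written with very little code. Its design draws on lessons from several crawler frameworks in China and abroad. It is fully modular and covers the whole crawler life cycle (link extraction, page download, content extraction, persistence), and supports multi…
☆103 · Apr 20, 2015 · Updated 11 years ago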
yinchuandong / DistributedCrawler
View on GitHub
A distributed Java crawler with a master/slave control mechanism.
☆14 · May 21, 2015 · Updated 11 years ago
tankle / newscrawler
View on GitHub
A news website crawler. It can currently crawl news pages from NetEase, Sina, QQ, and Sohu and save them locally.
☆34 · Jun 12, 2015 · Updated 11 years ago
xbynet / crawler
View on GitHub
A simple and flexible web crawler framework for java.
☆19 · Apr 22, 2018 · Updated 8 years ago
Glacier759 / newsEyeSpider
View on GitHub
Crawls newspaper content from various publishers. A simple, customizable crawler driven by configuration files.
☆11 · Sep 1, 2022 · Updated 3 years ago
peopleindreamdontsleep / SparkanSpider
View on GitHub
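A Java crawler with anti-crawler strategies, ETL data cleaning, and Spark offline and real-time news analysis, with results stored in ES.
☆19 · Nov 26, 2018 · Updated 7 years ago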
tbwork / alipay_edit_typer
View on GitHub
Just a DEMO to demonstrate how to use JNA to type chars into alipay's password edit control automatically.
☆12 · Dec 21, 2017 · Updated 8 years ago
ixiaoguo / qqtea.php
View on GitHub
A PHP implementation of the QQ Tea encryption/decryption algorithm.
☆12 · Nov 29, 2016 · Updated 9 years ago
DMinerJackie / JewelCrawler
View on GitHub
Douban movie crawler: a crawler that can crawl movie details and short comments and save them to a MySQL database; also includes sentiment analysis ba…
☆69 · Mar 24, 2019 · Updated 7 years ago
huayonglun / SimulateLogin
View on GitHub
A Java crawler implementation of simulated login.
☆12 · Aug 6, 2016 · Updated 9 years ago
codelibs / fess-crawler
View on GitHub
Web/FileSystem Crawler Library
☆39 · Jul 9, 2026 · Updated 2 weeks ago
l294265421 / cx-extractor-1.1
View on GitHub
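A Java implementation of the algorithm from 《基于行块分布函数的通用网页正文抽取》 (general-purpose web page body-text extraction based on a line-block distribution function). The code comes from the open-source implementation that accompanies the algorithm, though it may be modified later.
☆16 · Oct 29, 2015 · Updated 10 years ago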
JFanZhao / spider
View on GitHub
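Uses java + httpclient + httpcleaner to crawl product information from e-commerce sites in a multithreaded, distributed fashion. Data is stored in HBase, products are indexed with Solr, a Redis queue holds a shared URL repository, and ZooKeeper monitors the life cycle of the crawler nodes.
☆236 · Nov 6, 2020 · Updated 5 years ago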
xiaoyang611 / crawler-denfender
View on GitHub
An anti-web-crawler system.
☆39 · Mar 10, 2015 · Updated 11 years ago
Flowingsun007 / house_spider
View on GitHub
Lianjia house spider: a crawler for Lianjia (链家) second-hand housing listings. Springboot + Webmagic + Mysql + Redis
☆27 · Apr 22, 2021 · Updated 5 years ago
xuziping / wx-crawl
View on GitHub
A crawler for WeChat official account articles.
☆43 · Sep 1, 2022 · Updated 3 years ago
cmqiong / vue-upload-cropper
View on GitHub
A wrapper that combines ELUpload and vue-cropper for initial image rendering and for image upload with cropping.
☆27 · Jun 26, 2018 · Updated 8 years ago
duoan / codes-scratch-crawler
View on GitHub
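Reading notes on the book 《自己动手写网络爬虫》, with hand-typed code. Mainly covers a basic web crawler implementation, web page deduplication algorithms, web page fingerprinting algorithms, and text mining.
☆47 · Jan 9, 2015 · Updated 11 years ago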
orangeMask / spider
View on GitHub
Crawlers for Douyin, the Taobao family of sites, and common news sites.
☆13 · Apr 15, 2022 · Updated 4 years ago
Alpha-su / dbpolicy_crawl
View on GitHub
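A news and policy crawler project that monitors, crawls, filters, and stores content from more than ten thousand websites in real time, with high availability and scalability.
☆41 · Oct 12, 2022 · Updated 3 years ago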
124608760 / newswebsite
View on GitHub
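Includes a news website's home page, category page, detail page, login page, registration page, and profile page. These serve as templates for a first Django practice project.
☆11 · Apr 19, 2017 · Updated 9 years ago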
kun368 / ACManager
View on GitHub
ACM Training Management System of SDUST
☆28 · May 25, 2018 · Updated 8 years ago
wujiuye / QQJoinGroup
View on GitHub
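A QQ group-joining bot that searches for groups by configured keywords and automatically sends join requests. The hard part: scrolling the list requires simulating touch events across processes. Prerequisite: root access is required, and supporting more device models means adding a touch-simulation implementation class for each model. This project is no longer maintained and is provided only for individual developers to study.
☆15 · Jul 23, 2018 · Updated 8 years ago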
chenkai1100 / SpiderFrame
View on GitHub
A distributed web crawler architecture.
☆16 · Sep 26, 2016 · Updated 9 years ago
chenxiaopan / jboa
View on GitHub
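Office automation (maven+spring+springmvc+mybatis). The project has four modules: information management, mail management, attendance management, and permission management. It uses Alibaba's druid connection pool and Shiro as the security framework. The mail module has three sections: compose, inbox, and spam. Compose implements file upload…
☆26 · Jan 17, 2017 · Updated 9 years ago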
crystal-tensor / spide
View on GitHub
A web crawler that mainly collects stock data, foreign exchange data, stock background information, and up-to-date stock news.
☆13 · Aug 13, 2018 · Updated 7 years ago
stephenluu / proxyIpCrawler
View on GitHub
Crawls proxy IPs and saves the ones that are valid and usable.
☆14 · Aug 22, 2014 · Updated 11 years ago
jiangyuanyuan / lotterySpider
View on GitHub
A news crawler built on the Scrapy framework.
☆13 · Jul 26, 2019 · Updated 7 years ago
sunshineclt / n-gram
View on GitHub
Sina News Crawler and Word Segmentation
☆13 · Dec 20, 2017 · Updated 8 years ago
striver-ing / distributed-spider
View on GitHub
A general-purpose distributed crawler for news websites.
☆79 · Jul 17, 2018 · Updated 8 years ago
madpudding / RelationshipCrawler
View on GitHub
Crawlers for CNKI (知网), Wanfang (万方), and the patent office.
☆11 · Mar 20, 2019 · Updated 7 years ago
chmod740 / BaiduBaikeSpider
View on GitHub
Java source code for a multithreaded Baidu Baike crawler. Data is stored in Oracle 11g.
☆13 · Feb 23, 2017 · Updated 9 years ago
guoguicheng / webdatabase
View on GitHub
IndexedDb, web database, IndexedDb class, web database wrapper class, web caching techniques
☆14 · Oct 22, 2018 · Updated 7 years ago
andersonkxiass / Spring-Boot-RabbitMQ
View on GitHub
This is a project to show how to use Spring Boot and RabbitMQ.
☆11 · Jun 27, 2016 · Updated 10 years ago