chrislinan/cx-extractor-python

A Python implementation of the general-purpose web page main-text extraction algorithm based on line-block distribution functions, with added English support. Supports both Chinese and English.

☆482

Alternatives and similar repositories for cx-extractor-python

Users who are interested in cx-extractor-python are comparing it to the libraries listed below.

GeneralNewsExtractor / GeneralNewsExtractor
General-purpose main-text extractor for news web pages, beta version.
☆3,788 · Apr 21, 2026 · Updated 2 months ago
reorx / cx-extractor
Automatically exported from code.google.com/p/cx-extractor
☆29 · Apr 1, 2015 · Updated 11 years ago
chrislinan / cx-extractor
General-purpose web page main-text extraction based on line-block distribution functions, C# version
☆28 · Sep 28, 2015 · Updated 10 years ago
fancyspeed / sf-extractor
HTML content extractor: cx-extractor in Python and sf-extractor
☆18 · Apr 18, 2016 · Updated 10 years ago
rainyear / cix-extractor-py
General-purpose web page main-text (and image) extraction based on line-block distribution functions - Python version
☆114 · Sep 22, 2016 · Updated 9 years ago
zezhix / html-extractor
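An optimized version of the general-purpose web page main-text extraction algorithm based on line-block distribution functions, implemented in Python
☆61 · Feb 17, 2020 · Updated 6 years ago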
LeetaoGoooo / MovieHeavens
🎬 A simple movie search tool based on PyQt5
☆654 · Oct 11, 2022 · Updated 3 years ago
MikeChongCan / scylla
Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era
☆4,017 · Jun 9, 2025 · Updated last year
xianhu / PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
☆1,837 · Jun 10, 2022 · Updated 4 years ago
jtyoui / PyUnit
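Sogou lexicon download, new-word discovery algorithms, common utility classes, Baidu applications, translation, weather forecasts, Chinese error correction, time parsing and extraction from string text data, Baidu Wenku download, entity extraction, and more
☆726 · Mar 24, 2022 · Updated 4 years ago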
chatopera / Synonyms
Chinese synonyms: a toolkit for chatbots and intelligent question answering
☆5,107 · Feb 1, 2026 · Updated 5 months ago
chyroc / WechatSogou
A crawler interface for WeChat official accounts, based on Sogou WeChat search
☆6,345 · Mar 7, 2026 · Updated 4 months ago
fate0 / getproxy
getproxy is a program that crawls proxy-listing websites to obtain http/https proxies
☆830 · Aug 2, 2022 · Updated 3 years ago
stanzhai / Html2Article
HTML web page main-text extraction
☆496 · May 9, 2022 · Updated 4 years ago
Gerapy / GerapyAutoExtractor
Auto Extractor Module
☆338 · Aug 19, 2024 · Updated last year
antct / city-vein
Urban structure characterized by public lines
☆776 · May 19, 2026 · Updated 2 months ago
SpiderClub / haipproxy
Highly available distributed IP proxy pool, powered by Scrapy and Redis
☆5,535 · Dec 26, 2022 · Updated 3 years ago
grangier / python-goose
HTML Content / Article Extractor, web scraping lib in Python
☆4,100 · Mar 10, 2026 · Updated 4 months ago
codelucas / newspaper
newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
☆15,114 · Jul 8, 2026 · Updated last week
hangsz / pandas-tutorial
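Suited to learners moving from beginner to intermediate level; once you have a systematic foundation, the rest comes down to practice.
☆1,890 · Mar 30, 2024 · Updated 2 years ago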
SpiderClub / weibospider
A distributed crawler for Weibo, built with Celery and Requests.
☆4,794 · Jul 11, 2020 · Updated 6 years ago
zeromicro / ddl-parser
A tool to parse MySQL DDL.
☆15 · Jun 14, 2023 · Updated 3 years ago
StrongBoy998 / CrawlArticle
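A news main-text extraction module based on text density, compatible with Python 2 and Python 3. Pass in a news URL or page source to get back the title, publication time, and body text.
☆14 · Jun 10, 2018 · Updated 8 years ago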
my8100 / scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.…
☆3,409 · Feb 19, 2025 · Updated last year
flaggo / pydu
Useful data structures and utils for Python.
☆338 · Jun 17, 2026 · Updated last month
gyh1621 / GetSubtitles
Download matching subtitles in one step
☆741 · Jul 13, 2020 · Updated 6 years ago
howie6879 / ruia
Async Python 3.6+ web scraping micro-framework based on asyncio
☆1,739 · Jul 1, 2023 · Updated 3 years ago
TGmeetup / TGmeetup
A collection set of technical groups' information (meetup).
☆147 · Nov 1, 2020 · Updated 5 years ago
Denon / syncPlaylist
Sync playlists between music platforms
☆239 · Jan 21, 2018 · Updated 8 years ago
fxsjy / jparser
A readability parser which can extract title, content, and images from HTML pages
☆86 · May 29, 2020 · Updated 6 years ago
zdict / zdict
The last online dictionary CLI framework you need.
☆632 · Jun 24, 2023 · Updated 3 years ago
Kr1s77 / awesome-python-login-model
😮 Python scripts that simulate logging in to some major websites, plus some simple crawlers. Hope they help ❤️. If you like it, remember to give it a star 🌟
☆16,231 · Jul 26, 2022 · Updated 3 years ago
jhao104 / proxy_pool
Python ProxyPool for web spider
☆23,499 · Jun 15, 2026 · Updated last month
letiantian / TextRank4ZH
Automatically extract keywords and summaries from Chinese text
☆3,396 · May 7, 2025 · Updated last year
NtesEyes / pylane
A Python VM injector with debug tools, based on gdb.
☆359 · Nov 6, 2022 · Updated 3 years ago
kingking888 / CommNewsExtractor
General-purpose article extraction: body text, title, time, author, images, audio/video, contact information, and more
☆23 · Mar 19, 2023 · Updated 3 years ago
fake-useragent / fake-useragent
Up-to-date simple useragent faker with real world database
☆4,054 · Mar 29, 2026 · Updated 3 months ago
hee0624 / extract_news
Python package to parse news from various news websites
☆13 · Sep 19, 2018 · Updated 7 years ago
goose3 / goose3
A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
☆913 · Jun 22, 2026 · Updated 3 weeks ago