seagatesoft/sde

Structured Data Extractor. An application to extract structured data from web pages. It uses the Data Extraction Based on Partial Tree Alignment (DEPTA) method. (UPDATE: I implemented a newer algorithm: https://github.com/seagatesoft/webdext)

☆50

Alternatives and similar repositories for sde

Users that are interested in sde are comparing it to the libraries listed below.

pydepta / pydepta
A Python implementation of DEPTA
☆84 · Jan 14, 2017 · Updated 9 years ago
seagatesoft / webdext
Intelligent Web Data Extractor
☆74 · Dec 5, 2022 · Updated 3 years ago
JulianEberius / dwtc-extractor
Extraction code used to create the Dresden Web Table Corpus
☆14 · Feb 25, 2015 · Updated 11 years ago
datalib / StatsCounter
Python's missing statistical Swiss Army knife
☆15 · Aug 25, 2015 · Updated 10 years ago
matthewmueller / x-ray-parse
x-ray's selector parser.
☆16 · Feb 2, 2016 · Updated 10 years ago
blaze / datafabric
A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data.
☆13 · Feb 12, 2016 · Updated 10 years ago
MonetDBSolutions / MonetDBe-Java
☆13 · Jul 8, 2024 · Updated 2 years ago
hiroshi-manabe / darts-clone-java
A Java port of darts-clone.
☆48 · May 17, 2014 · Updated 12 years ago
numercfd / aws-fasi
Failover AWS Spot Instances
☆11 · Dec 8, 2017 · Updated 8 years ago
scrapinghub / kafka-scanner
High Level Kafka Scanner
☆19 · Sep 29, 2017 · Updated 8 years ago
WladimirSidorenko / CRFSuite
Tree-Structured, First- and Higher-Order Linear Chain, and Semi-Markov CRFs
☆45 · Nov 14, 2019 · Updated 6 years ago
icantrap / android-dawg
Implementation of a DAWG small and fast enough to work in Android apps
☆10 · Feb 26, 2020 · Updated 6 years ago
scrapinghub / mdr
A Python library to detect and extract listing data from HTML pages.
☆110 · May 5, 2017 · Updated 9 years ago
Deraen / boot-sass
Boot task to compile Sass
☆16 · Dec 25, 2015 · Updated 10 years ago
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15 · Feb 9, 2014 · Updated 12 years ago
USCDataScience / AgePredictor
Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum
☆18 · Jul 1, 2022 · Updated 4 years ago
scrapy / scrapely
A pure-Python HTML screen-scraping library
☆1,884 · Apr 4, 2022 · Updated 4 years ago
commonsearch / urlparse4
Faster replacement for Python's urlparse module
☆46 · Apr 13, 2026 · Updated 3 months ago
java10000 / semantic_similarity_based_on_ANN
Research on computing Chinese semantic similarity based on artificial neural networks
☆11 · Apr 1, 2013 · Updated 13 years ago
zunama / MA-FSA
This is a minimal acyclic finite-state automata algorithm in Java based on the paper, "Incremental Construction of Minimal Acyclic Finite…
☆19 · Dec 31, 2013 · Updated 12 years ago
matpalm / collocations
bigram / trigram analysis of wikipedia; mainly mutual info
☆22 · Mar 6, 2012 · Updated 14 years ago
jermp / s_indexes
Universe-sliced indexes in C++.
☆18 · Jan 8, 2023 · Updated 3 years ago
jodaiber / semantic_compound_splitting
A compound splitter based on the semantic regularities in the vector space of word embeddings.
☆16 · Mar 15, 2017 · Updated 9 years ago
tomlarkworthy / table_scraper
☆19 · Sep 5, 2013 · Updated 12 years ago
wannabeCitizen / NN_viz
For FFL Blog
☆10 · Sep 24, 2015 · Updated 10 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
Riccorl / chinese-word-segmentation-pytorch
Chinese Word Segmentation task based on BERT and implemented in PyTorch
☆14 · Aug 14, 2020 · Updated 5 years ago
zhangxiangnick / wordvec-aligned-en-zh
Aligned bilingual word vectors for English and Chinese
☆11 · Jun 25, 2018 · Updated 8 years ago
vkmenon / simple-transformer
Attention Is All You Need (https://arxiv.org/abs/1706.03762)
☆10 · Apr 26, 2018 · Updated 8 years ago
frankness / kb_qa
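Construction of a temporal knowledge graph for the financial domain, with question answering; uses annual reports as data and Jena as the framework
☆11 · Aug 16, 2018 · Updated 7 years ago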
andrewtrotman / JASSjr
Minimalistic BM25 search engine in C/C++, Java, and nearly 20 other languages
☆21 · Jun 19, 2024 · Updated 2 years ago
skrypka / crowdflower-search
Kaggle competition
☆23 · Jul 15, 2015 · Updated 11 years ago
scrapinghub / webpager
Paginating the web
☆37 · Feb 11, 2014 · Updated 12 years ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆65 · Apr 8, 2026 · Updated 3 months ago
eriknw / dask-patternsearch
Scalable pattern search optimization with dask
☆22 · Apr 12, 2017 · Updated 9 years ago
shashwath94 / Hierarchical-Seq2Seq
A PyTorch implementation of the hierarchical encoder-decoder architecture (HRED) introduced in Sordoni et al (2015). It is a hierarchical…
☆28 · May 5, 2018 · Updated 8 years ago
charmyoung / ArtinxRM
Embedded implementation of PID control with CAN bus using STM32F4 series
☆12 · Jan 1, 2017 · Updated 9 years ago
Qualtagh / DAWG
A Java library capable of constructing character-sequence-storing, directed acyclic graphs of minimal size
☆19 · Oct 13, 2020 · Updated 5 years ago
tomazk / Text-Extraction-Evaluation
Framework for evaluating text extraction algorithms implemented as web services
☆42 · Jun 30, 2012 · Updated 14 years ago