whym/wikihadoop

Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop

☆85

Alternatives and similar repositories for wikihadoop

Users that are interested in wikihadoop are comparing it to the libraries listed below.

spotify / hadoop-openpgp-codec
Codec for Hadoop adding OpenPGP encryption using Bouncy Castle
☆17 · Aug 18, 2011 · Updated 14 years ago
cloudera / bigtop
Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem. The primary goal of Bigtop is to build a …
☆51 · Jul 4, 2011 · Updated 15 years ago
etsy / cascading.jruby
A JRuby DSL for Cascading
☆41 · Sep 23, 2015 · Updated 10 years ago
matpalm / common-crawl
playing around with the common crawl dataset
☆70 · Aug 18, 2012 · Updated 13 years ago
Apoc2400 / Reftag
Wikipedia citation tool for Google Books, New York Times, ISBN, DOI and more
☆22 · Oct 29, 2016 · Updated 9 years ago
julienledem / Pig-scripting-examples
Examples of use of pig scripting languages capabilities
☆39 · Aug 1, 2016 · Updated 9 years ago
duointeractive / tamarin
S3 log bucket parser app for Django
☆15 · Sep 12, 2011 · Updated 14 years ago
tomafro / rails-activerecord-columnreader
A simple column reader for ActiveRecord
☆13 · Nov 1, 2011 · Updated 14 years ago
dkmfbk / pikes
Pikes is a Knowledge Extraction Suite
☆23 · Nov 14, 2023 · Updated 2 years ago
LanceNorskog / LSH-Hadoop
Implementation of Tyler Neylon's Locality-Specific Hash based on simplex tesselations
☆28 · Oct 15, 2011 · Updated 14 years ago
lintool / Ivory
A Hadoop toolkit for web-scale information retrieval research
☆87 · Dec 12, 2014 · Updated 11 years ago
davidandrzej / chisel
Clojure wrapper for LDA topic modeling in MALLET
☆33 · Sep 6, 2011 · Updated 14 years ago
sishen / hbase-ruby
ruby client for Hadoop HBase
☆58 · Mar 8, 2009 · Updated 17 years ago
hbutani / SQLWindowing
SQL Windowing Functions for Hadoop
☆65 · Jun 20, 2022 · Updated 4 years ago
jblomo / oddjob
useful JVM classes for the mrjob hadoop streaming framework
☆31 · Jun 20, 2013 · Updated 13 years ago
Fonsan / dunder
For ruby; a simple way of doing heavy work in a background thread in and when you really need the object it will block until it is done
☆23 · Apr 8, 2011 · Updated 15 years ago
agesmundo / HadoopPerceptron
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36266.pdf
☆14 · Apr 25, 2012 · Updated 14 years ago
lintool / Cloud9
Cloud9 is a Hadoop toolkit for working with big data
☆237 · Dec 15, 2015 · Updated 10 years ago
toddlipcon / mlockall_agent
JVMTI agent which calls mlockall and setuids down to a target user upon initialization
☆21 · Sep 13, 2011 · Updated 14 years ago
sunng87 / clojalk
A beanstalkd (distributed task queue) clone in clojure
☆20 · Dec 11, 2011 · Updated 14 years ago
luposdate / luposdate
Semantic Web database
☆19 · Sep 1, 2022 · Updated 3 years ago
sgibbons / cenum
C-style enums for ruby
☆14 · Jun 12, 2011 · Updated 15 years ago
jzachr / goldenorb
GoldenOrb is an open-source implementation of Pregel, Google's graph processing framework
☆293 · Jun 29, 2022 · Updated 4 years ago
josephreisinger / dist_lda
distributed latent dirichlet allocation
☆29 · Dec 15, 2011 · Updated 14 years ago
infochimps-labs / wonderdog
Bulk loading for elastic search
☆186 · Dec 16, 2023 · Updated 2 years ago
dapete42 / vcat
vCat Java code
☆11 · Updated this week
distributed-text-services / distributed-text-services.github.io
☆11 · Feb 13, 2026 · Updated 4 months ago
kasei / attean
A Perl Semantic Web Framework
☆19 · May 20, 2026 · Updated last month
geoparser / geolocator-3.0
☆12 · Oct 25, 2015 · Updated 10 years ago
sudar / Yahoo_LDA
Yahoo!'s topic modelling framework using Latent Dirichlet Allocation
☆337 · Sep 21, 2011 · Updated 14 years ago
evolvedbinary / docker-existdb
Docker image builder for eXist-db
☆13 · Mar 16, 2021 · Updated 5 years ago
unixpickle / neuralspell
Spell and pronounce words with a neural network
☆10 · Feb 13, 2017 · Updated 9 years ago
aria42 / type-level-tagger
State-of-The-Art Unsupervised Part-Of-Speech Type-Level Tagger in 300 Lines of Clojure
☆41 · Sep 15, 2010 · Updated 15 years ago
joewilliams / haproxy_join
Break up your haproxy configs and join them together
☆57 · Sep 7, 2012 · Updated 13 years ago
ksclarke / freelib-marc4j-exist
An extension for eXist-db that allows the reading and writing of MARC into and out from the database
☆11 · Mar 6, 2016 · Updated 10 years ago
mozy / ruby-protocol-buffers
An implementation of Protocol Buffers for Ruby.
☆58 · Feb 20, 2013 · Updated 13 years ago
bpaquet / vmware-cli
VMWare Cli for ESX, ESXi and Converter
☆18 · Apr 22, 2015 · Updated 11 years ago
securityroots / passdb
Ruby interface to cirt.net default passwords database
☆19 · May 4, 2011 · Updated 15 years ago
nexacenter / public-contracts
☆10 · Apr 20, 2016 · Updated 10 years ago