DigitalPebble/behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

☆283

Alternatives and similar repositories for behemoth

Users interested in behemoth are comparing it to the repositories listed below.

wpm / Hadoop-GATE
A Hadoop job that runs GATE applications
☆15 · Oct 16, 2013 · Updated 12 years ago
metzlerd / mavuno
Mavuno: A Hadoop-Based Text Mining Toolkit
☆48 · Feb 7, 2012 · Updated 14 years ago
Aloisius / nutch
CommonCrawl Test version of Nutch
☆16 · Jul 10, 2014 · Updated 12 years ago
tdunning / pig-vector
Mahout vector encoding for pig
☆53 · Nov 20, 2022 · Updated 3 years ago
julienledem / Pig-scripting-examples
Examples of use of pig scripting languages capabilities
☆39 · Aug 1, 2016 · Updated 9 years ago
lintool / Cloud9
Cloud9 is a Hadoop toolkit for working with big data
☆237 · Dec 15, 2015 · Updated 10 years ago
ogrisel / pignlproc
Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
☆163 · Nov 8, 2022 · Updated 3 years ago
nasa-jpl-memex / memex-gate
General Architecture for Text Engineering
☆50 · Mar 23, 2016 · Updated 10 years ago
lintool / Ivory
A Hadoop toolkit for web-scale information retrieval research
☆87 · Dec 12, 2014 · Updated 11 years ago
cloudera / emailarchive
Hadoop for archiving email
☆23 · Sep 29, 2011 · Updated 14 years ago
frankscholten / mahout
Mirror of Apache Mahout
☆15 · Mar 24, 2015 · Updated 11 years ago
LinkedInAttic / datafu
Hadoop library for large-scale data processing, now an Apache Incubator project
☆581 · Jul 8, 2014 · Updated 12 years ago
jghoman / haivvreo
Hive + Avro. Serde for working with Avro in Hive
☆60 · Dec 16, 2023 · Updated 2 years ago
twitter / elephant-bird
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
☆1,134 · Apr 10, 2023 · Updated 3 years ago
jzachr / goldenorb
GoldenOrb is an open-source implementation of Pregel, Google's graph processing framework
☆293 · Jun 29, 2022 · Updated 4 years ago
alienrobotwizard / varaha
Machine learning and natural language processing with Apache Pig
☆53 · Dec 17, 2013 · Updated 12 years ago
wihl / Timberwolf
Hadoop HBase ingestion of Microsoft Exchange
☆15 · Apr 6, 2012 · Updated 14 years ago
algoriffic / lsa4solr
Document clustering based on Latent Semantic Analysis
☆96 · Apr 29, 2010 · Updated 16 years ago
CLLKazan / UIMA-Ext
A set of Apache UIMA addons and utilities. Some of them are language-independent; others may be specific to the Russian language.
☆28 · Oct 8, 2021 · Updated 4 years ago
tdunning / Plume
Explorations relative to cloning FlumeJava
☆94 · Oct 13, 2020 · Updated 5 years ago
alienrobotwizard / sounder
A grouping of Apache Pig examples.
☆65 · Oct 13, 2020 · Updated 5 years ago
rdelbru / SIREn
SIREn - Semi-Structured Information Retrieval Engine
☆109 · Jun 7, 2021 · Updated 5 years ago
commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆226 · Dec 22, 2022 · Updated 3 years ago
NLP4L / attic-nlp4l
(deprecated) Please use new nlp4l instead.
☆65 · Sep 22, 2016 · Updated 9 years ago
jpatanooga / Caduceus
Set of example algorithm implementations focused on statistics and machine learning
☆31 · Apr 11, 2011 · Updated 15 years ago
twitter / cassovary
Cassovary is a simple big graph processing library for the JVM
☆1,053 · Oct 8, 2021 · Updated 4 years ago
jpatanooga / Metronome
Suite of parallel iterative algorithms built on top of Iterative Reduce
☆111 · Jun 24, 2014 · Updated 12 years ago
elsevierlabs-os / soda
Solr Dictionary Annotator (Microservice for Spark)
☆71 · Feb 4, 2020 · Updated 6 years ago
flaxsearch / lucene-solr-intervals
Flax-maintained fork of Lucene/Solr with support for interval queries
☆15 · Oct 9, 2015 · Updated 10 years ago
apache / uima-uimafit
Apache UIMA uimaFIT
☆33 · May 15, 2026 · Updated 2 months ago
tjake / Solandra
Solandra = Solr + Cassandra
☆881 · Mar 9, 2016 · Updated 10 years ago
commoncrawl / commoncrawl
Common Crawl support library to access 2008-2012 crawl archives (ARC files)
☆508 · Nov 29, 2017 · Updated 8 years ago
o19s / match-query-parser
Search a single field with different query time analyzers in Solr
☆24 · Feb 12, 2020 · Updated 6 years ago
hbase-trx / hbase-transactional-tableindexed
Transactional and indexing extensions for hbase
☆72 · Apr 5, 2011 · Updated 15 years ago
matpalm / common-crawl
playing around with the common crawl dataset
☆70 · Aug 18, 2012 · Updated 13 years ago
sonalgoyal / hiho
Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop.
☆92 · Apr 11, 2013 · Updated 13 years ago
clearnlp / clearnlp
Fast and robust NLP components implemented in Java.
☆55 · Oct 13, 2020 · Updated 5 years ago
LanceNorskog / LSH-Hadoop
Implementation of Tyler Neylon's Locality-Specific Hash based on simplex tessellations
☆28 · Oct 15, 2011 · Updated 14 years ago
datawrangling / spatialanalytics
Where 2.0 Workshop Code: Spatial Analysis of Tweets using Hadoop, Pig, Python & Mechanical Turk. Slides here: http://www.slideshare.net/…
☆134 · Mar 31, 2010 · Updated 16 years ago