Divide and Recombine (D&R) with RHIPE

Deep Analysis of Complex Big Data using the R Environment

What is the problem?

Complex big data are ubiquitous today. They challenge current numeric statistical and machine learning methods, visualization methods, statistical models, computational methods, and computational environments.

What is D&R?

D&R is being developed to meet these many challenges. In a D&R analysis, the data are divided into subsets in one or more ways, forming multiple divisions. Numeric and visualization methods are applied to each of the subsets of a division, and the results of each method are recombined across subsets.
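The divide, apply, and recombine steps can be pictured with a minimal sketch. It is written in plain Python for illustration only and does not use RHIPE or Hadoop; the records, the function names, and the choice of a size-weighted mean as the recombination are all assumptions made here, not taken from the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records of the form (site, measurement); not from the source.
records = [
    ("A", 1.0), ("A", 3.0),
    ("B", 10.0), ("B", 14.0), ("B", 12.0),
    ("C", 5.0),
]

def divide(rows, key):
    """Divide rows into subsets, one per distinct value of key(row)."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[key(row)].append(row)
    return dict(subsets)

def apply_method(subset):
    """Numeric method applied to one subset: count and mean of the values."""
    values = [value for _, value in subset]
    return {"n": len(values), "mean": mean(values)}

def recombine(outputs):
    """Recombine subset outputs into an overall mean, weighted by subset size."""
    total = sum(out["n"] for out in outputs.values())
    return sum(out["n"] * out["mean"] for out in outputs.values()) / total

subsets = divide(records, key=lambda row: row[0])
outputs = {name: apply_method(subset) for name, subset in subsets.items()}
overall_mean = recombine(outputs)
```

Each call to `apply_method` touches only one subset, so the subset computations are independent of one another and could run in parallel.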

What are its technical areas?

A D&R statistical method is a pair: a statistical method for division and a statistical method for recombination. Some D&R statistical methods enable very effective study of data whether they are big or small, so they simply extend this best practice to complex big data. Other methods are carried out purely for computational feasibility. D&R computation consists of computational environments for carrying out D&R, including both software for analysis and cluster architecture.
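As one illustration of such a pair, the sketch below assumes a random replicate division and a recombination that averages the subset estimates. The text does not name specific methods, so both choices, the simulated data, and all function names are assumptions made here. The code is plain Python and does not use RHIPE.

```python
import random
from statistics import mean

def replicate_division(rows, n_subsets, seed=0):
    """Division method: shuffle the rows, then deal them into n_subsets subsets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

def fit_line(rows):
    """Method applied to one subset: least-squares (intercept, slope)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)

def mean_recombination(estimates):
    """Recombination method: average each coefficient across the subsets."""
    return tuple(mean(column) for column in zip(*estimates))

# Hypothetical data simulated from y = 2 + 3x plus Gaussian noise.
noise = random.Random(42)
xs = [i / 10 for i in range(200)]
data = [(x, 2.0 + 3.0 * x + noise.gauss(0.0, 0.5)) for x in xs]

subsets = replicate_division(data, n_subsets=4)
dr_estimate = mean_recombination([fit_line(s) for s in subsets])
all_data_estimate = fit_line(data)  # computed only for comparison
```

Comparing `dr_estimate` with `all_data_estimate` gives a rough sense of how much accuracy a particular division and recombination pair gives up relative to fitting all the data at once.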

What is RHIPE?

RHIPE (hree-pay') is the R and Hadoop Integrated Programming Environment. The name means "in a moment" in Greek. RHIPE is a merger of R and Hadoop. R is the widely used, highly acclaimed interactive language and environment for data analysis. Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce distributed compute engine. RHIPE allows an analyst to carry out D&R analysis of complex big data wholly from within R, and it communicates with Hadoop to carry out the big, parallel computations.

How did RHIPE get started?

It was first developed by Saptarshi Guha as part of his PhD thesis in the Purdue Statistics Department. Now there is a core development group and a very active Google discussion group.

What does Hadoop do in this?

Transparently to the user, Hadoop:

  1. Distributes the subsets into the HDFS across a cluster
  2. Schedules and carries out each subset computation with an algorithm that attempts to use a processor as close to each subset as possible
  3. Computes across the outputs of the subset computations in parallel if needed
  4. Provides fault tolerance
  5. Enables simultaneous fair sharing of the cluster by multiple users through fine-grained intermingling of all subset computations.

What does the user have to do?

The user specifies the division into subsets, the subset computations, and the output computations as R commands. These are passed to RHIPE R commands, which manage the communication with Hadoop.
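The split of responsibilities can be sketched as three user-written functions handed to a driver. In the sketch below, `run_job` is a hypothetical single-process stand-in for the role Hadoop plays; it is not RHIPE's API, and the flight records and function names are invented for illustration. The code is plain Python, not R.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework: run map_fn on every record,
    group the emitted values by key, then run reduce_fn once per key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# User-specified piece 1: the division (which subset each record belongs to).
def assign_subset(record):
    yield record["carrier"], record["delay"]

# User-specified piece 2: the computation applied to each subset.
def summarize_subset(carrier, delays):
    return {"n": len(delays), "total": sum(delays), "max": max(delays)}

# User-specified piece 3: the output computation across subset results.
def overall_mean_delay(subset_outputs):
    n = sum(out["n"] for out in subset_outputs.values())
    return sum(out["total"] for out in subset_outputs.values()) / n

# Hypothetical input records; not from the source.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "UA", "delay": 0},
    {"carrier": "UA", "delay": 30},
    {"carrier": "DL", "delay": 15},
]

subset_outputs = run_job(flights, assign_subset, summarize_subset)
mean_delay = overall_mean_delay(subset_outputs)
```

The three user functions contain only analysis logic. Everything about where the data live and in what order the subsets are processed sits inside the driver, which is the part a real cluster framework would replace.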

What are the goals of D&R-RHIPE?

There are two goals, both already achievable with small datasets. The first is deep analysis, which means comprehensive, detailed analysis that does not lose important information in the data through inappropriate data reductions. Achieving deep analysis for any data, small or big, requires the use of both visualization methods and numeric methods on both detailed data and summary statistics. The second goal is to allow analysis exclusively from within R, without having to program in a lower-level language, which is much less efficient and effective.

Why Hadoop?

Principally because of capabilities 1 to 5 just described. Hadoop was designed to handle parallel processing on a cluster of machines with possibly very different performance characteristics, which is very practical: you can use a cluster of machines bought over time, and the old, slow ones are still able to contribute. Also, Hadoop is supported by the Apache Software Foundation, which has an exceptional track record in open-source software.

Why R?

There are several reasons. The first is the exceptionally effective design. R is the open-source implementation of the S language, which won the ACM Software System Award in 1998 because it would "forever alter the way people analyze, visualize, and manipulate data". Other winners include Unix, the World Wide Web, and VisiCalc, so you get the idea of the company it keeps. R is very widely used, has a very effective core development group, and has a vast number of user-contributed packages that add up to, by far, the largest collection of numerical and visualization methods of any software environment for statistics and machine learning.

Why not the parallel R packages?

The parallel R packages are very useful and have a critical role to play in data analysis. However, they will not suffice for analysis of complex big data. They parallelize computation across cores and clusters, but they do not provide the Hadoop capabilities 1 to 5, which add immensely to the computational effectiveness.

How much does this cost?

$0.00: R, Hadoop, and RHIPE are free, open source.