Thursday, December 11, 2014

RStudio setup on Hortonworks Hadoop 2.1 Cluster

Here is a complete set of steps I performed to set up R and RStudio on a small cluster.

Installing R and RStudio
-- R should be installed on the node that runs the Hive server
-- RStudio can be installed anywhere. (I installed on edge node)

sudo rpm -Uvh
sudo yum -y install git wget R
ls /etc/default
sudo ln -s /etc/default/hadoop /etc/profile.d/
cat /etc/profile.d/hadoop | sed 's/export //g' > ~/.Renviron
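
To see what the sed step does before touching a real ~/.Renviron, here is a small sketch that runs the same expression on a scratch file. The sample file and its two variables are made up for illustration; the real Hadoop defaults file on your cluster will have different contents.

```shell
# Sketch: turn shell-style "export VAR=value" lines into the plain
# "VAR=value" lines that R expects in .Renviron.
# The sample file below is a stand-in, not the real Hadoop defaults file.
tmpdir=$(mktemp -d)
cat > "$tmpdir/hadoop" <<'EOF'
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
EOF

# Same sed expression as in the step above, written to a scratch file
# instead of ~/.Renviron.
sed 's/export //g' "$tmpdir/hadoop" > "$tmpdir/Renviron"
cat "$tmpdir/Renviron"
```

R reads ~/.Renviron at startup, so once the output looks right the same command can be pointed at the real file.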

Check latest version of RStudio @
(It should have installation steps; follow those.
They are listed here for completeness, using the release that was current at the time of writing.)
$ sudo yum install openssl098e # Required only for RedHat/CentOS 6 and 7
$ wget
$ sudo yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm

Create a new system user and set a password
sudo useradd rstudio
sudo passwd rstudio
>> hadoop (the password I entered at the prompt)

Login to RStudio at http://hostname:8787
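
If the login page does not load, it can help to check first whether anything answers on port 8787. This is a rough sketch, not part of the original steps: port_open is a hypothetical helper name, "localhost" is a placeholder for your RStudio node, and the /dev/tcp trick needs bash rather than plain sh.

```shell
# Return 0 if a TCP connection to host:port succeeds, non-zero otherwise.
# Relies on bash's /dev/tcp pseudo-device.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# "localhost" is a placeholder; substitute the node where rstudio-server runs.
if port_open localhost 8787; then
  echo "Port 8787 is reachable"
else
  echo "Could not connect to port 8787"
fi
```

A failed connection can also mean a firewall is blocking the port, so this only narrows the problem down.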

Install required packages either from
or in RStudio >> Tools >> Install Packages
install.packages( c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'functional', 'plyr', 'stringr'), repos='')
install.packages( c('bitops', 'reshape2'), repos='')
install.packages( c('RHive'), repos='')

Download latest rmr2 package from:
Copy the tar.gz file to the server with WinSCP and install it through RStudio

### These need to be run every time RStudio is started or restarted
Set environment variables in RStudio:
Sys.setenv(HADOOP_HOME="your hadoop installation directory here e.g. /usr/lib/hadoop")
Sys.setenv(HIVE_HOME="your hive installation directory here e.g. /usr/lib/hive")
# Do NOT execute: Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf/")
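
One possible way to avoid retyping these on every restart is to keep the values in ~/.Renviron, which R reads at startup. I did not do this in my setup, so treat it as an untested sketch: set_renviron_var is a hypothetical helper, the paths are the example values from above, and the demo writes to a scratch file rather than the real ~/.Renviron.

```shell
# Add NAME=value to an Renviron-style file, replacing any earlier
# definition of NAME so repeated runs do not pile up duplicates.
set_renviron_var() {
  file=$1; name=$2; value=$3
  touch "$file"
  # Keep every line except an existing definition of this variable.
  grep -v "^${name}=" "$file" > "$file.tmp" || true
  mv "$file.tmp" "$file"
  printf '%s=%s\n' "$name" "$value" >> "$file"
}

# Demo on a scratch file; use "$HOME/.Renviron" for the real thing.
renviron=$(mktemp)
set_renviron_var "$renviron" HADOOP_HOME /usr/lib/hadoop
set_renviron_var "$renviron" HIVE_HOME /usr/lib/hive
set_renviron_var "$renviron" HADOOP_HOME /usr/lib/hadoop   # repeated call, no duplicate
cat "$renviron"
```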

Sys.setenv("RHIVE_FS_HOME"="your RHive installation directory here e.g. /home/rhive")
This needs to be a local directory on the node where Hive is installed; create one if it doesn't exist. The user created earlier (rstudio) must own this directory (chown -R).
Otherwise you get this error:
Error: Mkdirs failed to create file:/home/rhive/lib/2.0-0.2
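
The directory preparation described above can be sketched roughly as follows. prepare_rhive_home is a hypothetical helper name, and the demo uses a scratch directory and the current user so that it runs without root; on the Hive node you would run the equivalent mkdir and chown with sudo for /home/rhive and the rstudio user.

```shell
# Create the directory (and any missing parents) and hand it to the owner.
prepare_rhive_home() {
  dir=$1; owner=$2
  mkdir -p "$dir" && chown -R "$owner" "$dir"
}

# Demo with a scratch path and the current user's numeric id.
# On the real node: directory /home/rhive, owner rstudio, run via sudo.
demo_dir=$(mktemp -d)/rhive
prepare_rhive_home "$demo_dir" "$(id -u)"
ls -ld "$demo_dir"
```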

rhive.connect(host="IP ADDRESS/Hostname", port=10000, hiveServer2=TRUE)

If you get this error:
Error: java.sql.SQLException: Error while processing statement: file:///rhive/lib/2.0-0.2/rhive_udf.jar does not exist.
Check that the jar file is in that directory and that the rstudio user has permission on it.
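
A quick way to do that check from a shell is sketched below. check_readable is a hypothetical helper, and the demo runs against a scratch file standing in for rhive_udf.jar; on the cluster you would point it at the path from the error message.

```shell
# Report whether a file exists and is readable by the user running this script.
check_readable() {
  if [ -r "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or unreadable: $1"
    return 1
  fi
}

# Demo against a scratch file standing in for rhive_udf.jar.
scratch=$(mktemp -d)
touch "$scratch/rhive_udf.jar"
check_readable "$scratch/rhive_udf.jar"
check_readable "$scratch/no_such.jar" || true
```

The result only reflects the permissions of whoever runs it, so run it as the rstudio user (for example through sudo -u rstudio); root can read almost anything and would hide a permission problem.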

Hope it helps.

