Thursday, December 11, 2014

RStudio setup on Hortonworks Hadoop 2.1 Cluster

Here is a complete set of steps I performed to set up R and RStudio on a small cluster.

Installing R and RStudio
-- R should be installed on the node that hosts the Hive server
-- RStudio can be installed anywhere (I installed it on the edge node)

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install git wget R
ls /etc/default
sudo ln -s /etc/default/hadoop /etc/profile.d/hadoop.sh
cat /etc/profile.d/hadoop.sh | sed 's/export //g' > ~/.Renviron
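
To make the last step concrete, here is a small sketch that runs the same sed expression on a throwaway file. The helper name to_renviron and the two sample variables are made up for illustration; your /etc/default/hadoop will contain different lines.

```shell
# Sketch: what the sed step does, shown on a throwaway file with made-up contents.
# to_renviron is a hypothetical helper name, not part of any package.
to_renviron() {
  # Strip every "export " so each line becomes plain KEY=VALUE,
  # which is the format R reads from ~/.Renviron.
  sed 's/export //g' "$1"
}

sample=$(mktemp)
printf 'export HADOOP_HOME=/usr/lib/hadoop\nexport EXAMPLE_VAR=some_value\n' > "$sample"
to_renviron "$sample"
rm -f "$sample"
```

The two lines come back without the leading "export ", which is what ends up in ~/.Renviron.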

Check the latest version of RStudio Server at
http://www.rstudio.com/products/rstudio/download-server/
(It should list the installation steps; follow those.
Listing them here for completeness, with the release version current at the time of writing.)
$ sudo yum install openssl098e # Required only for RedHat/CentOS 6 and 7
$ wget http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm

Create a new system user and set its password
sudo useradd rstudio
sudo passwd rstudio
>> hadoop (the password I entered at the prompt)

Log in to RStudio at http://hostname:8787 with the user created above

Install the required packages either from the RStudio menu
(Tools >> Install Packages)
OR from the R console:
install.packages( c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'functional', 'plyr', 'stringr'), repos='http://cran.revolutionanalytics.com')
install.packages( c('bitops', 'reshape2'), repos='http://cran.revolutionanalytics.com')
install.packages( c('RHive'), repos='http://cran.revolutionanalytics.com')

Download latest rmr2 package from:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Copy the tar.gz file to the server (I used WinSCP) and install it through RStudio
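
If you prefer the shell over the RStudio dialog, a downloaded source tarball can also be installed with R CMD INSTALL. This is only a sketch: install_r_tarball is a hypothetical helper name, and the file to pass is whatever you downloaded from the page above.

```shell
# Sketch: install a locally downloaded R package tarball from the shell.
# install_r_tarball is a hypothetical helper; pass the path of the tar.gz
# you downloaded (the exact file name depends on the current release).
install_r_tarball() {
  tarball="$1"
  if [ ! -f "$tarball" ]; then
    echo "No such file: $tarball" >&2
    return 1
  fi
  # R CMD INSTALL builds and installs a source package into the default library.
  R CMD INSTALL "$tarball"
}
```

Run it as a user that has write access to the R library directory, passing the path of the downloaded file as the only argument.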

### Need to run every time RStudio is initialized or restarted
Set environment variables in RStudio
Sys.setenv(HADOOP_HOME="your hadoop installation directory here e.g. /usr/lib/hadoop")
Sys.setenv(HIVE_HOME="your hive installation directory here e.g. /usr/lib/hive")
# Do NOT execute the following line:
# Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf/")
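
Since these variables have to be set again after every restart, one possible shortcut is to keep them in ~/.Renviron, which R reads at startup. The sketch below is an assumption on my part rather than something verified on this cluster; add_renviron_var is a hypothetical helper name, and the paths are the same example paths as above.

```shell
# Sketch: append KEY=VALUE lines to an Renviron-style file.
# add_renviron_var is a hypothetical helper; the default target is ~/.Renviron.
add_renviron_var() {
  key="$1"
  value="$2"
  file="${3:-$HOME/.Renviron}"
  # Skip keys that are already present so repeated runs do not duplicate lines.
  if [ -f "$file" ] && grep -q "^${key}=" "$file"; then
    return 0
  fi
  printf '%s=%s\n' "$key" "$value" >> "$file"
}
```

For example, add_renviron_var HADOOP_HOME /usr/lib/hadoop followed by add_renviron_var HIVE_HOME /usr/lib/hive, run as the user who logs in to RStudio. HADOOP_CONF_DIR is deliberately left out.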

Sys.setenv("RHIVE_FS_HOME"="your RHive installation directory here e.g. /home/rhive")
This needs to be a local directory on the node where Hive is installed; create it if it doesn't exist. The user created above (rstudio) must own this local directory (chown -R).
If not, this is the error:
Error: java.io.IOException: Mkdirs failed to create file:/home/rhive/lib/2.0-0.2
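
The directory setup described above can be sketched as follows. prepare_rhive_home is a hypothetical helper name; the defaults (/home/rhive and the rstudio user) are the example values from this post, so adjust them to your setup.

```shell
# Sketch: create the local RHive directory and hand it to the RStudio user.
# prepare_rhive_home is a hypothetical helper; on the cluster it needs root
# privileges, on the node where Hive is installed.
prepare_rhive_home() {
  dir="${1:-/home/rhive}"
  owner="${2:-rstudio}"
  # Create the directory (and parents) if missing, then give it to the owner recursively.
  mkdir -p "$dir" && chown -R "$owner" "$dir"
}
```

Done by hand, this amounts to sudo mkdir -p /home/rhive followed by sudo chown -R rstudio /home/rhive.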

library(RHive)
rhive.init()
rhive.connect(host="IP ADDRESS/Hostname", port=10000, hiveServer2=TRUE)

If you get this error:
Error: java.sql.SQLException: Error while processing statement: file:///rhive/lib/2.0-0.2/rhive_udf.jar does not exist.
check that the jar file is in that directory and that the rstudio user has permission to read it.
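
A quick way to do that check from the shell is sketched below. check_readable is a hypothetical helper name; pass it the jar path from your own error message.

```shell
# Sketch: report whether a file exists and is readable by the current user.
# check_readable is a hypothetical helper name.
check_readable() {
  if [ ! -e "$1" ]; then
    echo "missing: $1"
    return 1
  elif [ ! -r "$1" ]; then
    echo "not readable: $1"
    return 2
  fi
  echo "ok: $1"
}
```

Run it while logged in as the rstudio user, since the result depends on who is asking; root can read almost everything, so a check as root says little about permissions.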

Hope it helps.

Cheers!

References and Thanks:
http://jsolderitsch.wordpress.com/hortonworks-sandbox-r-and-rstudio-install/