
Friday, March 13, 2015

Changing Storage in Hadoop

After I installed an HDP 2.1.2 cluster, I noticed that the nodes were not using the drive partition planned for storage. The Linux boxes each had two partitions assigned during the OS install: one for the OS and one for data storage.

Somehow the data partition was not available at cluster installation, most probably because it was not mounted. Following are the steps I performed to change the HDFS storage location, along with the drive configuration needed.

First, format and optimize the partition or drive.
mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb
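To confirm the filesystem was created with the requested features, tune2fs (part of e2fsprogs) can list them; the output should include dir_index, extent and sparse_super:
tune2fs -l /dev/sdb | grep -i features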

Create a mount directory
mkdir -p /disk/sdb1

Mount with optimized settings
mount -o noatime,nodiratime /dev/sdb /disk/sdb1

Append to the fstab file so that the partition is mounted on boot (very critical)
echo "/dev/sdb /disk/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab

Add folder for hdfs data
mkdir -p /disk/sdb1/data

Location to store Namenode data
mkdir -p /disk/sdb1/hdfs/namenode

Location to store Secondary Namenode data
mkdir -p /disk/sdb1/hdfs/namesecondary

Set these in hdfs-site.xml or through Ambari
dfs.namenode.name.dir = /disk/sdb1/hdfs/namenode
dfs.namenode.checkpoint.dir = /disk/sdb1/hdfs/namesecondary
dfs.datanode.data.dir = /disk/sdb1/data
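If you edit hdfs-site.xml by hand instead of through Ambari, each setting is a property element; a minimal sketch for one of the three (the others follow the same pattern):
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk/sdb1/data</value>
</property>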

Set permissions (the namenode directories need the same ownership as the data directory)
sudo chown -R hdfs:hadoop /disk/sdb1/data /disk/sdb1/hdfs

Format the namenode (run as the hdfs user; in Hadoop 2 the hdfs command replaces the deprecated hadoop namenode form)
sudo -u hdfs hdfs namenode -format

Start the namenode through Ambari, or from the CLI:
sudo -u hdfs /usr/lib/hadoop/sbin/hadoop-daemon.sh start namenode

Start all nodes and services. The new drive should be listed.
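To verify that HDFS picked up the new location, the datanode report shows the configured capacity per node:
sudo -u hdfs hdfs dfsadmin -report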

References:
http://www.slideshare.net/leonsp/best-practices-for-deploying-hadoop-biginsights-in-the-cloud


Thursday, December 11, 2014

RStudio setup on Hortonworks Hadoop 2.1 Cluster

Here is a complete set of steps I performed to set up R and RStudio on a small cluster.

Installing R and RStudio
-- R should be installed on the node which has the Hive server
-- RStudio can be installed anywhere (I installed it on the edge node)

# Add the EPEL repository, then install R and its prerequisites
sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
sudo yum -y install git wget R
# Confirm the Hadoop environment file exists, link it so login shells pick it up,
# and strip the 'export' keywords so R can read the variables from ~/.Renviron
ls /etc/default
sudo ln -s /etc/default/hadoop /etc/profile.d/hadoop.sh
cat /etc/profile.d/hadoop.sh | sed 's/export //g' > ~/.Renviron
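The resulting ~/.Renviron holds plain NAME=value lines that R reads at startup; the exact set depends on your /etc/default/hadoop, but it should include something like:
HADOOP_HOME=/usr/lib/hadoop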

Check the latest version of RStudio Server at
http://www.rstudio.com/products/rstudio/download-server/
(the page has installation steps; follow those).
Listing them here for completeness, with the release current at the time of writing:
sudo yum install openssl098e # Required only for RedHat/CentOS 6 and 7
wget http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-0.98.1091-x86_64.rpm
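To confirm the service came up cleanly, RStudio Server ships a built-in check:
sudo rstudio-server verify-installation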

Create a new system user and set a password (RStudio logins use local system accounts)
sudo useradd rstudio
sudo passwd rstudio
>> hadoop

Log in to RStudio at http://hostname:8787

Install the required packages either from
RStudio >> Tools >> Install Packages
OR from the R console:
install.packages( c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'functional', 'plyr', 'stringr'), repos='http://cran.revolutionanalytics.com')
install.packages( c('bitops', 'reshape2'), repos='http://cran.revolutionanalytics.com')
install.packages( c('RHive'), repos='http://cran.revolutionanalytics.com')
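A quick way to confirm everything installed cleanly; this prints any package still missing and should return character(0):
pkgs <- c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'functional', 'plyr', 'stringr', 'bitops', 'reshape2', 'RHive')
pkgs[!pkgs %in% rownames(installed.packages())]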

Download the latest rmr2 package from:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Copy the tar.gz file to the server (e.g. with WinSCP) and install it through RStudio.
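Alternatively, install it from the R console; the filename below is a placeholder for whichever rmr2 version you downloaded:
install.packages('/path/to/rmr2_x.y.z.tar.gz', repos=NULL, type='source')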

### Need to run every time RStudio is initialized or restarted
Set environment variables in RStudio
Sys.setenv(HADOOP_HOME="your hadoop installation directory here e.g. /usr/lib/hadoop")
Sys.setenv(HIVE_HOME="your hive installation directory here e.g. /usr/lib/hive")
# Do not execute this one!
# Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf/")

Sys.setenv("RHIVE_FS_HOME"="your RHive installation directory here e.g. /home/rhive")
This must be a local directory on the node where Hive is installed; create it if it doesn't exist. The user created above (rstudio) needs chown -R ownership of this local directory.
If not, this is the error:
Error: java.io.IOException: Mkdirs failed to create file:/home/rhive/lib/2.0-0.2
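A minimal sketch of creating that directory, assuming the example path /home/rhive and the default rstudio group from useradd above:
sudo mkdir -p /home/rhive
sudo chown -R rstudio:rstudio /home/rhive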

library(RHive)
rhive.init()
rhive.connect(host="IP ADDRESS/Hostname", port=10000, hiveServer2=TRUE)
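Once connected, a quick smoke test: rhive.query runs HiveQL and returns the result as a data frame.
rhive.query("show tables")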

If you get this error:
Error: java.sql.SQLException: Error while processing statement: file:///rhive/lib/2.0-0.2/rhive_udf.jar does not exist.
check that the jar file is in the said directory and that the rstudio user has permission on it.

Hope it helps.

Cheers!

References and Thanks:
http://jsolderitsch.wordpress.com/hortonworks-sandbox-r-and-rstudio-install/