
Install Hadoop on VirtualBox Virtual Machines

— Created: Xiaoke, 2017/10/06 15:44 CST
— Last modified: Xiaoke, 2017/10/08 11:40 CST

Introduction

Hadoop is currently widely used for 'big data' storage and processing. This article summarises how to install it on a VirtualBox virtual machine.

Create a Virtual Machine

This is straightforward in VirtualBox; the only thing we need to pay attention to is the network. I use TWO network adapters, one attached to NAT and the other to a Host-only Adapter. By default the list of host-only networks is empty, so we need to add one through the VirtualBox network preferences, e.g. vboxnet0 with an IP address of 192.168.56.1/24. It is recommended to leave DHCP off.

The reason I have two network adapters is that, through NAT, the virtual machine can access the Internet, while with the Host-only Adapter the host machine is on the same local network as the virtual machine. A bridged connection can, of course, do both; use it if you know how it works.
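For those who prefer the command line, roughly the same network setup can be scripted with VBoxManage. This is only a sketch: "centos-hadoop" is a placeholder for whatever your virtual machine is actually called, and the host-only interface may get a different name on your system.

VBoxManage hostonlyif create                  # creates a host-only interface, e.g. vboxnet0
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 --netmask 255.255.255.0
VBoxManage modifyvm "centos-hadoop" --nic1 nat --nic2 hostonly --hostonlyadapter2 vboxnet0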

The following figure shows the VirtualBox network preferences panel.

Install CentOS

During the Installation

CentOS ['sen-toes] has a neat GUI for the installation, which is informative enough by itself. There are several things worth pointing out.

  • Automatic partitioning is sufficient for our purpose. This leaves two partitions, one for /boot and one for the root /. No partitions for swap or other mount points, e.g. /home or /var, are created.
  • The network and hostname can be configured here if one prefers using the GUI. We can also do this later by editing various configuration files.
  • A user can be created and made an administrator, although we can also do this later on the command line.

The following figure shows what the installation GUI looks like.

User Creation and Administrator Privilege

In case we did not create a user during the installation process, we'll have to log into the system as root. Then we are free to create a user, e.g. hadoop, and make it an administrator by

adduser -m hadoop
passwd hadoop
usermod -aG wheel hadoop
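To check that the new user really has administrator rights, log in as hadoop and try a harmless command through sudo; it should ask for hadoop's password and then print root:

su - hadoop
sudo whoami    # should print "root"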

Network

Somehow, the DHCP client in CentOS 7 does not start with the system, so in order for VirtualBox's NAT to work, you have to run

sudo dhclient

After this, type ip addr and you'll see an IP address like 10.0.2.xx listed below one of the network interfaces. This indicates that NAT works and we are able to connect to the Internet, so we can install and update some software packages, e.g.

sudo yum install -y vim perl openssh-clients net-tools
sudo yum update
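Rather than running dhclient by hand after every boot, the NAT interface can also be brought up automatically. Assuming the NAT interface is called enp0s3 (check the actual name with ip addr), edit /etc/sysconfig/network-scripts/ifcfg-enp0s3 and set

ONBOOT=yes
BOOTPROTO=dhcp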

Then we need to configure the host-only network interface, enp0s8 in my case. Since DHCP is not enabled for this adapter in VirtualBox, we need to configure the network address manually by editing /etc/sysconfig/network-scripts/ifcfg-enp0s8. The file name follows the network interface name and may differ on other computers. Open the file and change or add the following lines

ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.56.100
NETMASK=255.255.255.0

Then restart the network by

sudo systemctl restart network

and restart the DHCP client by

sudo dhclient -r
sudo dhclient

Type ip addr and you'll be able to see both IP addresses, as shown in the figure below.

SSH

This SSH configuration aims to simplify access between hosts in a cluster. First a private-public key pair is generated. The public key is then added to the list of authorised keys so that the host can log into itself without a password. This sounds a bit strange, but Hadoop in Pseudo-Distributed Operation does need to log into itself. SSH can be configured using

ssh-keygen (press Enter at each prompt to accept the defaults)
cd ~/.ssh
cp id_rsa.pub authorized_keys

Also open the file /etc/ssh/ssh_config and add or uncomment the following line

StrictHostKeyChecking no
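To verify the setup, SSH into the machine itself; it should log in without asking for a password:

ssh localhost
exit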

Turn Off Firewall and SELinux

I have no idea at the current stage why this is necessary. Linux firewalls manage incoming connections and by default deny TCP connections to lots of ports. My guess is that Hadoop runs services on quite a few ports. Turning off the firewall is a simple way to allow connections to those ports, although it might be more secure to add iptables rules instead.

CentOS 7 comes with a firewall manager for iptables called firewalld. Thus, to turn off the firewall, run

sudo systemctl stop firewalld
sudo systemctl disable firewalld

Security-Enhanced Linux (SELinux) should also be turned off (still I have no idea why) by editing /etc/selinux/config

SELINUX=disabled
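The change in /etc/selinux/config only takes effect after a reboot. To put SELinux into permissive mode immediately as well, you can run

sudo setenforce 0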

Hostname and Hosts

The hostname is, as the term suggests, the name of the host, i.e. what the computer is called on the network. The default name is localhost. Change it to something meaningful, e.g. hadoop0, so that hosts on the same network can be distinguished from each other. Do this by editing /etc/hostname,

hadoop0.localdomain

Reboot the virtual machine and type hostname; we'll see the modified name of the computer.
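On CentOS 7 the same change can also be made without a reboot using hostnamectl. In addition, since the Hadoop configuration below refers to the machine as hadoop0.localdomain, that name should resolve to the host-only address; one way to arrange this is an entry in /etc/hosts:

sudo hostnamectl set-hostname hadoop0.localdomain
echo "192.168.56.100 hadoop0.localdomain hadoop0" | sudo tee -a /etc/hosts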

Install JDK

Download and Uncompress

Download the JDK from the Oracle website; for example, I use jdk-8u144-linux-x64.tar.gz. Copy it to our virtual machine by

scp jdk-8u144-linux-x64.tar.gz hadoop@192.168.56.100:/home/hadoop

Then log into the virtual machine, make a folder /opt/jdk, change its ownership and uncompress the tar file by

sudo mkdir /opt/jdk
sudo chown hadoop:hadoop /opt/jdk
tar -xzvf jdk-8u144-linux-x64.tar.gz -C /opt/jdk/

Environment Variables

Then add some environment variables by appending the following lines to the file /etc/profile

export JAVA_HOME=/opt/jdk/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH

then type

source /etc/profile
java -version

If everything goes right, we'll be able to see something like

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Install Hadoop

Download and Uncompress

Download Hadoop and copy it to our virtual machine by

scp hadoop-2.8.1.tar.gz hadoop@192.168.56.100:/home/hadoop

Then log into the virtual machine, make a folder /opt/hadoop, change its ownership and uncompress the tar file by

sudo mkdir /opt/hadoop
sudo chown hadoop:hadoop /opt/hadoop
tar -xzvf hadoop-2.8.1.tar.gz -C /opt/hadoop

Environment Variables

Then add some environment variables by appending the following lines to the file /etc/profile

export HADOOP_HOME=/opt/hadoop/hadoop-2.8.1
export PATH=$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

The lines starting with HADOOP_INSTALL may not be necessary, and I have no idea what they are for.
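After another source /etc/profile, a quick way to check that the paths are set up correctly is

source /etc/profile
hadoop version    # should print "Hadoop 2.8.1" and related build information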

Configure hadoop-env.sh

Then open the file /opt/hadoop/hadoop-2.8.1/etc/hadoop/hadoop-env.sh and change the line

export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/opt/jdk/jdk1.8.0_144

Configure yarn-env.sh

Do the same in /opt/hadoop/hadoop-2.8.1/etc/hadoop/yarn-env.sh.

Configure core-site.xml

Put the following lines in this file

<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0.localdomain:9000/</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop/tmp</value>
    <description>A base for other temporary directories</description>
</property>
</configuration>

The second property tells Hadoop that HDFS files should be stored at /home/hadoop/hadoop/tmp (this directory is created automatically when the namenode is formatted). Otherwise a directory will be created under /tmp/hadoop-$USER. Note that there is NO file:// URI scheme in front of the path. Some people say the path should start with file:///home/..., others say file:/home/...; neither works for me and I don't know why.

Now we should format the namenode by

hdfs namenode -format

Pay attention to the output. If there is a message like

INFO util.ExitUtil: Exiting with status 0

then congratulations; otherwise something is wrong with the core-site.xml file.

Configure hdfs-site.xml

Put the following lines into this file

<configuration>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
</configuration>

This tells HDFS to store only 1 copy of the data files. Additionally, we can explicitly specify the paths for the namenode's metadata files and the datanode's data files, such as

<property>  
    <name>dfs.namenode.name.dir</name>  
    <value>/data/hadoop/dfs/name</value>  
</property>  
<property>  
    <name>dfs.datanode.data.dir</name>  
    <value>/data/hadoop/dfs/data</value>  
</property>  

But I am not going to do it now. Without these two paths, the files will be stored under hadoop.tmp.dir.

Configure mapred-site.xml

Put the following lines in this file. This tells MapReduce to run its jobs on YARN instead of the default local job runner.

<configuration>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
</configuration>
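Note that in Hadoop 2.8.1 this file does not exist out of the box; it is usually created by copying the bundled template first:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml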

Configure yarn-site.xml

Put the following lines in this file. This registers the shuffle auxiliary service, which NodeManagers use to serve map outputs to the reducers.

<configuration>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
</configuration>

Start Services

Use command

start-dfs.sh

to start HDFS, including a namenode, a datanode and a secondary namenode. Use

jps

to list the running services; the command name stands for Java ps. Use command

start-yarn.sh

to start YARN. After this we'll be able to access Hadoop through its web interfaces, e.g. on ports 50070 (NameNode), 50090 (secondary NameNode) and 8088 (YARN ResourceManager).
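For example, with the host-only address used above and the firewall turned off as described earlier, the following should be reachable from a browser on the host machine:

http://192.168.56.100:50070    # NameNode web UI
http://192.168.56.100:8088     # YARN ResourceManager web UI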

Hadoop Tests

Use the hadoop example jar file and execute

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 10 10

This command estimates the value of pi. If the run ends with results like

Job Finished in 71.946 seconds
Estimated value of Pi is 3.20000000000000000000

Congratulations. More information about the examples jar can be displayed by running it without any parameters, like

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar
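As a slightly fuller test that also exercises HDFS, a classic word count over the Hadoop configuration files looks roughly like this (the input and output paths are just an example):

hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000    # print the word counts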
