Hadoop 0.20.S Virtual Machine Appliance

At Yahoo!, we recently implemented a stronger notion of security for the Hadoop platform, based on Kerberos as the underlying authentication system. We have also successfully enabled this feature on our internal data processing clusters within Yahoo!. I am sure many Hadoop developers and enterprise users are looking forward to getting hands-on experience with this enterprise-class Hadoop security feature.

In the past, we've helped developers and users get started with Hadoop by hosting a comprehensive Hadoop tutorial on YDN, along with a pre-configured single-node Hadoop (0.18.0) Virtual Machine appliance.

This time, we decided to upgrade this Hadoop VM to a pre-configured single-node Hadoop 0.20.S cluster, along with the required Kerberos system components. We have also included Pig (version 0.7.0), a high-level, SQL-like data processing language used at Yahoo!.

This blog post describes how to get started with the Hadoop 0.20.S VM appliance. The basic information about downloading, setting up VMware Player, and using the Hadoop VM is the same as described in Module 3 of the tutorial (https://developer.yahoo.com/hadoop/tutorial/module3.html), except that you should use the information and links below to download the latest VMware Player and the Hadoop 0.20.S VM image. You should also review the security-specific commands below that need to be run before launching M/R or Pig jobs.

For more details on deploying and configuring the Yahoo! Hadoop 0.20.S security distribution, watch for upcoming announcements and details on Hadoop-YDN.

Installing and Running the Hadoop 0.20.S Virtual Machine:

  • style="color:rgb(255, 102, 0);">Virtual Machine and Hadoop
    environment:
    See href="https://developer.yahoo.com/hadoop/tutorial/module3.html#vm">details
    here.
  • style="color:rgb(255, 102, 0);">Install VMware Player: style="font-weight:bold;"> See href="https://developer.yahoo.com/hadoop/tutorial/module3.html#vmware-install">details
    here. To download latest VMware Player for Windows/Linux, go to href="http://www.vmware.com/products/player/">Vmware site
  • Setting up the Virtual Environment for Hadoop 0.20.S: Copy the Hadoop 0.20.S Virtual Machine (http://shared.zenfs.com/hadoop-vm-appliance-0-20-S.zip) to a location on your hard drive. It is a zipped VMware folder (hadoop-vm-appliance-0-20-S, approx. 400 MB) containing a few files: a .vmdk file, which is a snapshot of the virtual machine's hard drive, and a .vmx file, which contains the configuration information to start the virtual machine. After unzipping the folder, double-click the hadoop-appliance-0.20.S.vmx file to start the virtual machine. Note: the uncompressed size of the hadoop-vm-appliance-0-20-S folder is ~2 GB, and depending on the data you upload for testing, the VM disk is configured to grow up to 20 GB.
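On a Linux host, the command-line equivalent of the double-click would look something like the following (a sketch: it assumes VMware Player's vmplayer command is on your PATH and that the .vmx file sits inside the unzipped folder):

$ unzip hadoop-vm-appliance-0-20-S.zip
$ vmplayer hadoop-vm-appliance-0-20-S/hadoop-appliance-0.20.S.vmx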

When you start
the virtual machine for the first time, VMware Player will recognize
that the virtual machine image is not in the same location it
used to be. You should inform VMware Player that you copied this
virtual machine image (choose "I copied it"). VMware Player will then generate new session
identifiers for this instance of the virtual machine. If you later move
the VM image to a different location on your own hard drive, you
should tell VMware Player that you have moved the image.

After you select this option and click OK, the virtual machine
should begin booting normally. You will see it perform the standard
boot procedure for a Linux system. It will bind itself to an IP address
on an unused network segment, and then display a prompt allowing a user
to log in.

Note: The IP address displayed on the login screen can be used to connect to the VM instance over SSH. The login screen also displays information about starting/stopping Hadoop daemons, users/passwords, and how to shut down the VM.

Note: It is much more convenient to access the VM via SSH. See details at https://developer.yahoo.com/hadoop/tutorial/module3.html#vm-ssh.
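For example, if the login screen shows 192.168.126.128 (an illustrative address; yours will differ), you can connect from your host machine with:

$ ssh hadoop-user@192.168.126.128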

  • style="font-weight:bold;">Virtual
    Machine User
    Accounts
    :
The virtual machine comes pre-configured
with two user accounts:
"root" and  "hadoop-user". The hadoop-user account has sudo
permissions to perform system-management functions, such as shutting
down the virtual machine. The vast majority of your interaction with
the virtual machine will be as hadoop-user. To log in as hadoop-user,
first click inside the virtual machine's display. The virtual machine
will take control of your keyboard and mouse. To escape back into
Windows at any time, press CTRL+ALT at the same time. The hadoop-user
user's password is hadoop. To log in as root, the password is root.
  • Hadoop Environment:
Linux:  Ubuntu 8.04
Java:   JRE 6 Update 7 (see license info at /usr/jre16/)
Hadoop: 0.20.S (installed at /usr/local/hadoop; /home/hadoop-user/hadoop is a symlink to the install directory)
Pig:    0.7.0 (pig jar installed at /usr/local/pig; /home/hadoop-user/pig-tutorial/pig.jar is a symlink to the one in the install directory)

Login: hadoop-user, password: hadoop (sudo privileges are granted to hadoop-user). The other users are hdfs and mapred (password: hadoop).

The Hadoop VM starts all the required Hadoop and Kerberos daemons during the boot-up process, but if you need to stop or restart them:

  • To start/stop/restart Hadoop: log in as hadoop-user and run 'sudo /etc/init.d/hadoop [start | stop | restart]' (running 'sudo /etc/init.d/hadoop' with no argument prints the usage)
  • To format HDFS and clean all state/logs: log in as hadoop-user and run 'sudo reinit-hadoop'
  • To start/stop/restart the Kerberos KDC server: log in as hadoop-user and run 'sudo /etc/init.d/krb5-kdc [start | stop | restart]'
  • To start/stop/restart the Kerberos admin server: log in as hadoop-user and run 'sudo /etc/init.d/krb5-admin-server [start | stop | restart]'
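For example, a full bounce of the Hadoop daemons and both Kerberos services in one SSH session would look like this (output omitted; all three init scripts are the ones listed above):

hadoop-user@hadoop-desk:~$ sudo /etc/init.d/hadoop restart
hadoop-user@hadoop-desk:~$ sudo /etc/init.d/krb5-kdc restart
hadoop-user@hadoop-desk:~$ sudo /etc/init.d/krb5-admin-server restart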

To shut down the Virtual Machine: log in as hadoop-user and run 'sudo poweroff'.

Environment for 'hadoop-user' (set in /home/hadoop-user/.profile):
  $HADOOP_HOME=/usr/local/hadoop
  $HADOOP_CONF_DIR=/usr/local/etc/hadoop-conf
  $PATH=/usr/local/hadoop/bin:$PATH
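A quick sanity check that the profile has been sourced (an illustrative session; the expected values come from the listing above):

hadoop-user@hadoop-desk:~$ echo $HADOOP_CONF_DIR
/usr/local/etc/hadoop-conf
hadoop-user@hadoop-desk:~$ which hadoop
/usr/local/hadoop/bin/hadoop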

  • Running M/R Jobs: Running M/R jobs on Hadoop 0.20.S is pretty much the same as running them on a non-secure version of Hadoop, except that before running any Hadoop jobs or HDFS commands, hadoop-user needs to get a Kerberos ticket using the command 'kinit'; the password is hadoopYahoo1234. For example:
hadoop-user@hadoop-desk:~$ cd hadoop
hadoop-user@hadoop-desk:~/hadoop$ kinit
Password for hadoop-user@LOCALDOMAIN: hadoopYahoo1234
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000
For automated runs of Hadoop jobs, a keytab file is created under hadoop-user's home directory (/home/hadoop-user/hadoop-user.keytab). It allows the user to run 'kinit' without entering the password manually, so for automated runs of Hadoop commands or M/R and Pig jobs through the cron daemon, users can invoke the following command to get the Kerberos ticket. Use the command 'klist' to view the Kerberos ticket and its validity. For example:
hadoop-user@hadoop-desk:~$ cd hadoop
hadoop-user@hadoop-desk:~/hadoop$ kinit -k -t /home/hadoop-user/hadoop-user.keytab hadoop-user/localhost@LOCALDOMAIN
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000
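As a sketch of how this fits into cron (the schedule and absolute paths are illustrative, not part of the appliance; cron runs with a minimal PATH, so full paths are used), a crontab entry for hadoop-user could chain the keytab-based kinit with the job:

# Run the example pi job every night at 2 a.m., refreshing the Kerberos ticket first
0 2 * * * /usr/bin/kinit -k -t /home/hadoop-user/hadoop-user.keytab hadoop-user/localhost@LOCALDOMAIN && /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/hadoop-examples-0.20.104.1.1006042001.jar pi 10 1000000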
  • Running Pig Tutorial: The Pig tutorial is installed at /home/hadoop-user/pig-tutorial. Example commands to run the Pig scripts are given in example.run.cmd.sh, and the data needed by the scripts has already been copied to HDFS. See more details about the Pig Tutorial at Pig@Apache: http://hadoop.apache.org/pig/docs/r0.7.0/tutorial.html
hadoop-user@hadoop-desk:~$ cd pig-tutorial
hadoop-user@hadoop-desk:~/pig-tutorial$ sh example.run.cmd.sh
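Pig jobs run against the same secure cluster, so they need a valid Kerberos ticket as well. Before launching, you can confirm one is in place with 'klist' (illustrative output; the cache file name and timestamps will vary):

hadoop-user@hadoop-desk:~$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: hadoop-user@LOCALDOMAIN

Valid starting     Expires            Service principal
06/08/10 10:00:00  06/09/10 10:00:00  krbtgt/LOCALDOMAIN@LOCALDOMAIN

If no ticket is listed, run 'kinit' (or the keytab-based kinit shown earlier) first.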
  • Shutting down the VM: When you are done with the virtual machine, you can turn it off by logging in as hadoop-user and running 'sudo poweroff'. The virtual machine will shut itself down in an orderly fashion, and the window it runs in will disappear.

Last but not least, I would like to thank Devaraj Das and Jianyong Dai from the Yahoo! Hadoop and Pig development teams for their help in setting up and configuring Hadoop 0.20.S and Pig, respectively.


Notice:
Yahoo! does not offer any support for the Hadoop Virtual Machine.
The software includes cryptographic software that is subject to U.S. export control laws and applicable export and import laws of other countries. BEFORE using any software made available from this site, it is your responsibility to understand and comply with these laws. This software is being exported in accordance with the Export Administration Regulations. As of June 2009, you are prohibited from exporting and re-exporting this software to Cuba, Iran, North Korea, Sudan, Syria and any other countries specified by regulatory update to the U.S. export control laws and regulations. Diversion contrary to U.S. law is prohibited.

 

Suhas Gogate
Technical Yahoo!, Cloud Solutions Team, Yahoo!