Installing PHD Manually

PHD 3.0

Getting Ready to Install (Summary)

Meet Minimum System Requirements

To run Pivotal HD, your system must meet minimum requirements. See Installing Pivotal HD with Ambari 1.7 for the system requirements.

Installing and Configuring Oracle JDK 7

Use the following instructions to manually install JDK 7:

  • Verify that you have a /usr/java directory. If not, create one:

    mkdir /usr/java

  • Download the Oracle 64-bit JDK (jdk-7u67-linux-x64.tar.gz) from the Oracle download site. Open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html.

  • Copy the downloaded jdk-7u67-linux-x64.tar.gz file to the /usr/java directory.

  • Navigate to the /usr/java directory and extract the jdk-7u67-linux-x64.tar.gz file:

    cd /usr/java
    tar zxvf jdk-7u67-linux-x64.tar.gz

    The JDK files are extracted into the /usr/java/jdk1.7.0_67 directory.

  • Create a symbolic link (symlink) to the JDK:

    ln -s /usr/java/jdk1.7.0_67 /usr/java/default

  • Set the JAVA_HOME and PATH environment variables.

    export JAVA_HOME=/usr/java/default
    export PATH=$JAVA_HOME/bin:$PATH

  • Verify that Java is installed in your environment by running the following command:

    java -version

    You should see output similar to the following:

    
    java version "1.7.0_67"
    Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
    Java HotSpot(TM) 64-Bit Server VM (build 24.67-b01, mixed mode)
    
OpenJDK 7

    OpenJDK 7 on PHD 3.0 does not work if you are using SLES as your OS. Use the following instructions to manually install OpenJDK 7:
  • Check the version. From a terminal window, type:

    java -version

  • (Optional) Uninstall the Java package if the JDK version is less than 7. For example, if you are using CentOS:

    rpm -qa | grep java
    yum remove {java-1.*}

  • (Optional) Verify that the default Java package is uninstalled.

    which java

  • (Optional) Download OpenJDK 7 RPMs. From the command-line, run:

    RedHat/CentOS Linux:

    yum install java-1.7.0-openjdk java-1.7.0-openjdk-devel

    SUSE:

    zypper install java-1.7.0-openjdk java-1.7.0-openjdk-devel

  • (Optional) Create symbolic links (symlinks) to the JDK.

    mkdir /usr/java
    ln -s /usr/phd/current/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64 /usr/java/default

  • (Optional) Set up your environment to define JAVA_HOME to put the Java Virtual Machine and the Java compiler on your path.

    export JAVA_HOME=/usr/java/default
    export PATH=$JAVA_HOME/bin:$PATH

  • (Optional) Verify if Java is installed in your environment. Execute the following from the command-line console:

    java -version

    You should see output similar to the following:

    openjdk version "1.7.0"
    OpenJDK Runtime Environment (build 1.7.0)
    OpenJDK Client VM (build 20.6-b01, mixed mode)

Installing and Configuring the Metastore
    The database administrator must create the following users and specify the following values.
  • For Hive: hive_dbname, hive_dbuser, and hive_dbpasswd.

  • For Oozie: oozie_dbname, oozie_dbuser, and oozie_dbpasswd.
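For example, a database administrator might settle on values such as the following (illustrative only; choose your own names and strong passwords):

    hive_dbname=hive      hive_dbuser=hive      hive_dbpasswd=<hive-password>
    oozie_dbname=oozie    oozie_dbuser=oozie    oozie_dbpasswd=<oozie-password>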

Installing and Configuring PostgreSQL

The following instructions explain how to install PostgreSQL as the metastore database. See your third-party documentation for instructions on how to install other supported databases.

RHEL/CentOS Linux

    To install a new instance of PostgreSQL:
  • Connect to the host machine where you plan to deploy the PostgreSQL instance.

    At a terminal window, enter:

    yum install postgresql-server

  • Start the instance.

    /etc/init.d/postgresql start

  • Reconfigure PostgreSQL server:

    • Edit the /var/lib/pgsql/data/postgresql.conf file.

      Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'

    • Edit the /var/lib/pgsql/data/postgresql.conf file.

      Uncomment the port setting by changing #port = 5432 to port = 5432

    • Edit the /var/lib/pgsql/data/pg_hba.conf file.

      Add the following line:

      host all all 0.0.0.0/0 trust

    • Optional: If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

      standard_conforming_strings = off

  • Create users for PostgreSQL server.

    As the postgres user, enter:

    
    echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres
    echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres
    echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres
    

    where $postgres is the postgres user, $user is the user you want to create, and $dbname is the name of your postgres database. (A concrete example appears at the end of this list.)

  • On the Hive Metastore host, install the connector.

    yum install postgresql-jdbc*

  • Copy the connector .jar file to the Java share directory.

    cp /usr/share/pgsql/postgresql-*.jdbc3.jar /usr/share/java/postgresql-jdbc.jar

  • Confirm that the .jar is in the Java share directory.

    ls /usr/share/java/postgresql-jdbc.jar

  • Change the access mode of the .jar file to 644.

    chmod 644 /usr/share/java/postgresql-jdbc.jar
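As a concrete example of the user-creation step above, the following sketch creates a hypothetical Hive metastore database named hive owned by a user named hive (the names and password are placeholders; substitute your own):

    echo "CREATE DATABASE hive;" | sudo -u postgres psql -U postgres
    echo "CREATE USER hive WITH PASSWORD 'hivepassword';" | sudo -u postgres psql -U postgres
    echo "GRANT ALL PRIVILEGES ON DATABASE hive TO hive;" | sudo -u postgres psql -U postgres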

SLES

    To install a new instance of PostgreSQL:
  • Connect to the host machine where you plan to deploy the PostgreSQL instance.

    At a terminal window, enter:

    zypper install postgresql-server

  • Start the instance.

    /etc/init.d/postgresql start

  • Reconfigure PostgreSQL server:

    • Edit the /var/lib/pgsql/data/postgresql.conf file.

      Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'

    • Edit the /var/lib/pgsql/data/postgresql.conf file.

      Uncomment the port setting by changing #port = 5432 to port = 5432

    • Edit the /var/lib/pgsql/data/pg_hba.conf file.

      Add the following line:

      host all all 0.0.0.0/0 trust

    • Optional: If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

      standard_conforming_strings = off

  • Create users for PostgreSQL server.

    As the postgres user, enter:

    
    echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres
    echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres
    echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres
    

    where $postgres is the postgres user, $user is the user you want to create, and $dbname is the name of your postgres database.

  • On the Hive Metastore host, install the connector.

    zypper install -y postgresql-jdbc

  • Copy the connector .jar file to the Java share directory.

    cp /usr/share/pgsql/postgresql-*.jdbc3.jar /usr/share/java/postgresql-jdbc.jar

  • Confirm that the .jar is in the Java share directory.

    ls /usr/share/java/postgresql-jdbc.jar

  • Change the access mode of the .jar file to 644.

    chmod 644 /usr/share/java/postgresql-jdbc.jar

Installing and Configuring MySQL

This section describes how to install MySQL as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.

RHEL/CentOS

    To install a new instance of MySQL:
  • Connect to the host machine you plan to use for Hive and HCatalog.

  • Install MySQL server.

    From a terminal window, enter:

    yum install mysql-server

  • Start the instance.

    /etc/init.d/mysqld start

  • Set the root user password using the following command format:

    mysqladmin -u root password $mysqlpassword

    For example, to set the password to "root":

    mysqladmin -u root password root

  • Remove unnecessary information from log and STDOUT.

    mysqladmin -u root 2>&1 >/dev/null

  • Log in to MySQL as root:

    mysql -u root -proot

  • As root, create the “dbuser” and grant it adequate privileges.

    This user provides access to the Hive metastore. Use the following series of commands (shown here with the returned responses) to create dbuser with password dbuser.

    [root@c6402 /]# mysql -u root -proot

    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 11
    Server version: 5.1.73 Source distribution

    Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

    Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
    Query OK, 0 rows affected (0.00 sec)

    mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser';
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';
    Query OK, 0 rows affected (0.00 sec)

    mysql> FLUSH PRIVILEGES;
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT OPTION;
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
    Query OK, 0 rows affected (0.00 sec)

    mysql>

  • Use the exit command to exit MySQL.

  • You should now be able to reconnect to the database as "dbuser" using the following command:

    mysql -u dbuser -pdbuser

    After testing the dbuser login, use the exit command to exit MySQL.

  • Install the MySQL connector JAR file.

    yum install mysql-connector-java*

SLES

    To install a new instance of MySQL:
  • Connect to the host machine you plan to use for Hive and HCatalog.

  • Install MySQL server.

    From a terminal window, enter:

    zypper install mysql-server

  • Start the instance.

    /etc/init.d/mysqld start

  • Set the root user password using the following command format:

    mysqladmin -u root password $mysqlpassword

    For example, to set the password to "root":

    mysqladmin -u root password root

  • Remove unnecessary information from log and STDOUT.

    mysqladmin -u root 2>&1 >/dev/null

  • Log in to MySQL as root:

    mysql -u root -proot

  • As root, create dbuser and grant it adequate privileges.

    This user provides access to the Hive metastore. Use the following series of commands (shown here with the returned responses) to create dbuser with password dbuser.

    [root@c6402 /]# mysql -u root -proot

    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 11
    Server version: 5.1.73 Source distribution

    Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

    Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
    Query OK, 0 rows affected (0.00 sec)

    mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser';
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';
    Query OK, 0 rows affected (0.00 sec)

    mysql> FLUSH PRIVILEGES;
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT OPTION;
    Query OK, 0 rows affected (0.00 sec)

    mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
    Query OK, 0 rows affected (0.00 sec)

    mysql>

  • Use the exit command to exit MySQL.

  • You should now be able to reconnect to the database as dbuser, using the following command:

    mysql -u dbuser -pdbuser

    After testing the dbuser login, use the exit command to exit MySQL.

  • Install the MySQL connector JAR file.

    zypper install mysql-connector-java*

Configuring Oracle as the metastore Database

You can select Oracle as the metastore database. For instructions on how to install the database, see your third-party documentation. Because the following procedure is Hive-specific, you must install PHD and Hive before making these configurations. See Set up OracleDB for use with Hive.

Virtualization and Cloud Platforms

PHD is certified and supported when running on virtual or cloud platforms (for example, VMware vSphere or Amazon Web Services EC2) as long as the respective guest operating system (OS) is supported by PHD and any issues detected on these platforms are reproducible on the same supported OS installed on bare metal.

See Meet Minimum System Requirements for the list of supported operating systems for PHD.

Configure the Repositories

The standard PHD install downloads the software from Pivotal Network to a local directory. See Installing Pivotal HD with Ambari 1.7 for information about downloading the distribution files and setting up the local repository.

Decide on Deployment Type

While it is possible to deploy all of PHD on a single host, this is appropriate only for initial evaluation. In general you should use at least four hosts: one master host and three slaves.

Collect Information

    To deploy your PHD installation, you need to collect the following information:
  • The fully qualified domain name (FQDN) for each host in your system, and which component(s) you want to set up on which host. You can use hostname -f to check for the FQDN if you do not know it.

  • The hostname, database name, username, and password for the metastore instance, if you install Hive/HCatalog.

Prepare the Environment

Enable NTP on the Cluster

    The clocks of all the nodes in your cluster must be able to synchronize with each other. If your system does not have access to the Internet, set up a master node as an NTP server. Use the following instructions to enable NTP for your cluster:
  • Configure NTP clients. Execute the following command on all the nodes in your cluster:

    • For RHEL/CentOS Linux:

      yum install ntp

    • For SLES:

      zypper install ntp

  • Enable the service. Execute the following command on all the nodes in your cluster.

    For RHEL/CentOS Linux:

    chkconfig ntpd on

  • Start NTP. Execute the following command on all the nodes in your cluster.

    For RHEL/CentOS Linux:

    /etc/init.d/ntpd start

    For SLES:

    /etc/init.d/ntp start

  • If you want to use the existing NTP server in your environment, configure the firewall on the local NTP server to enable UDP input traffic on port 123, and replace 192.0.2.0/24 with the IP addresses in the cluster. For example, on RHEL hosts you would use:

    # iptables -A RH-Firewall-1-INPUT -s 192.0.2.0/24 -m state --state NEW -p udp --dport 123 -j ACCEPT

  • Then, restart iptables. Execute the following command on all the nodes in your cluster:

    # service iptables restart

  • Finally, configure clients to use the local NTP server. Edit /etc/ntp.conf and add the following line:

    server $LOCAL_SERVER_IP OR HOSTNAME
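    For example, if the local NTP server were the hypothetical 192.168.0.11 host used in the DNS examples later in this section, each client's /etc/ntp.conf would contain a line like:

    server 192.168.0.11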

Check DNS

All hosts in your system must be configured for DNS and Reverse DNS.

    Use the following instructions to check DNS for all the host machines in your cluster:
  • Forward lookup checking. For example, for domain localdomain that contains host with name host01 and IP address 192.168.0.10, execute the following command:

    nslookup host01

    You should see a message similar to the following:

    Name: host01.localdomain
    Address: 192.168.0.10

  • Reverse lookup checking. For example, for domain localdomain that contains host with name host01 and IP address 192.168.0.10, execute the following command:

    nslookup 192.168.0.10

    You should see a message similar to the following:

    10.0.168.192.in-addr.arpa name = host01.localdomain.

    For all nodes of the cluster, add key-value pairs like the following to the /etc/hosts file:

    192.168.0.11 host01

If you do not receive valid responses (as shown above), set up a DNS zone in your cluster or configure host files on each host of the cluster using one of the following options:

  • Option I: Configure the hosts file on each node of the cluster. The following instructions use these example values: domain name: “localdomain”; nameserver: “host01”/192.168.0.11; hosts: “host02”/192.168.0.12, “host03”/192.168.0.13. (A sample /etc/hosts file appears after this list.)

  • Option II: Configure DNS using the BIND nameserver. The following instructions use these example values: domain name: “localdomain”; nameserver: “host01”/192.168.0.11; hosts: “host02”/192.168.0.12, “host03”/192.168.0.13.

    • Install BIND packages:

      yum install bind
      yum install bind-libs
      yum install bind-utils

    • Enable the named service:

      chkconfig named on

    • Configure files. Add the following lines for the example values given above (ensure that you modify these for your environment):

      • Edit /etc/resolv.conf (for all nodes in cluster) and add the following lines:

        domain localdomain
        search localdomain
        nameserver 192.168.0.11

      • Edit /etc/named.conf (for all nodes in cluster) and add the following lines:

        listen-on port 53 { any; }; // by default it is opened only for localhost
        ...
        zone "localdomain" {
           type master;
           notify no;
           allow-query { any; };
           file "named-forw.zone";
        };
        zone "0.168.192.in-addr.arpa" {
           type master;
           notify no;
           allow-query { any; };
           file "named-rev.zone";
        };

      • Edit named-forw.zone as shown in the following sample forward zone configuration file:

        $TTL 3D
        @       SOA  host01.localdomain. root.localdomain (201306030;3600;3600;3600;3600)
                NS   host01  ; Nameserver Address
        localhost    IN    A   127.0.0.1
        host01       IN    A   192.168.0.11
        host02       IN    A   192.168.0.12
        host03       IN    A   192.168.0.13

      • Edit the named-rev.zone as shown in the following sample reverse zone configuration file:

        $TTL 3D
        @    SOA  host01.localdomain. root.localdomain. (201306031;28800;2H;4W;1D);
             NS   host01.localdomain.; Nameserver Address
        11   IN   PTR   host01.localdomain.
        12   IN   PTR   host02.localdomain.
        13   IN   PTR   host03.localdomain.

    • Restart bind service.

      /etc/init.d/named restart

    • Add rules to firewall.

      iptables -A INPUT -p udp -m state --state NEW --dport 53 -j ACCEPT
      iptables -A INPUT -p tcp -m state --state NEW --dport 53 -j ACCEPT
      service iptables save
      service iptables restart

      Alternatively, you can also allow traffic over DNS port (53) using the system-config-firewall utility.
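For Option I above, a minimal sketch of the /etc/hosts entries on each node, using the example host names and addresses from this section (replace them with your own hosts):

    127.0.0.1      localhost
    192.168.0.11   host01.localdomain   host01
    192.168.0.12   host02.localdomain   host02
    192.168.0.13   host03.localdomain   host03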

Disable SELinux

The Security-Enhanced Linux (SELinux) feature should be disabled during the installation process.

  • Check state of SELinux. On all the host machines, execute the following command:

    getenforce

    If the result is Permissive or Disabled, no further action is required. If the result is Enforcing, proceed to the next step.

  • Disable SELinux either temporarily for each session or permanently.

    • Option I: Disable SELinux temporarily by executing the following command:

      setenforce 0

    • Option II: Disable SELinux permanently in the /etc/sysconfig/selinux file by changing the value of SELINUX field to permissive or disabled. Restart your system.
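      For Option II, the relevant line in /etc/sysconfig/selinux would look like the following after editing (a minimal sketch; restart the system afterwards):

      # /etc/sysconfig/selinux (excerpt)
      SELINUX=disabled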

Disable IPTables

Certain ports must be open and available during installation. The easiest way to do this is to temporarily disable iptables. If the security protocols at your installation do not allow you to disable iptables, you can proceed with them on, as long as all of the relevant ports are open and available. See Configuring Ports for more information.

On all RHEL/CentOS host machines, execute the following command to disable iptables:

chkconfig iptables off
service iptables stop

Restart iptables after your setup is complete.

Download Companion Files

We have provided a set of companion files, including script files (scripts.zip) and configuration files (configuration_files.zip), that you should download, fill in the TODO lines, and use throughout this process.

Edit the TODOs with values specific to your system or you will see errors.

Pivotal strongly recommends that you source usersAndGroups.sh and directories.sh from your ~/.bash_profile to set up these environment variables in your environment.

The following provides a snapshot of a sample script file to create Hadoop directories. This sample script file sources the files included in Companion Files.

#!/bin/bash
source ./usersAndGroups.sh
source ./directories.sh

echo "Create datanode local dir"
mkdir -p $DFS_DATA_DIR;

echo "Create namenode local dir"
mkdir -p $DFS_NN_DIR;

echo "Create secondarynamenode local dir"
mkdir -p $DFS_SN_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR;
chmod -R 750 $DFS_DATA_DIR;

echo "Create yarn local dir"
mkdir -p $YARN_LOCAL_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR;
chmod -R 755 $YARN_LOCAL_DIR;

echo "Create yarn local log dir"
mkdir -p $YARN_LOCAL_LOG_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR;
chmod -R 755 $YARN_LOCAL_LOG_DIR;

echo "Create zookeeper local data dir"
mkdir -p $ZOOKEEPER_DATA_DIR;
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_DATA_DIR;
chmod -R 755 $ZOOKEEPER_DATA_DIR; 

Define Environment Parameters

    You need to set up specific users and directories for your PHD installation using the following instructions:
  • Define directories.

    The following table describes the directories for install, configuration, data, process IDs and logs based on the Hadoop Services you plan to install. Use this table to define what you are going to use in setting up your environment.

    Define Directories for Core Hadoop

    Hadoop Service | Parameter | Definition
    HDFS | DFS_NAME_DIR | Space-separated list of directories where the NameNode should store the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
    HDFS | DFS_DATA_DIR | Space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn
    HDFS | FS_CHECKPOINT_DIR | Space-separated list of directories where the SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn
    HDFS | HDFS_LOG_DIR | Directory for storing the HDFS logs. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs where hdfs is the $HDFS_USER.
    HDFS | HDFS_PID_DIR | Directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs where hdfs is the $HDFS_USER.
    HDFS | HADOOP_CONF_DIR | Directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf
    YARN | YARN_LOCAL_DIR | Space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn /grid1/hadoop/yarn /grid2/hadoop/yarn
    YARN | YARN_LOG_DIR | Directory for storing the YARN logs. For example, /var/log/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER.
    YARN | YARN_LOCAL_LOG_DIR | Space-separated list of directories where YARN will store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/log
    YARN | YARN_PID_DIR | Directory for storing the YARN process ID. For example, /var/run/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER.
    MapReduce | MAPRED_LOG_DIR | Directory for storing the JobHistory Server logs. For example, /var/log/hadoop/mapred. This directory name is a combination of a directory and the $MAPRED_USER. In the example, mapred is the $MAPRED_USER.

    Define Directories for Ecosystem Components

    Hadoop Service | Parameter | Definition
    Pig | PIG_CONF_DIR | Directory to store the Pig configuration files. For example, /etc/pig/conf.
    Pig | PIG_LOG_DIR | Directory to store the Pig logs. For example, /var/log/pig.
    Pig | PIG_PID_DIR | Directory to store the Pig process ID. For example, /var/run/pig.
    Oozie | OOZIE_CONF_DIR | Directory to store the Oozie configuration files. For example, /etc/oozie/conf.
    Oozie | OOZIE_DATA | Directory to store the Oozie data. For example, /var/db/oozie.
    Oozie | OOZIE_LOG_DIR | Directory to store the Oozie logs. For example, /var/log/oozie.
    Oozie | OOZIE_PID_DIR | Directory to store the Oozie process ID. For example, /var/run/oozie.
    Oozie | OOZIE_TMP_DIR | Directory to store the Oozie temporary files. For example, /var/tmp/oozie.
    Hive | HIVE_CONF_DIR | Directory to store the Hive configuration files. For example, /etc/hive/conf.
    Hive | HIVE_LOG_DIR | Directory to store the Hive logs. For example, /var/log/hive.
    Hive | HIVE_PID_DIR | Directory to store the Hive process ID. For example, /var/run/hive.
    WebHCat | WEBHCAT_CONF_DIR | Directory to store the WebHCat configuration files. For example, /etc/hcatalog/conf/webhcat.
    WebHCat | WEBHCAT_LOG_DIR | Directory to store the WebHCat logs. For example, /var/log/webhcat.
    WebHCat | WEBHCAT_PID_DIR | Directory to store the WebHCat process ID. For example, /var/run/webhcat.
    HBase | HBASE_CONF_DIR | Directory to store the HBase configuration files. For example, /etc/hbase/conf.
    HBase | HBASE_LOG_DIR | Directory to store the HBase logs. For example, /var/log/hbase.
    HBase | HBASE_PID_DIR | Directory to store the HBase process ID. For example, /var/run/hbase.
    ZooKeeper | ZOOKEEPER_DATA_DIR | Directory where ZooKeeper will store data. For example, /grid/hadoop/zookeeper/data
    ZooKeeper | ZOOKEEPER_CONF_DIR | Directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
    ZooKeeper | ZOOKEEPER_LOG_DIR | Directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
    ZooKeeper | ZOOKEEPER_PID_DIR | Directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.

    If you use the Companion files, the following provides a snapshot of how your directories.sh file should look after you edit the TODO variables:

    #!/bin/sh
    #
    # Directories Script
    #
    # 1. To use this script, you must edit the TODO variables below for your environment.
    #
    # 2. Warning: Leave the other parameters as the default values. Changing these default
    #    values will require you to change values in other configuration files.
    #

    #
    # Hadoop Service - HDFS
    #

    # Space separated list of directories where NameNode will store file system image.
    # For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
    DFS_NAME_DIR="TODO-LIST-OF-NAMENODE-DIRS";

    # Space separated list of directories where DataNodes will store the blocks.
    # For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn
    DFS_DATA_DIR="TODO-LIST-OF-DATA-DIRS";

    # Space separated list of directories where SecondaryNameNode will store checkpoint image.
    # For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn
    FS_CHECKPOINT_DIR="TODO-LIST-OF-SECONDARY-NAMENODE-DIRS";

    # Directory to store the HDFS logs.
    HDFS_LOG_DIR="/var/log/hadoop/hdfs";

    # Directory to store the HDFS process ID.
    HDFS_PID_DIR="/var/run/hadoop/hdfs";

    # Directory to store the Hadoop configuration files.
    HADOOP_CONF_DIR="/etc/hadoop/conf";

    #
    # Hadoop Service - YARN
    #

    # Space separated list of directories where YARN will store temporary data.
    # For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local
    YARN_LOCAL_DIR="TODO-LIST-OF-YARN-LOCAL-DIRS";

    # Directory to store the YARN logs.
    YARN_LOG_DIR="/var/log/hadoop/yarn";

    # Space separated list of directories where YARN will store container log data.
    # For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs
    YARN_LOCAL_LOG_DIR="TODO-LIST-OF-YARN-LOCAL-LOG-DIRS";

    # Directory to store the YARN process ID.
    YARN_PID_DIR="/var/run/hadoop/yarn";

    #
    # Hadoop Service - MAPREDUCE
    #

    # Directory to store the MapReduce daemon logs.
    MAPRED_LOG_DIR="/var/log/hadoop/mapred";

    # Directory to store the mapreduce jobhistory process ID.
    MAPRED_PID_DIR="/var/run/hadoop/mapred";

    #
    # Hadoop Service - Hive
    #

    # Directory to store the Hive configuration files.
    HIVE_CONF_DIR="/etc/hive/conf";

    # Directory to store the Hive logs.
    HIVE_LOG_DIR="/var/log/hive";

    # Directory to store the Hive process ID.
    HIVE_PID_DIR="/var/run/hive";

    #
    # Hadoop Service - WebHCat (Templeton)
    #

    # Directory to store the WebHCat (Templeton) configuration files.
    WEBHCAT_CONF_DIR="/etc/hcatalog/conf/webhcat";

    # Directory to store the WebHCat (Templeton) logs.
    WEBHCAT_LOG_DIR="/var/log/webhcat";

    # Directory to store the WebHCat (Templeton) process ID.
    WEBHCAT_PID_DIR="/var/run/webhcat";

    #
    # Hadoop Service - HBase
    #

    # Directory to store the HBase configuration files.
    HBASE_CONF_DIR="/etc/hbase/conf";

    # Directory to store the HBase logs.
    HBASE_LOG_DIR="/var/log/hbase";

    # Directory to store the HBase process ID.
    HBASE_PID_DIR="/var/run/hbase";

    #
    # Hadoop Service - ZooKeeper
    #

    # Directory where ZooKeeper will store data. For example, /grid1/hadoop/zookeeper/data
    ZOOKEEPER_DATA_DIR="TODO-ZOOKEEPER-DATA-DIR";

    # Directory to store the ZooKeeper configuration files.
    ZOOKEEPER_CONF_DIR="/etc/zookeeper/conf";

    # Directory to store the ZooKeeper logs.
    ZOOKEEPER_LOG_DIR="/var/log/zookeeper";

    # Directory to store the ZooKeeper process ID.
    ZOOKEEPER_PID_DIR="/var/run/zookeeper";

    #
    # Hadoop Service - Pig
    #

    # Directory to store the Pig configuration files.
    PIG_CONF_DIR="/etc/pig/conf";

    # Directory to store the Pig logs.
    PIG_LOG_DIR="/var/log/pig";

    # Directory to store the Pig process ID.
    PIG_PID_DIR="/var/run/pig";

    #
    # Hadoop Service - Oozie
    #

    # Directory to store the Oozie configuration files.
    OOZIE_CONF_DIR="/etc/oozie/conf"

    # Directory to store the Oozie data.
    OOZIE_DATA="/var/db/oozie"

    # Directory to store the Oozie logs.
    OOZIE_LOG_DIR="/var/log/oozie"

    # Directory to store the Oozie process ID.
    OOZIE_PID_DIR="/var/run/oozie"

    # Directory to store the Oozie temporary files.
    OOZIE_TMP_DIR="/var/tmp/oozie"

  • The following table describes system user accounts and groups. Use this table to define what you are going to use in setting up your environment. These users and groups should reflect the accounts you create in Create System Users and Groups. The scripts.zip file you downloaded includes a script, usersAndGroups.sh, for setting user and group environment parameters.

    Define Users and Groups for Systems

    Parameter | Definition
    HDFS_USER | User that owns the HDFS services. For example, hdfs.
    YARN_USER | User that owns the YARN services. For example, yarn.
    ZOOKEEPER_USER | User that owns the ZooKeeper services. For example, zookeeper.
    HIVE_USER | User that owns the Hive services. For example, hive.
    WEBHCAT_USER | User that owns the WebHCat services. For example, hcat.
    HBASE_USER | User that owns the HBase services. For example, hbase.
    OOZIE_USER | User that owns the Oozie services. For example, oozie.
    HADOOP_GROUP | A common group shared by services. For example, hadoop.
    KNOX_USER | User that owns the Knox Gateway services. For example, knox.
    NAGIOS_USER | User that owns the Nagios services. For example, nagios.

[Optional] Create System Users and Groups

In general Hadoop services should be owned by specific users and not by root or application users. The table below shows the typical users for Hadoop services. If you choose to install the PHD components using the RPMs, these users will automatically be set up.

If you do not install with the RPMs, or want different users, then you must identify the users that you want for your Hadoop services and the common Hadoop group and create these accounts on your system.

To create these accounts manually, you must:

  • Add the user to the group.

    useradd -G <groupname> <username>

  • Create the username directory.

    hdfs dfs -mkdir /user/<username>

  • Give that account ownership over its directory.

    hdfs dfs -chown <username>:<groupname> /user/<username>
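For example, a sketch of creating a hypothetical hive account in the hadoop group with its own HDFS home directory (the HDFS commands are typically run as the HDFS superuser, for example hdfs; adjust names for your environment):

    useradd -G hadoop hive
    su - hdfs -c "hdfs dfs -mkdir /user/hive"
    su - hdfs -c "hdfs dfs -chown hive:hadoop /user/hive"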

Typical System Users and Groups

Hadoop Service | User | Group
HDFS | hdfs | hadoop
YARN | yarn | hadoop
MapReduce | mapred | hadoop, mapred
Hive | hive | hadoop
HCatalog/WebHCatalog | hcat | hadoop
HBase | hbase | hadoop
ZooKeeper | zookeeper | hadoop
Oozie | oozie | hadoop
Knox Gateway | knox | hadoop
Nagios | nagios | nagios

Determine PHD Memory Configuration Settings

Two methods can be used to determine YARN and MapReduce memory configuration settings: running the PHD utility script, or manually calculating the values.

The PHD utility script is the recommended method for calculating PHD memory configuration settings, but information about manually calculating YARN and MapReduce memory configuration settings is also provided for reference.

Running the PHD Utility Script

This section describes how to use the phd-configuration-utils.py Python script to calculate YARN, MapReduce, Hive, and Tez memory allocation settings based on the node hardware specifications. The phd-configuration-utils.py script is included in the PHD companion files.

To run the phd-configuration-utils.py script, execute the following command from the folder containing the script:

python phd-configuration-utils.py options

where options are as follows:

Option | Description
-c CORES | The number of cores on each host.
-m MEMORY | The amount of memory on each host, in GB.
-d DISKS | The number of disks on each host.
-k HBASE | "True" if HBase is installed, "False" if not.

Example

Running the following command:

python phd-configuration-utils.py -c 16 -m 64 -d 4 -k True

Would return:

Using cores=16 memory=64GB disks=4 hbase=True
Profile: cores=16 memory=49152MB reserved=16GB usableMem=48GB disks=4
Num Container=8
Container Ram=6144MB
Used Ram=48GB
Unused Ram=16GB
yarn.scheduler.minimum-allocation-mb=6144
yarn.scheduler.maximum-allocation-mb=49152
yarn.nodemanager.resource.memory-mb=49152
mapreduce.map.memory.mb=6144
mapreduce.map.java.opts=-Xmx4096m
mapreduce.reduce.memory.mb=6144
mapreduce.reduce.java.opts=-Xmx4096m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4096m
mapreduce.task.io.sort.mb=1792
tez.am.resource.memory.mb=6144
tez.am.launch.cmd-opts=-Xmx4096m
hive.tez.container.size=6144
hive.tez.java.opts=-Xmx4096m
hive.auto.convert.join.noconditionaltask.size=1342177000

Write down these values, as you will need them to configure Hadoop.

Manually Calculating YARN and MapReduce Memory Configuration Settings

This section describes how to manually configure YARN and MapReduce memory allocation settings based on the node hardware specifications.

YARN takes into account all of the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications (such as MapReduce) running in the cluster. YARN then provides processing capacity to each application by allocating Containers. A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, cpu etc.).

In a Hadoop cluster, it is vital to balance the usage of memory (RAM), processors (CPU cores) and disks so that processing is not constrained by any one of these cluster resources. As a general recommendation, allowing for two Containers per disk and per core gives the best balance for cluster utilization.

    When determining the appropriate YARN and MapReduce memory configurations for a cluster node, start with the available hardware resources. Specifically, note the following values on each node:
  • RAM (Amount of memory)

  • CORES (Number of CPU cores)

  • DISKS (Number of disks)

The total available RAM for YARN and MapReduce should take into account the Reserved Memory. Reserved Memory is the RAM needed by system processes and other Hadoop processes (such as HBase).

Reserved Memory = Reserved for stack memory + Reserved for HBase Memory (If HBase is on the same node).

Use the following table to determine the Reserved Memory per node.

Reserved Memory Recommendations

Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory
4 GB | 1 GB | 1 GB
8 GB | 2 GB | 1 GB
16 GB | 2 GB | 2 GB
24 GB | 4 GB | 4 GB
48 GB | 6 GB | 8 GB
64 GB | 8 GB | 8 GB
72 GB | 8 GB | 8 GB
96 GB | 12 GB | 16 GB
128 GB | 24 GB | 24 GB
256 GB | 32 GB | 32 GB
512 GB | 64 GB | 64 GB

The next calculation is to determine the maximum number of containers allowed per node. The following formula can be used:

# of containers = min (2*CORES, 1.8*DISKS, (Total available RAM) / MIN_CONTAINER_SIZE)

Where DISKS is the value for dfs.data.dirs (# of data disks) per machine.

And MIN_CONTAINER_SIZE is the minimum container size (in RAM). This value is dependent on the amount of RAM available -- in smaller memory nodes, the minimum container size should also be smaller. The following table outlines the recommended values:

Total RAM per Node | Recommended Minimum Container Size
Less than 4 GB | 256 MB
Between 4 GB and 8 GB | 512 MB
Between 8 GB and 24 GB | 1024 MB
Above 24 GB | 2048 MB

The final calculation is to determine the amount of RAM per container:

RAM-per-container = max(MIN_CONTAINER_SIZE, (Total Available RAM) / containers)
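The two formulas above can be applied with a small shell sketch like the following (not part of the official companion files; the node profile values at the top are assumptions you replace with your own hardware numbers):

#!/bin/bash
# Hypothetical node profile: adjust these for your hardware.
CORES=12          # CPU cores per node
DISKS=12          # data disks per node
TOTAL_RAM=42      # GB available to YARN after subtracting Reserved Memory
MIN_CONTAINER=2   # GB, from the minimum container size table above

# containers = min(2*CORES, 1.8*DISKS, TOTAL_RAM / MIN_CONTAINER_SIZE)
CONTAINERS=$(awk -v c="$CORES" -v d="$DISKS" -v r="$TOTAL_RAM" -v m="$MIN_CONTAINER" '
  BEGIN {
    a = 2 * c; b = 1.8 * d; e = r / m;
    min = a; if (b < min) min = b; if (e < min) min = e;
    printf "%d", min
  }')

# RAM-per-container = max(MIN_CONTAINER_SIZE, TOTAL_RAM / containers)
RAM_PER_CONTAINER=$(( TOTAL_RAM / CONTAINERS ))
if [ "$RAM_PER_CONTAINER" -lt "$MIN_CONTAINER" ]; then
  RAM_PER_CONTAINER=$MIN_CONTAINER
fi

echo "containers=$CONTAINERS  RAM-per-container=${RAM_PER_CONTAINER} GB"
# With the values above (12 cores, 12 disks, 48 GB RAM minus 6 GB reserved),
# this prints containers=21 and RAM-per-container=2 GB, matching the
# no-HBase example later in this section.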

With these calculations, the YARN and MapReduce configurations can be set:

Configuration File | Configuration Setting | Value Calculation
yarn-site.xml | yarn.nodemanager.resource.memory-mb | = containers * RAM-per-container
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | = RAM-per-container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | = containers * RAM-per-container
mapred-site.xml | mapreduce.map.memory.mb | = RAM-per-container
mapred-site.xml | mapreduce.reduce.memory.mb | = 2 * RAM-per-container
mapred-site.xml | mapreduce.map.java.opts | = 0.8 * RAM-per-container
mapred-site.xml | mapreduce.reduce.java.opts | = 0.8 * 2 * RAM-per-container
yarn-site.xml | yarn.app.mapreduce.am.resource.mb | = 2 * RAM-per-container
yarn-site.xml | yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * RAM-per-container

Note: After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf folder.

Examples

Cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks.

Reserved Memory = 6 GB reserved for system memory + (if HBase) 8 GB for HBase
Minimum container size = 2 GB

If there is no HBase:

# of containers = min (2*12, 1.8* 12, (48-6)/2) = min (24, 21.6, 21) = 21

RAM-per-container = max (2, (48-6)/21) = max (2, 2) = 2

Configuration | Value Calculation
yarn.nodemanager.resource.memory-mb | = 21 * 2 = 42*1024 MB
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB
yarn.scheduler.maximum-allocation-mb | = 21 * 2 = 42*1024 MB
mapreduce.map.memory.mb | = 2*1024 MB
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts | = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts | = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb | = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * 2 = 3.2*1024 MB

If HBase is included:

# of containers = min (2*12, 1.8* 12, (48-6-8)/2) = min (24, 21.6, 17) = 17

RAM-per-container = max (2, (48-6-8)/17) = max (2, 2) = 2

Configuration | Value Calculation
yarn.nodemanager.resource.memory-mb | = 17 * 2 = 34*1024 MB
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB
yarn.scheduler.maximum-allocation-mb | = 17 * 2 = 34*1024 MB
mapreduce.map.memory.mb | = 2*1024 MB
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts | = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts | = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb | = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * 2 = 3.2*1024 MB

    Notes:
  • Updating yarn.scheduler.minimum-allocation-mb without also changing yarn.nodemanager.resource.memory-mb (or changing yarn.nodemanager.resource.memory-mb without also changing yarn.scheduler.minimum-allocation-mb) changes the number of containers per node.

  • If your installation has high RAM but not many disks/cores, you can free up RAM for other tasks by lowering both yarn.scheduler.minimum-allocation-mb and yarn.nodemanager.resource.memory-mb.

  • With MapReduce on YARN, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Map and Reduce tasks as needed by each job. In our example cluster, with the above configurations, YARN will be able to allocate up to 10 Mappers (40/4) or 5 Reducers (40/8) on each node (or some other combination of Mappers and Reducers within the 40 GB per node limit).

Configuring NameNode Heap Size

NameNode heap size depends on many factors such as the number of files, the number of blocks, and the load on the system. The following table provides recommendations for NameNode heap size configuration. These settings should work for typical Hadoop clusters where number of blocks is very close to number of files (generally the average ratio of number of blocks per file in a system is 1.1 to 1.2). Some clusters may require further tweaking of the following settings. Also, it is generally better to set the total Java heap to a higher value.

Number of Files (in millions) | Total Java Heap (Xmx and Xms) | Young Generation Size (-XX:NewSize -XX:MaxNewSize)
< 1 million files | 1024m | 128m
1-5 million files | 3072m | 512m
5-10 | 5376m | 768m
10-20 | 9984m | 1280m
20-30 | 14848m | 2048m
30-40 | 19456m | 2560m
40-50 | 24320m | 3072m
50-70 | 33536m | 4352m
70-100 | 47872m | 6144m
70-125 | 59648m | 7680m
100-150 | 71424m | 8960m
150-200 | 94976m | 8960m

You should also set -XX:PermSize to 128m and -XX:MaxPermSize to 256m.

The following are the recommended settings for HADOOP_NAMENODE_OPTS in the hadoop-env.sh file (replace the ##### placeholder for -XX:NewSize, -XX:MaxNewSize, -Xms, and -Xmx with the recommended values from the table):

-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=##### -XX:MaxNewSize=##### -Xms##### -Xmx##### -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps  -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}
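For example, a sketch of the corresponding hadoop-env.sh line for a cluster in the 1-5 million files row of the table above (3072m total heap, 512m young generation); use the row that matches your cluster:

export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=512m -XX:MaxNewSize=512m -Xms3072m -Xmx3072m -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}"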

If the cluster uses a Secondary NameNode, you should also set HADOOP_SECONDARYNAMENODE_OPTS to HADOOP_NAMENODE_OPTS in the hadoop-env.sh file:

HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS

Another useful HADOOP_NAMENODE_OPTS setting is -XX:+HeapDumpOnOutOfMemoryError. This option specifies that a heap dump should be executed when an out of memory error occurs. You should also use -XX:HeapDumpPath to specify the location for the heap dump file. For example:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./etc/heapdump.hprof

Allocate Adequate Log Space for PHD

Logs are an important part of managing and operating your PHD cluster. The directories and disks that you assign for logging in PHD must have enough space to maintain logs during PHD operations. Allocate at least 10 GB of free space for any disk you want to use for PHD logging.

Installing HDFS and YARN

This section describes how to install the Hadoop Core components, HDFS, YARN, and MapReduce.

Set Default File and Directory Permissions

Set the default file and directory permissions to 0022 (022).

Use the umask command to confirm and set as necessary.

Ensure that the umask is set for all terminal sessions that you use during installation.
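For example, a minimal sketch of setting the umask for the current session and persisting it for login shells (this assumes your login shells read their defaults from /etc/profile):

umask 0022
echo umask 0022 >> /etc/profile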

Install the Hadoop Packages

Execute the following command on all cluster nodes.

  • For RHEL/CentOS Linux:

    yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl

  • For SLES:

    zypper install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl

Install Compression Libraries

Make the following compression libraries available on all the cluster nodes.

Install Snappy

Install Snappy on all the nodes in your cluster. At each node:

  • For RHEL/CentOS Linux:

    yum install snappy snappy-devel

  • For SLES:

    zypper install snappy snappy-devel

Install LZO

Execute the following command at all the nodes in your cluster:

  • RHEL/CentOS Linux:

    yum install lzo lzo-devel hadooplzo hadooplzo-native

  • For SLES:

    zypper install lzo lzo-devel hadooplzo hadooplzo-native

Create Directories

Create directories and configure ownership and permissions on the appropriate hosts as described below.

Create the NameNode Directories

On the node that hosts the NameNode service, execute the following commands:

mkdir -p $DFS_NAME_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $DFS_NAME_DIR; chmod -R 755 $DFS_NAME_DIR;

    where:
  • $DFS_NAME_DIR is the space separated list of directories where NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Create the SecondaryNameNode Directories

On all the nodes that can potentially run the SecondaryNameNode service, execute the following commands:

mkdir -p $FS_CHECKPOINT_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $FS_CHECKPOINT_DIR; chmod -R 755 $FS_CHECKPOINT_DIR;

    where:
  • $FS_CHECKPOINT_DIR is the space-separated list of directories where SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn.

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Create DataNode and YARN NodeManager Local Directories

At each DataNode, execute the following commands:

mkdir -p $DFS_DATA_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR; chmod -R 750 $DFS_DATA_DIR;

    where:
  • $DFS_DATA_DIR is the space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

At the ResourceManager and all DataNodes, execute the following commands:

mkdir -p $YARN_LOCAL_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR; chmod -R 755 $YARN_LOCAL_DIR;

    where:
  • $YARN_LOCAL_DIR is the space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local.

  • $YARN_USER is the user owning the YARN services. For example, yarn.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

On the ResourceManager and all DataNodes, execute the following commands:

mkdir -p $YARN_LOCAL_LOG_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR; chmod -R 755 $YARN_LOCAL_LOG_DIR;

    where:
  • $YARN_LOCAL_LOG_DIR is the space-separated list of directories where YARN will store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs.

  • $YARN_USER is the user owning the YARN services. For example, yarn.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Create the Log and PID Directories

At all nodes, execute the following commands:

mkdir -p $HDFS_LOG_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_LOG_DIR; chmod -R 755 $HDFS_LOG_DIR;

where:

  • $HDFS_LOG_DIR is the directory for storing the HDFS logs.

    This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER.

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

mkdir -p $YARN_LOG_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOG_DIR; chmod -R 755 $YARN_LOG_DIR;

where:

  • $YARN_LOG_DIR is the directory for storing the YARN logs.

    This directory name is a combination of a directory and the $YARN_USER. For example, /var/log/hadoop/yarn, where yarn is the $YARN_USER.

  • $YARN_USER is the user owning the YARN services. For example, yarn.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

mkdir -p $HDFS_PID_DIR; chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_PID_DIR; chmod -R 755 $HDFS_PID_DIR

    where:
  • $HDFS_PID_DIR is the directory for storing the HDFS process ID.

    This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs where hdfs is the $HDFS_USER.

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

mkdir -p $YARN_PID_DIR; chown -R $YARN_USER:$HADOOP_GROUP $YARN_PID_DIR; chmod -R 755 $YARN_PID_DIR;

    where:
  • $YARN_PID_DIR is the directory for storing the YARN process ID.

    This directory name is a combination of a directory and the $YARN_USER. For example, /var/run/hadoop/yarn where yarn is the $YARN_USER.

  • $YARN_USER is the user owning the YARN services. For example, yarn.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

mkdir -p $MAPRED_LOG_DIR; chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_LOG_DIR; chmod -R 755 $MAPRED_LOG_DIR;

where:

  • $MAPRED_LOG_DIR is the directory for storing the JobHistory Server logs.

    This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/log/hadoop/mapred where mapred is the $MAPRED_USER.

  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

mkdir -p $MAPRED_PID_DIR; chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_PID_DIR; chmod -R 755 $MAPRED_PID_DIR;

where:

  • $MAPRED_PID_DIR is the directory for storing the JobHistory Server process ID.

    This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/run/hadoop/mapred where mapred is the $MAPRED_USER.

  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.

  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Install the ZooKeeper RPMs

    In a terminal window, type:
  • For RHEL/CentOS Linux:

    yum install zookeeper

  • For SLES:

    zypper install zookeeper

Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below.

    If any of these directories already exist, we recommend deleting and recreating them. Use the following instructions to create appropriate directories:
  • We strongly suggest that you edit and source the bash script files included with the companion files (downloaded in Download Companion Files).

    Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables in your environment.

  • Execute the following commands on all nodes:

    mkdir -p $ZOO_LOG_DIR; chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOO_LOG_DIR; chmod -R 755 $ZOO_LOG_DIR;
    mkdir -p $ZOOPIDFILE; chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOPIDFILE; chmod -R 755 $ZOOPIDFILE;
    mkdir -p $ZOO_DATA_DIR; chmod -R 755 $ZOO_DATA_DIR; chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOO_DATA_DIR

    where:

    • $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.

    • $ZOO_LOG_DIR is the directory to store the ZooKeeper logs. For example, /var/log/zookeeper.

    • $ZOOPIDFILE is the directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.

    • $ZOO_DATA_DIR is the directory where ZooKeeper will store data. For example, /grid/hadoop/zookeeper/data.

  • Initialize the ZooKeeper data directories with the 'myid' file. Create one file per ZooKeeper server, and put the number of that server in each file.

    vi $ZOO_DATA_DIR/myid

    In the myid file on the first server, enter the corresponding number: 1

    In the myid file on the second server, enter the corresponding number: 2

    In the myid file on the third server, enter the corresponding number: 3
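    For example, a sketch of creating the file on the first ZooKeeper server (repeat with 2 and 3 on the other servers):

    echo 1 > $ZOO_DATA_DIR/myid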

Set Up the Configuration Files

There are several configuration files that need to be set up for ZooKeeper.

  • Extract the ZooKeeper configuration files to a temporary directory.

    The files are located in the configuration_files/zookeeper directories where you decompressed the companion files.

  • Modify the configuration files.

    In the respective temporary directories, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace.

    You must make changes to zookeeper-env.sh specific to your environment.

  • Edit zoo.cfg and modify the following properties:

    dataDir=$zk.data.directory.path
    server.1=$zk.server1.full.hostname:2888:3888
    server.2=$zk.server2.full.hostname:2888:3888
    server.3=$zk.server3.full.hostname:2888:3888

  • Edit hbase-site.xml and modify the following properties:

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>$zk.server1.full.hostname,$zk.server2.full.hostname,$zk.server3.full.hostname</value>
      <description>Comma separated list of Zookeeper servers (match to what is specified in zoo.cfg but without port numbers)</description>
    </property>

  • Copy the configuration files

    • On all hosts create the config directory:

      rm -r $ZOOKEEPER_CONF_DIR ; mkdir -p $ZOOKEEPER_CONF_DIR ;

    • Copy all the ZooKeeper configuration files to the $ZOOKEEPER_CONF_DIR directory.

    • Set appropriate permissions:

      chmod a+x $ZOOKEEPER_CONF_DIR/; chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_CONF_DIR/../; chmod -R 755 $ZOOKEEPER_CONF_DIR/../

      • $ZOOKEEPER_CONF_DIR is the directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.

      • $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.

Start ZooKeeper

To install and configure HBase and other Hadoop ecosystem components, you must start the ZooKeeper service and the ZKFC:

su - zookeeper -c "export ZOOCFGDIR=/usr/phd/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; source /usr/phd/current/zookeeper-server/conf/zookeeper-env.sh ; /usr/phd/current/zookeeper-server/bin/zkServer.sh start"

/usr/phd/current/hadoop-client/sbin/hadoop-daemon.sh start zkfc

  • $ZOOCFGDIR is the directory where the ZooKeeper server configs are stored.

Setting Up the Hadoop Configuration

This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.

    Use the following instructions to set up Hadoop configuration files:
  • We strongly suggest that you edit and source the bash script files included with the companion files (downloaded in Download Companion Files).

    Alternatively, you can set up these environment variables by copying the contents to your ~/.bash_profile.

  • Extract the core Hadoop configuration files to a temporary directory.

    The files are located in the configuration_files/core_hadoop directory where you decompressed the companion files.

  • Modify the configuration files.

    In the temporary directory, locate the following files and modify the properties based on your environment.

    Search for TODO in the files for the properties to replace. For further information, see Define Environment Parameters.

    • Edit core-site.xml and modify the following properties:

      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://$namenode.full.hostname:8020</value>
        <description>Enter your NameNode hostname</description>
      </property>

    • Edit hdfs-site.xml and modify the following properties:

      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>
        <description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
      </property>

      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
        <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description>
      </property>

      <property>
        <name>dfs.namenode.http-address</name>
        <value>$namenode.full.hostname:50070</value>
        <description>Enter your NameNode hostname for http access.</description>
      </property>

      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>$secondary.namenode.full.hostname:50090</value>
        <description>Enter your Secondary NameNode hostname.</description>
      </property>

      <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
        <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
      </property>

      <property>
        <name>dfs.namenode.checkpoint.edits.dir</name>
        <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
        <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
      </property>

    • Edit yarn-site.xml and modify the following properties:

      <property>
          <name>yarn.resourcemanager.scheduler.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
      </property>

      <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>$resourcemanager.full.hostname:8025</value>
          <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>$resourcemanager.full.hostname:8030</value>
          <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
          <name>yarn.resourcemanager.address</name>
          <value>$resourcemanager.full.hostname:8050</value>
          <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
          <name>yarn.resourcemanager.admin.address</name>
          <value>$resourcemanager.full.hostname:8141</value>
          <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
          <name>yarn.nodemanager.local-dirs</name>
          <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
          <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
      </property>

      <property>
          <name>yarn.nodemanager.log-dirs</name>
          <value>/grid/hadoop/yarn/log</value>
          <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
      </property>

      <property>
          <name>yarn.log.server.url</name>
          <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
          <description>URL for job history server</description>
      </property>

      <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>$resourcemanager.full.hostname:8088</value>
          <description>URL for the ResourceManager web interface</description>
      </property>

    • Edit mapred-site.xml and modify the following properties:

      <property>
          <name>mapreduce.jobhistory.address</name>
          <value>$jobhistoryserver.full.hostname:10020</value>
          <description>Enter your JobHistoryServer hostname.</description>
      </property>

      <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>$jobhistoryserver.full.hostname:19888</value>
          <description>Enter your JobHistoryServer hostname.</description>
      </property>

    • Optional: Configure MapReduce to use Snappy Compression

      To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.

      • Add the following properties to mapred-site.xml:

        <property>
            <name>mapreduce.admin.map.child.java.opts</name>
            <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/phd/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
            <final>true</final>
        </property>

        <property>
            <name>mapreduce.admin.reduce.child.java.opts</name>
            <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/phd/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
            <final>true</final>
        </property>

      • Add the SnappyCodec to the codecs list in core-site.xml:

        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
        </property>

    • Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Uncomment the memory and CPU settings and fill in values appropriate for your environment, as suggested by the documentation or the helper scripts.

    • Copy the configuration files.

      • On all hosts in your cluster, create the Hadoop configuration directory:

        rm -r $HADOOP_CONF_DIR
        mkdir -p $HADOOP_CONF_DIR

        where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

      • Copy all the configuration files to $HADOOP_CONF_DIR.

      • Set appropriate permissions:

        chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../
        chmod -R 755 $HADOOP_CONF_DIR/../

        where:

        • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

        • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
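
If you manage many hosts, the copy step above can be scripted. The following is a minimal sketch, not part of the product: it assumes the edited files are staged in a hypothetical /tmp/core_hadoop directory, that a hostlist.txt file lists one cluster host per line, that passwordless SSH as root is available, and that the environment variables from the companion scripts are already set on the machine you run it from:

    # Push the edited core Hadoop configuration to every host (variables expand locally)
    for host in $(cat hostlist.txt); do
        ssh "$host" "rm -rf $HADOOP_CONF_DIR && mkdir -p $HADOOP_CONF_DIR"
        scp /tmp/core_hadoop/* "$host:$HADOOP_CONF_DIR/"
        ssh "$host" "chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../ && chmod -R 755 $HADOOP_CONF_DIR/../"
    done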

Validating the Core Hadoop Installation

Format and Start HDFS

  • Execute these commands on the NameNode host machine:

    su - hdfs
    /usr/phd/current/hadoop-hdfs-namenode/../hadoop/bin/hdfs namenode -format
    /usr/phd/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
  • Execute these commands on the SecondaryNameNode:

    su - hdfs
    /usr/phd/current/hadoop-hdfs-secondarynamenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode

  • Execute these commands on all DataNodes:

    su - hdfs
    /usr/phd/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode

Smoke Test HDFS

  • First, see if you can reach the NameNode server with your browser:

    http://$namenode.full.hostname:50070

  • Create the hdfs user directory in HDFS:

    su $HDFS_USER
    hdfs dfs -mkdir -p /user/hdfs

  • Try copying a file into HDFS and listing that file:

    su $HDFS_USER
    hdfs dfs -copyFromLocal /etc/passwd passwd
    hdfs dfs -ls

  • Test browsing HDFS:

    http://$datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$namenode.full.hostname:8020
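
    If you prefer a command-line check and WebHDFS is enabled (dfs.webhdfs.enabled=true), you can also list the directory you created through the NameNode REST endpoint; the hostname placeholder matches the browser checks above:

    curl -i "http://$namenode.full.hostname:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS"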

Configure YARN and MapReduce

After you install Hadoop, modify your configuration files as described below.

  • Upload the MapReduce tarball to HDFS. As the HDFS user, for example 'hdfs':


    su $HDFS_USER
    hdfs dfs -mkdir -p /phd/apps/<phd_version>/mapreduce/
    hdfs dfs -put /usr/phd/current/hadoop-client/mapreduce.tar.gz /phd/apps/<phd_version>/mapreduce/
    hdfs dfs -chown -R hdfs:hadoop /phd
    hdfs dfs -chmod -R 555 /phd/apps/<phd_version>/mapreduce
    hdfs dfs -chmod -R 444 /phd/apps/<phd_version>/mapreduce/mapreduce.tar.gz

    Where $HDFS_USER is the HDFS user, for example hdfs, and <phd_version> is the current PHD version, for example 3.0.0.0-249.

  • Copy mapred-site.xml from the companion files and make the following changes to mapred-site.xml:

    • Add:

      <property>
          <name>mapreduce.admin.map.child.java.opts</name>
          <value>-server -Djava.net.preferIPv4Stack=true -Dphd.version=${phd.version}</value>
          <final>true</final>
      </property>

    • Modify the following existing properties to include ${phd.version}:

      <property>
          <name>mapreduce.admin.user.env</name>
          <value>LD_LIBRARY_PATH=/usr/phd/${phd.version}/hadoop/lib/native:/usr/phd/${phd.version}/hadoop/lib/native/Linux-amd64-64</value>
      </property>

      <property>
          <name>mapreduce.application.framework.path</name>
          <value>/phd/apps/${phd.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
      </property>

      <property>
          <name>mapreduce.application.classpath</name>
          <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/phd/${phd.version}/hadoop/lib/hadoop-lzo-0.6.0.${phd.version}.jar:/etc/hadoop/conf/secure</value>
      </property>

  • Copy yarn-site.xml from the companion files and modify:

    <property>
        <name>yarn.application.classpath</name>
        <value>$HADOOP_CONF_DIR,/usr/phd/${phd.version}/hadoop-client/*,/usr/phd/${phd.version}/hadoop-client/lib/*,/usr/phd/${phd.version}/hadoop-hdfs-client/*,/usr/phd/${phd.version}/hadoop-hdfs-client/lib/*,/usr/phd/${phd.version}/hadoop-yarn-client/*,/usr/phd/${phd.version}/hadoop-yarn-client/lib/*</value>
    </property>

  • For secure clusters, you must create and configure the container-executor.cfg configuration file:

    • Create the container-executor.cfg file in /etc/hadoop/conf/.

    • Insert the following properties:

      yarn.nodemanager.linux-container-executor.group=hadoop
      banned.users=hdfs,yarn,mapred
      min.user.id=1000

    • Set the /etc/hadoop/conf/container-executor.cfg file permissions so that the file is readable only by root:

      chown root:hadoop /etc/hadoop/conf/container-executor.cfg
      chmod 400 /etc/hadoop/conf/container-executor.cfg

    • Set the container-executor program so that only root or hadoop group users can execute it:

      chown root:hadoop /usr/phd/${phd.version}/hadoop-yarn/bin/container-executor
      chmod 6050 /usr/phd/${phd.version}/hadoop-yarn/bin/container-executor
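
      As a quick sanity check, you can list the binary and confirm the ownership and mode you just set; with mode 6050 the permission string should read ---Sr-s--- and the owner and group should be root:hadoop:

      ls -l /usr/phd/${phd.version}/hadoop-yarn/bin/container-executor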

Start YARN

  • Log in as $YARN_USER, then execute this command on the ResourceManager server:

    su -l yarn -c "/usr/phd/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager"

  • Log in as $YARN_USER, then execute this command on all NodeManager nodes:

    su -l yarn -c "/usr/phd/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager"

    where: $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

Start MapReduce JobHistory Server

  • Change permissions on the container-executor file.

    chown -R root:hadoop /usr/phd/current/hadoop-yarn/bin/container-executor
    chmod -R 6050 /usr/phd/current/hadoop-yarn/bin/container-executor

  • Execute these commands from the JobHistory server to set up directories on HDFS:

    su $HDFS_USER
    hdfs dfs -mkdir -p /mr-history/tmp
    hdfs dfs -chmod -R 1777 /mr-history/tmp
    hdfs dfs -mkdir -p /mr-history/done
    hdfs dfs -chmod -R 1777 /mr-history/done
    hdfs dfs -chown -R $MAPRED_USER:$HDFS_USER /mr-history
    hdfs dfs -mkdir -p /app-logs
    hdfs dfs -chmod -R 1777 /app-logs
    hdfs dfs -chown $YARN_USER /app-logs

  • Execute these commands from the JobHistory server:

    su -l yarn -c "/usr/phd/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver"

    • $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

Smoke Test MapReduce

  • Try browsing to the ResourceManager:

    http://$resourcemanager.full.hostname:8088/

  • Create a $CLIENT_USER on all of the cluster nodes and add it to the users group.

    useradd client
    usermod -a -G users client

  • As the HDFS user, create a /user/$CLIENT_USER directory.

    sudo su $HDFS_USER
    hdfs dfs -mkdir /user/$CLIENT_USER
    hdfs dfs -chown $CLIENT_USER:$CLIENT_USER /user/$CLIENT_USER
    hdfs dfs -chmod -R 755 /user/$CLIENT_USER

  • Run the smoke test as the $CLIENT_USER: use teragen to generate sample data, then sort it with terasort.

    su $CLIENT_USER
    /usr/phd/current/hadoop-client/bin/hadoop jar /usr/phd/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 10000 tmp/teragenout
    /usr/phd/current/hadoop-client/bin/hadoop jar /usr/phd/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort tmp/teragenout tmp/terasortout
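
    Optionally, you can run teravalidate from the same examples JAR to confirm that the terasort output is correctly sorted; the paths follow the commands above:

    /usr/phd/current/hadoop-client/bin/hadoop jar /usr/phd/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate tmp/terasortout tmp/teravalidateout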

Installing HBase

This section describes installing and testing Apache HBase, a distributed, column-oriented database that provides the ability to randomly access and manipulate data stored in the large blocks that make up HDFS.

Install the HBase RPMs

    In a terminal window, type:
  • For RHEL/CentOS Linux

    yum install hbase

  • For SLES

    zypper install hbase

Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below.

    If any of these directories already exist, we recommend deleting and recreating them. Use the following instructions to create appropriate directories:
  • We strongly suggest that you edit and source the bash script files included with the companion files (downloaded in Download Companion Files).

    Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables in your environment.

  • Execute the following commands on all nodes:

    mkdir -p $HBASE_LOG_DIR
    chown -R $HBASE_USER:$HADOOP_GROUP $HBASE_LOG_DIR
    chmod -R 755 $HBASE_LOG_DIR

    mkdir -p $HBASE_PID_DIR
    chown -R $HBASE_USER:$HADOOP_GROUP $HBASE_PID_DIR
    chmod -R 755 $HBASE_PID_DIR

    where:

    • $HBASE_LOG_DIR is the directory to store the HBase logs. For example, /var/log/hbase.

    • $HBASE_PID_DIR is the directory to store the HBase process ID. For example, /var/run/hbase.

    • $HBASE_USER is the user owning the HBase services. For example, hbase.

    • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Set Up the Configuration Files

There are several configuration files that need to be set up for HBase and ZooKeeper.

  • Extract the HBase configuration files to a temporary directory.

    The files are located in the configuration_files/hbase directory where you decompressed the companion files.

  • Modify the configuration files.

    In the respective temporary directories, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace.

    • Edit zoo.cfg and modify the following properties:

      dataDir=$zk.data.directory.path

      server.1=$zk.server1.full.hostname:2888:3888

      server.2=$zk.server2.full.hostname:2888:3888

      server.3=$zk.server3.full.hostname:2888:3888

    • Edit hbase-site.xml and modify the following properties:

      <property>
          <name>hbase.rootdir</name>
          <value>hdfs://$hbase.namenode.full.hostname:8020/apps/hbase/data</value>
          <description>Enter the HBase NameNode server hostname</description>
      </property>

      <property>
          <name>hbase.zookeeper.quorum</name>
          <value>$zk.server1.full.hostname,$zk.server2.full.hostname,$zk.server3.full.hostname</value>
          <description>Comma-separated list of ZooKeeper servers (match to what is specified in zoo.cfg but without port numbers)</description>
      </property>

    • Edit the regionservers file and list all the RegionServer hostnames in your environment, one per line. For example, a regionservers file for hosts RegionServer1 through RegionServer9:

      RegionServer1
      RegionServer2
      RegionServer3
      RegionServer4
      RegionServer5
      RegionServer6
      RegionServer7
      RegionServer8
      RegionServer9

  • Copy the configuration files

    • On all hosts create the config directory:

      rm -r $HBASE_CONF_DIR
      mkdir -p $HBASE_CONF_DIR

      rm -r $ZOOKEEPER_CONF_DIR
      mkdir -p $ZOOKEEPER_CONF_DIR

    • Copy all the HBase configuration files to the $HBASE_CONF_DIR.

    • Set appropriate permissions:

      chmod a+x $HBASE_CONF_DIR/
      chown -R $HBASE_USER:$HADOOP_GROUP $HBASE_CONF_DIR/../
      chmod -R 755 $HBASE_CONF_DIR/../

        where:
      • $HBASE_CONF_DIR is the directory to store the HBase configuration files. For example, /etc/hbase/conf.

      • $HBASE_USER is the user owning the HBase services. For example, hbase.

Validate the Installation

Use these steps to validate your installation.

  • Start HBase.

    • Execute this command from the HBase Master node:

      su -l hbase -c "/usr/phd/current/hbase-master/bin/hbase-daemon.sh start master; sleep 25"

    • Execute this command from each HBase Region Server node:

      su -l hbase -c "/usr/phd/current/hbase-regionserver/bin/hbase-daemon.sh start regionserver"

  • Smoke Test HBase.

    From a terminal window, enter:

    su - $HBASE_USER
    hbase shell

    In the HBase shell, enter the following command:

    status
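
    If status reports the expected number of servers, you can optionally go one step further and create, write to, and scan a small table from the same shell; the table and column family names below are only examples:

    create 't1', 'cf1'
    put 't1', 'row1', 'cf1:c1', 'value1'
    scan 't1'
    disable 't1'
    drop 't1'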

Starting the HBase Thrift and REST APIs

Administrators must manually start the Thrift and REST APIs for HBase.

Starting the HBase Thrift API

Run the following command to start the HBase Thrift API:

/usr/bin/hbase thrift start

Starting the HBase REST API

Run the following command to start the HBase REST API:

/usr/phd/current/hbase/bin/hbase-daemon.sh start rest --infoport 8085

Installing and Configuring Apache Tez

Apache Tez is an extensible YARN framework that can be used to build high-performance batch and interactive data processing applications. Tez dramatically improves MapReduce processing speed while maintaining its ability to scale to petabytes of data. Tez can also be used by other Hadoop ecosystem components such as Apache Hive and Apache Pig to dramatically improve query performance.

Prerequisites

Verify that your cluster meets the following prerequisites before installing Tez:

  • PHD 3.0 or higher

Hadoop administrators can also install Tez using Ambari, which may reduce installation time by automating the installation across all cluster nodes.

Install the Tez RPM

    On all client/gateway nodes:
  • Install the Tez RPMs on all client/gateway nodes:

    • For RHEL/CentOS Linux:

      yum install tez
    • For SLES:

      zypper install tez
  • Execute the following commands from any one of the cluster client nodes to upload the Tez tarball into HDFS:

    su $HDFS_USER
    hdfs dfs -mkdir -p /phd/apps/<phd_version>/tez/
    hdfs dfs -put /usr/phd/<phd_version>/tez/lib/tez.tar.gz /phd/apps/<phd_version>/tez/
    hdfs dfs -chown -R hdfs:hadoop /phd
    hdfs dfs -chmod -R 555 /phd/apps/<phd_version>/tez
    hdfs dfs -chmod -R 444 /phd/apps/<phd_version>/tez/tez.tar.gz

    Where:

    $HDFS_USER is the user that owns the HDFS service. For example, hdfs. <phd_version> is the current PHD version, for example 3.0.0.0-249.

  • Execute the following command to verify that the files were copied with the preceding step:

    su $HDFS_USER
    hdfs dfs -ls /phd/apps/<phd_version>/tez

    This should return something similar to this:

    Found 1 items
    -r--r--r--   3 hdfs hadoop   36518223 2015-02-12 15:35 /phd/apps/3.0.0.0-249/tez/tez.tar.gz

Configure Tez

    Perform the following steps to configure Tez for your Hadoop cluster:
  • Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory. A sample tez-site.xml file is included in the configuration_files/tez folder in the PHD companion files.

  • Create the $TEZ_CONF_DIR environment variable and set it to the location of the tez-site.xml file.

    export TEZ_CONF_DIR=/etc/tez/conf
  • Create the $TEZ_JARS environment variable and set it to the location of the Tez .jar files and their dependencies.

    export TEZ_JARS=/usr/phd/current/tez-client/*:/usr/phd/current/tez-client/lib/*
  • In the tez-site.xml file, configure the tez.lib.uris property with the HDFS path containing the Tez tarball file.

    ...
    <property>
      <name>tez.lib.uris</name>
      <value>/phd/apps/<phd_version>/tez/tez.tar.gz</value>
    </property>
    ...

    Where <phd_version> is the current PHD version, for example 3.0.0.0-249.

  • Add $TEZ_CONF_DIR and $TEZ_JARS to the $HADOOP_CLASSPATH environment variable.

    export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
Tez Configuration Parameters

Each entry below lists the configuration parameter, its description, and its default value.

  • tez.lib.uris
    Description: Comma-delimited list of the locations of the Tez libraries that will be localized for DAGs. Specifying a single .tar.gz or .tgz assumes that a compressed version of the Tez libs is being used. This is uncompressed into a tezlibs directory when running containers, and tezlibs/ and tezlibs/lib/ are added to the classpath (after . and .*). If multiple files are specified, files are localized as regular files, and the contents of directories are localized as regular files (non-recursive).
    Default: /phd/apps/<phd_version>/tez/tez.tar.gz

  • tez.use.cluster.hadoop-libs
    Description: Specifies whether Tez will use the cluster Hadoop libraries. This property should not be set in tez-site.xml, or if it is set, the value should be false.
    Default: false

  • tez.cluster.additional.classpath.prefix
    Description: Specifies additional classpath information to be used for the Tez AM and all containers. This is prepended to the classpath before all framework-specific components have been specified.
    Default: /usr/phd/${phd.version}/hadoop/lib/hadoop-lzo-0.6.0.${phd.version}.jar:/etc/hadoop/conf/secure

  • tez.am.log.level
    Description: Root logging level passed to the Tez Application Master.
    Default: INFO

  • tez.generate.debug.artifacts
    Description: Generate debug artifacts such as a text representation of the submitted DAG plan.
    Default: false

  • tez.staging-dir
    Description: The staging directory used while submitting DAGs.
    Default: /tmp/${user.name}/staging

  • tez.am.resource.memory.mb
    Description: The amount of memory to be used by the AppMaster. Used only if the value is not specified explicitly by the DAG definition.
    Default: TODO-CALCULATE-MEMORY-SETTINGS (placeholder for a calculated value). Example value: 1536

  • tez.am.launch.cluster-default.cmd-opts
    Description: Cluster default Java options for the Tez AppMaster process. These are prepended to the properties specified via tez.am.launch.cmd-opts.
    Default: -server -Djava.net.preferIPv4Stack=true -Dphd.version=${phd.version}

  • tez.task.resource.memory.mb
    Description: The amount of memory to be used by launched tasks. Used only if the value is not specified explicitly by the DAG definition.
    Default: 1024

  • tez.task.launch.cluster-default.cmd-opts
    Description: Cluster default Java options for tasks. These are prepended to the properties specified via tez.task.launch.cmd-opts.
    Default: -server -Djava.net.preferIPv4Stack=true -Dphd.version=${phd.version}

  • tez.task.launch.cmd-opts
    Description: Java options for tasks. The Xmx value is derived from tez.task.resource.memory.mb and is 80% of that value by default. Used only if the value is not specified explicitly by the DAG definition.
    Default: -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC

  • tez.task.launch.env
    Description: Additional execution environment entries for Tez. This is not an additive property; you must preserve the original value if you want to have access to native libraries. Used only if the value is not specified explicitly by the DAG definition.
    Default: LD_LIBRARY_PATH=/usr/phd/${phd.version}/hadoop/lib/native:/usr/phd/${phd.version}/hadoop/lib/native/Linux-amd64-64

  • tez.am.grouping.max-size
    Description: Specifies the upper bound of the size of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This prevents input tasks from being too large, which prevents their parallelism from being too small.
    Default: 1073741824

  • tez.shuffle-vertex-manager.min-src-fraction
    Description: In the case of a ScatterGather connection, the fraction of source tasks that should complete before tasks for the current vertex are scheduled.
    Default: 0.2

  • tez.shuffle-vertex-manager.max-src-fraction
    Description: In the case of a ScatterGather connection, once this fraction of source tasks has completed, all tasks on the current vertex can be scheduled. The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction.
    Default: 0.4

  • tez.am.am-rm.heartbeat.interval-ms.max
    Description: The maximum heartbeat interval between the AM and RM, in milliseconds.
    Default: 250

  • tez.grouping.split-waves
    Description: The multiplier for available queue capacity when determining the number of tasks for a vertex. A value of 1.7 with 100% of the queue available implies generating a number of tasks roughly equal to 170% of the available containers on the queue.
    Default: 1.7

  • tez.grouping.min-size
    Description: Lower bound on the size (in bytes) of a grouped split, to avoid generating too many splits.
    Default: 16777216

  • tez.grouping.max-size
    Description: Upper bound on the size (in bytes) of a grouped split, to avoid generating excessively large splits.
    Default: 1073741824

  • tez.am.container.reuse.enabled
    Description: Specifies whether containers should be reused.
    Default: true

  • tez.am.container.reuse.rack-fallback.enabled
    Description: Whether to reuse containers for rack-local tasks. Active only if reuse is enabled.
    Default: true

  • tez.am.container.reuse.non-local-fallback.enabled
    Description: Whether to reuse containers for non-local tasks. Active only if reuse is enabled.
    Default: false

  • tez.am.container.idle.release-timeout-min.millis
    Description: The minimum amount of time to hold on to a container that is idle. Only active when reuse is enabled.
    Default: 10000

  • tez.am.container.idle.release-timeout-max.millis
    Description: The maximum amount of time to hold on to a container if no task can be assigned to it immediately. Only active when reuse is enabled.
    Default: 20000

  • tez.am.container.reuse.locality.delay-allocation-millis
    Description: The amount of time to wait before assigning a container to the next level of locality (NODE -> RACK -> NON_LOCAL).
    Default: 250

  • tez.am.max.app.attempts
    Description: Specifies the total number of times the app master will run if recovery is triggered.
    Default: 2

  • tez.am.maxtaskfailures.per.node
    Description: The maximum number of allowed task attempt failures on a node before it gets marked as blacklisted.
    Default: 10

  • tez.task.am.heartbeat.counter.interval-ms.max
    Description: Time interval at which task counters are sent to the AM.
    Default: 4000

  • tez.task.get-task.sleep.interval-ms.max
    Description: The maximum amount of time, in milliseconds, to wait before a task asks the AM for another task.
    Default: 200

  • tez.task.max-events-per-heartbeat
    Description: Maximum number of events to fetch from the AM by the tasks in a single heartbeat.
    Default: 500

  • tez.session.client.timeout.secs
    Description: Time (in seconds) to wait for the AM to come up when trying to submit a DAG from the client.
    Default: -1

  • tez.session.am.dag.submit.timeout.secs
    Description: Time (in seconds) for which the Tez AM should wait for a DAG to be submitted before shutting down.
    Default: 300

  • tez.counters.max
    Description: The number of allowed counters for the executing DAG.
    Default: 2000

  • tez.counters.max.groups
    Description: The number of allowed counter groups for the executing DAG.
    Default: 1000

  • tez.runtime.compress
    Description: Whether intermediate data should be compressed.
    Default: true

  • tez.runtime.compress.codec
    Description: The codec to be used for compressing intermediate data. Only applicable if tez.runtime.compress is enabled.
    Default: org.apache.hadoop.io.compress.SnappyCodec

  • tez.runtime.io.sort.mb
    Description: The size of the sort buffer when output needs to be sorted.
    Default: 512

  • tez.runtime.unordered.output.buffer.size-mb
    Description: The size of the buffer when output does not need to be sorted.
    Default: 100

  • tez.history.logging.service.class
    Description: The class to be used for logging history data. Set to org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService to log to ATS. Set to org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService to log to the filesystem specified by ${fs.defaultFS}.
    Default: org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService

Validate the Tez Installation

Use the following procedure to run an example Tez application, such as OrderedWordCount, and validate your Tez installation.

  • Create a sample test.txt file:

    foo
    bar
    foo
    bar
    foo
  • Log in as the $HDFS_USER. The $HDFS_USER is the user that owns the HDFS service. For example, hdfs:

    su $HDFS_USER
  • Create a /tmp/input/ directory in HDFS and copy the test.txt file into that directory:

    hdfs dfs -mkdir -p /tmp/input/
    hdfs dfs -put test.txt /tmp/input/
  • Execute the following command to run the OrderedWordCount application using Tez:

    hadoop jar /usr/phd/current/tez-client/tez-examples-*.jar orderedwordcount /tmp/input/test.txt /tmp/out
  • Run the following command to verify the word count:

    hdfs dfs -cat '/tmp/out/*'

    This should return:

    foo 3
    bar 2

Troubleshooting

View the Tez logs to help troubleshoot problems with your installation and configuration. Tez logs are accessible through the YARN CLI using the yarn logs command.

yarn logs -applicationId <APPLICATION_ID> [OPTIONS]

You can find the application ID at the end of the output of a running application, as shown in the following output from the OrderedWordCount application.


14/02/26 14:45:33 INFO examples.OrderedWordCount: DAG 1 completed. FinalState=SUCCEEDED
14/02/26 14:45:33 INFO client.TezSession: Shutting down Tez Session, sessionName=OrderedWordCountSession, applicationId=application_1393449467938_0001
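
For example, to retrieve the logs for the application shown above:

yarn logs -applicationId application_1393449467938_0001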

Installing Apache Hive and Apache HCatalog

This section describes installing and testing Apache Hive, a tool for creating higher-level SQL-like queries using HiveQL, the tool’s native language, which can be compiled into sequences of MapReduce programs. It also describes installing and testing Apache HCatalog, a metadata abstraction layer that insulates users and scripts from how and where data is physically stored.

Install the Hive-HCatalog RPM

  • On all client/gateway nodes (on which Hive programs will be executed), the Hive Metastore Server machine, and the HiveServer2 machine, install the Hive RPMs.

    • For RHEL/CentOS Linux:

      yum install hive-hcatalog

    • For SLES:

      zypper install hive-hcatalog

  • (Optional) Download and install the database connector .jar file for your Hive metastore database.

    By default, Hive uses an embedded Derby database for its metastore. However, you can choose to enable a local or remote database for the Hive metastore. Hive supports Derby, MySQL, Oracle, SQL Server, and Postgres.

    You will need to install the appropriate JDBC connector for your Hive metastore database. Pivotal recommends using an embedded instance of the Hive Metastore with HiveServer2. An embedded metastore runs in the same process with HiveServer2 rather than as a separate daemon.

      For example, if you previously installed MySQL, you would use the following steps to install the MySQL JDBC connector:
    • Execute the following command on the Hive metastore machine.

      For RHEL/CentOS Linux:

      yum install mysql-connector-java*

      For SLES:

      zypper install mysql-connector-java*

    • After the install, the MySQL connector .jar file is placed in the /usr/share/java/ directory. Copy the downloaded .jar file to the /usr/phd/current/hive/lib/ directory on your Hive host machine.

    • Verify that the .jar file has appropriate permissions.

Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below.

    If any of these directories already exist, we recommend deleting and recreating them. Use the following instructions to set up the Hive directories:
  • We strongly suggest that you edit and source the bash script files included in the companion files.

    Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables in your environment.

  • Execute these commands on the Hive server machine:

    mkdir -p $HIVE_LOG_DIR ;
    chown -R $HIVE_USER:$HADOOP_GROUP $HIVE_LOG_DIR;
    chmod -R 755 $HIVE_LOG_DIR;
    

    where:

    • $HIVE_LOG_DIR is the directory for storing the Hive Server logs.

    • $HIVE_USER is the user owning the Hive services. For example, hive.

    • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

    This directory name is a combination of a base directory and the $HIVE_USER name; for example, /var/log/hive.

Set Up the Hive/HCatalog Configuration Files

    Use the following instructions to set up the Hive/HCatalog configuration files:
  • If you have not already done so, download and extract the PHD companion files.

    A sample hive-site.xml file is included in the configuration_files/hive folder in the PHD companion files.

  • Modify the configuration files.

    In the configuration_files/hive directory, edit the hive-site.xml file and modify the properties based on your environment. Search for TODO in the files for the properties to replace.

    • Edit the connection properties for your Hive metastore database in hive-site.xml:

      
      <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://TODO-HIVE-METASTORE-DB-SERVER:TODO-HIVE-METASTORE-DB-PORT/TODO-HIVE-METASTORE-DB-NAME?createDatabaseIfNotExist=true</value>
          <description>Enter your Hive Metastore Connection URL, for example if MySQL: jdbc:mysql://localhost:3306/mysql?createDatabaseIfNotExist=true</description>
      </property>
      
      <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>TODO-HIVE-METASTORE-DB-USER-NAME</value>
          <description>Enter your Hive Metastore database user name.</description>
      </property>
      
      <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>TODO-HIVE-METASTORE-DB-PASSWORD</value>
          <description>Enter your Hive Metastore database password.</description>
      </property>
      
      <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>TODO-HIVE-METASTORE-DB-CONNECTION-DRIVER-NAME</value>
          <description>Enter your Hive Metastore Connection Driver Name, for example if MySQL: com.mysql.jdbc.Driver</description>
      </property>

(Optional) If you want storage-based authorization for Hive, set the following Hive authorization parameters in the hive-site.xml file:


<property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
</property>

<property>
     <name>hive.security.authorization.manager</name>
     <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>

<property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>

<property>
    <name>hive.security.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator</value>
</property> 

Hive also supports SQL standard authorization. See Hive Authorization for more information about Hive authorization models.

For a remote Hive metastore database, use the following hive-site.xml property value to set the IP address (or fully-qualified domain name) and port of the metastore host.


<property>
    <name>hive.metastore.uris</name>
    <value>thrift://$metastore.server.full.hostname:9083</value>
    <description>URI for client to contact metastore server. To enable HiveServer2, leave the property value empty.     
    </description>
</property>

To enable HiveServer2 for remote Hive clients, assign a value of a single empty space to this property. Pivotal recommends using an embedded instance of the Hive Metastore with HiveServer2. An embedded metastore runs in the same process with HiveServer2 rather than as a separate daemon. You can also configure HiveServer2 to use an embedded metastore instance from the command line:

hive --service hiveserver2 -hiveconf hive.metastore.uris=" "

(Optional) By default, Hive ensures that column names are unique in query results returned for SELECT statements by prepending column names with a table alias. Administrators who do not want a table alias prefix to table column names can disable this behavior by setting the following configuration property:

<property>
    <name>hive.resultset.use.unique.column.names</name>
    <value>false</value>
</property>

PHD-Utility script

You can also use the PHD utility script to fine-tune memory configuration settings based on node hardware specifications.

  • Copy the configuration files.

    • On all Hive hosts create the Hive configuration directory:

      rm -r $HIVE_CONF_DIR ; mkdir -p $HIVE_CONF_DIR;

    • Copy all the configuration files to the $HIVE_CONF_DIR directory.

    • Set appropriate permissions:

      
      chown -R $HIVE_USER:$HADOOP_GROUP $HIVE_CONF_DIR/../ ;
      chmod -R 755 $HIVE_CONF_DIR/../ ;

      where:

      • $HIVE_CONF_DIR is the directory to store the Hive configuration files. For example, /etc/hive/conf.

      • $HIVE_USER is the user owning the Hive services. For example, hive.

      • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Configure Hive and HiveServer2 for Tez

The hive-site.xml file in the PHD companion files includes the settings for Hive and HiveServer2 for Tez.

If you have already configured the hive-site.xml connection properties for your Hive metastore database, the only remaining task is to adjust the hive.tez.container.size and hive.tez.java.opts values as described in the following section. You can also use the PHD utility script to calculate these Tez memory configuration settings.

Hive-on-Tez Configuration Parameters

Apart from the configurations generally recommended for Hive and HiveServer2 and included in the hive-site.xml file in the PHD companion files, for a multi-tenant use case, only the following configurations are required in the hive-site.xml configuration file to configure Hive for use with Tez.

Hive-Related Configuration Parameters

Each entry below lists the configuration parameter, its description, and its default value.

  • hive.execution.engine
    Description: This setting determines whether Hive queries are executed using Tez or MapReduce.
    Default: If this value is set to "mr", Hive queries are executed using MapReduce. If this value is set to "tez", Hive queries are executed using Tez. All queries executed through HiveServer2 use the specified hive.execution.engine setting.

  • hive.tez.container.size
    Description: The memory (in MB) to be used for Tez tasks.
    Default: -1 (not specified). If this is not specified, the memory settings from the MapReduce configurations (mapreduce.map.memory.mb) are used by default for map tasks.

  • hive.tez.java.opts
    Description: Java command line options for Tez.
    Default: If this is not specified, the MapReduce java opts settings (mapreduce.map.java.opts) are used by default.

  • hive.server2.tez.default.queues
    Description: A comma-separated list of queues configured for the cluster.
    Default: An empty string, which prevents execution of all queries. To enable query execution with Tez for HiveServer2, this parameter must be configured.

  • hive.server2.tez.sessions.per.default.queue
    Description: The number of sessions for each queue named in hive.server2.tez.default.queues.
    Default: 1. Larger clusters may improve performance of HiveServer2 by increasing this number.

  • hive.server2.tez.initialize.default.sessions
    Description: Enables a user to use HiveServer2 without enabling Tez for HiveServer2. Users may potentially want to run queries with Tez without a pool of sessions.
    Default: false

  • hive.server2.enable.doAs
    Description: Required when the queue-related configurations above are used.
    Default: false

Examples of Hive-Related Configuration Properties:

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>

<property>
    <name>hive.tez.container.size</name>
    <value>-1</value>
    <description>Memory in mb to be used for Tez tasks. If this is not specified (-1) then the memory settings for map tasks will be used from mapreduce configuration</description>
</property>

<property>
    <name>hive.tez.java.opts</name>
    <value></value>
    <description>Java opts to be specified for Tez tasks. If this is not specified then java opts for map tasks will be used from mapreduce configuration</description>
</property>

<property>
    <name>hive.server2.tez.default.queues</name>
    <value>default</value>
</property>

<property>
    <name>hive.server2.tez.sessions.per.default.queue</name>
    <value>1</value>
</property>

<property>
    <name>hive.server2.tez.initialize.default.sessions</name>
    <value>false</value>
</property>

<property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
</property>

Using Hive-on-Tez with Capacity Scheduler

You can use the tez.queue.name property to specify which queue is used for Hive-on-Tez jobs. You can also set this property in the Hive shell or in a Hive script. For more details, see Configuring Tez with the Capacity Scheduler.

Set up OracleDB for use with Hive Metastore

To use OracleDB as the Hive Metastore database, you must have already installed PHD and Hive.

    To set up Oracle for use with Hive:
  • On the Hive Metastore host, install the appropriate JDBC .jar file.

  • Create a user for Hive and grant it permissions. Using the Oracle database admin utility:

    
    # sqlplus sys/root as sysdba
    CREATE USER $HIVEUSER IDENTIFIED BY $HIVEPASSWORD;
    GRANT SELECT_CATALOG_ROLE TO $HIVEUSER;
    GRANT CONNECT, RESOURCE TO $HIVEUSER;
    QUIT;

    Where $HIVEUSER is the Hive user name and $HIVEPASSWORD is the Hive user password.

  • Load the Hive database schema.

    sqlplus $HIVEUSER/$HIVEPASSWORD < hive-schema-0.14.0.oracle.sql
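
    The hive-schema-0.14.0.oracle.sql script ships with the Hive packages. The directory below is an assumption based on the default install layout, so adjust it for your environment:

    cd /usr/phd/current/hive/scripts/metastore/upgrade/oracle/
    sqlplus $HIVEUSER/$HIVEPASSWORD < hive-schema-0.14.0.oracle.sql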

Create Directories on HDFS

  • Create the Hive user home directory on HDFS.

    Log in as $HDFS_USER:

    hdfs dfs -mkdir -p /user/$HIVE_USER
    hdfs dfs -chown $HIVE_USER:$HDFS_USER /user/$HIVE_USER

  • Create the warehouse directory on HDFS.

    Log in as $HDFS_USER:

    hdfs dfs -mkdir -p /apps/hive/warehouse
    hdfs dfs -chown -R $HIVE_USER:$HDFS_USER /apps/hive
    hdfs dfs -chmod -R 775 /apps/hive

    where:

    • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

    • $HIVE_USER is the user owning the Hive services. For example, hive.

  • Create the Hive scratch directory on HDFS.

    Log in as $HDFS_USER:

    hdfs dfs -mkdir -p /tmp/scratch
    hdfs dfs -chown -R $HIVE_USER:$HDFS_USER /tmp/scratch
    hdfs dfs -chmod -R 777 /tmp/scratch

    where:

    • $HDFS_USER is the user owning the HDFS services. For example, hdfs.

    • $HIVE_USER is the user owning the Hive services. For example, hive.

Validate the Installation

    Use the following steps to validate your installation:
  • Initialize the Hive Metastore database schema.

    $HIVE_HOME/bin/schematool -initSchema -dbType $databaseType

    The value for $databaseType can be derby, mysql, oracle, mssql, or postgres.

    $HIVE_HOME is by default configured to /usr/phd/current/hive.

  • Turn off autocreation of schemas. Edit hive-site.xml to set the value of datanucleus.autoCreateSchema to false:

    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
        <description>Creates necessary schema on a startup if one doesn't exist</description>
    </property>

  • Start the Hive Metastore service.

    su - hive
    nohup /usr/phd/current/hive-metastore/bin/hive --service metastore >/var/log/hive/hive.out 2>/var/log/hive/hive.log &

  • Smoke Test Hive.

    • Open Hive command line shell.

      hive

    • Run sample commands.

      show databases;
      create table test(col1 int, col2 string);
      show tables;

  • Start HiveServer2:

      su - hive
      /usr/phd/current/hive-server2/bin/hiveserver2 >/var/log/hive/hiveserver2.out 2>/var/log/hive/hiveserver2.log &

  • Smoke test HiveServer2.

    • Open Beeline command line shell to interact with HiveServer2.

      /usr/phd/current/hive/bin/beeline

    • Establish connection to server.

      !connect jdbc:hive2://$hive.server.full.hostname:10000 $HIVE_USER password org.apache.hive.jdbc.HiveDriver

    • Run sample commands.

      show databases;
      create table test2(a int, b string);
      show tables;

Enable Tez for Hive Queries

Limitations

    This release of Tez does not support the following actions:
  • SMB joins

  • SELECT TRANSFORM queries

  • Index creation

  • Skew joins

    Use the following instructions to enable Tez for Hive Queries:
  • Copy the hive-exec-0.13.0.jar to HDFS at the following location: /apps/hive/install/hive-exec-0.13.0.jar.

    su - $HIVE_USER
    hadoop fs -mkdir /apps/hive/install
    hadoop fs -copyFromLocal /usr/phd/current/hive/lib/hive-exec-* /apps/hive/install/hive-exec-0.13.0.jar

  • Enable Hive to use Tez DAG APIs. On the Hive client machine, add the following to your Hive script or execute it in the Hive shell:

    set hive.execution.engine=tez;
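
    For example, a short interactive session that switches the current session to Tez and runs a query against an existing table might look like the following; the table name is only illustrative:

    hive
    hive> set hive.execution.engine=tez;
    hive> SELECT COUNT(*) FROM mytable;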

Disabling Tez for Hive Queries

Use the following instructions to disable Tez for Hive queries:

On the Hive client machine, add the following to your Hive script or execute it in the Hive shell:

set hive.execution.engine=mr;

Tez will then be disabled for Hive queries.

Configuring Tez with the Capacity Scheduler

You can use the tez.queue.name property to specify which queue will be used for Tez jobs. Currently the Capacity Scheduler is the default Scheduler in PHD. In general, this is not limited to the Capacity Scheduler, but applies to any YARN queue.

If no queues have been configured, the default queue will be used, which means that 100% of the cluster capacity will be used when running Tez jobs. If queues have been configured, a queue name must be configured for each YARN application.

Setting tez.queue.name in tez-site.xml would apply to Tez applications that use that configuration file. To assign separate queues for each application, you would need separate tez-site.xml files, or you could have the application pass this configuration to Tez while submitting the Tez DAG.

For example, in Hive you would use the tez.queue.name property in hive-site.xml to specify the queue to be used for Hive-on-Tez jobs. To assign Hive-on-Tez jobs to use the "engineering" queue, you would add the following property to hive-site.xml:

<property>
    <name>tez.queue.name</name>
    <value>engineering</value>
</property>

Setting this configuration property in hive-site.xml will affect all Hive queries that read that configuration file.

To assign Hive-on-Tez jobs to use the "engineering" queue in a Hive query, you would use the following commands in the Hive shell or in a Hive script:

bin/hive --hiveconf tez.queue.name=engineering
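
Alternatively, to set the queue only for the current session, from within the Hive shell or a Hive script:

set tez.queue.name=engineering;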

Validate Hive-on-Tez Installation

    Use the following procedure to validate your configuration of Hive-on-Tez:
  • Create a sample student.txt file:

    echo -e "alice miller\t49\t3.15" > student.txt

  • Upload the new data file to HDFS:

    su $HDFS_USER
    hadoop fs -mkdir -p /user/test/student
    hadoop fs -copyFromLocal student.txt /user/test/student

  • Open the Hive command-line shell:

    su $HDFS_USER
    hive

  • Create a table named student in Hive:

    hive> CREATE EXTERNAL TABLE student(name string, age int, gpa double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/test/student';

  • Execute the following query in Hive:

    hive> SELECT COUNT(*) FROM student;

    If Hive-on-Tez is configured properly, this query should return successful results:

    hive> SELECT COUNT(*) FROM student;
    Query ID = hdfs_20140604161313_544c4455-dfb3-4119-8b08-b70b46fee512
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1401734196960_0007, Tracking URL = http://c6401.ambari.apache.org:8088/proxy/application_1401734196960_0007/
    Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1401734196960_0007
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2014-06-04 16:13:24,116 Stage-1 map = 0%,  reduce = 0%
    2014-06-04 16:13:30,670 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec
    2014-06-04 16:13:39,065 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.97 sec
    MapReduce Total cumulative CPU time: 1 seconds 970 msec
    Ended Job = job_1401734196960_0007
    MapReduce Jobs Launched:
    Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.97 sec   HDFS Read: 240 HDFS Write: 2 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 970 msec
    OK
    1
    Time taken: 28.47 seconds, Fetched: 1 row(s)
    hive> 

Installing Apache Pig

This section describes installing and testing Apache Pig, a platform for creating higher level data flow programs that can be compiled into sequences of MapReduce programs, using Pig Latin, the platform’s native language.

Install the Pig RPMs

On all the hosts where you will execute Pig programs, install the RPMs.

  • For RHEL or CentOS:

    yum install pig

  • For SLES:

    zypper install pig

The RPM will install Pig libraries to /usr/phd/current/pig. Pig configuration files are placed in /usr/phd/current/pig/conf.

Set Up Configuration Files

    Use the following instructions to set up configuration files for Pig:
  • Extract the Pig configuration files.

    From the downloaded scripts.zip file, extract the files from the configuration_files/pig directory to a temporary directory.

  • Copy the configuration files.

    • On all hosts where Pig will be executed, create the Pig configuration directory:

      rm -r $PIG_CONF_DIR
      mkdir -p $PIG_CONF_DIR

    • Copy all the configuration files to $PIG_CONF_DIR.

    • Set appropriate permissions:

      chmod -R 755 $PIG_CONF_DIR

      where $PIG_CONF_DIR is the directory to store Pig configuration files. For example, /etc/pig/conf.

Validate the Installation

    Use the following steps to validate your installation:
  • On the host machine where Pig is installed execute the following commands:

    su $HDFS_USER
    /usr/phd/current/hadoop/bin/hadoop fs -copyFromLocal /etc/passwd passwd

  • Create the pig script file /tmp/id.pig with the following contents:

    A = load 'passwd' using PigStorage(':');
    B = foreach A generate $0 as id;
    store B into '/tmp/id.out';

  • Execute the Pig script:

    su $HDFS_USER
    pig -l /tmp/pig.log /tmp/id.pig
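
    To confirm the script wrote its output, you can list and sample the result directory as the same user; the part-file names depend on the job, so the glob below is just a convenience:

    hdfs dfs -ls /tmp/id.out
    hdfs dfs -cat '/tmp/id.out/part-*' | head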

Installing WebHCat

This section describes installing and testing WebHCat, which provides a REST interface to Apache HCatalog services like job submission and eventing.

Install the WebHCat RPMs

On the WebHCat server machine, install the necessary RPMs.

  • For RHEL/CentOS Linux:

    yum -y install hive-hcatalog hive-webhcat
    yum -y install webhcat-tar-hive webhcat-tar-pig

  • For SLES:

    zypper install hcatalog webhcat-tar-hive webhcat-tar-pig

Upload the Pig and Hive tarballs to HDFS

Upload the Pig and Hive tarballs to HDFS as the $HDFS_USER. In this example, hdfs:

hdfs dfs -mkdir -p /phd/apps/3.0.0.0-<$version>/pig/
hdfs dfs -mkdir -p /phd/apps/3.0.0.0-<$version>/hive/
hdfs dfs -put /usr/phd/3.0.0.0-<$version>/pig/pig.tar.gz /phd/apps/3.0.0.0-<$version>/pig/
hdfs dfs -put /usr/phd/3.0.0.0-<$version>/hive/hive.tar.gz /phd/apps/3.0.0.0-<$version>/hive/
hdfs dfs -chmod -R 555 /phd/apps/3.0.0.0-<$version>/pig
hdfs dfs -chmod -R 444 /phd/apps/3.0.0.0-<$version>/pig/pig.tar.gz
hdfs dfs -chmod -R 555 /phd/apps/3.0.0.0-<$version>/hive
hdfs dfs -chmod -R 444 /phd/apps/3.0.0.0-<$version>/hive/hive.tar.gz
hdfs dfs -chown -R hdfs:hadoop /phd

Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below.

    If any of these directories already exist, we recommend deleting and recreating them. Use the following instructions to set up the WebHCat directories:
  • We strongly suggest that you edit and source the bash script files included in the companion files (downloaded in Download Companion Files). Alternatively, you can copy the contents to your ~/.bash_profile to set up these environment variables in your environment.

  • Execute these commands on your WebHCat server machine to create log and pid directories.

    mkdir -p $WEBHCAT_LOG_DIR
    chown -R $WEBHCAT_USER:$HADOOP_GROUP $WEBHCAT_LOG_DIR
    chmod -R 755 $WEBHCAT_LOG_DIR

    mkdir -p $WEBHCAT_PID_DIR
    chown -R $WEBHCAT_USER:$HADOOP_GROUP $WEBHCAT_PID_DIR
    chmod -R 755 $WEBHCAT_PID_DIR

    where:

    • $WEBHCAT_LOG_DIR is the directory to store the WebHCat logs. For example, /var/log/webhcat.

    • $WEBHCAT_PID_DIR is the directory to store the WebHCat process ID. For example, /var/run/webhcat.

    • $WEBHCAT_USER is the user owning the WebHCat services. For example, hcat.

    • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

  • Set permissions for the WebHCat server to impersonate users on the Hadoop cluster. Create a Unix user to run the WebHCat server. Modify the Hadoop core-site.xml file and set these properties:

    • hadoop.proxyuser.USER.groups: A comma-separated list of the Unix groups whose users will be impersonated.

    • hadoop.proxyuser.USER.hosts: A comma-separated list of the hosts that will run the HCatalog and JobTracker servers.

  • If you are running WebHCat on a secure cluster, create a Kerberos principal for the WebHCat server with the name USER/host@realm.

    Also, set the WebHCat configuration variables templeton.kerberos.principal and templeton.kerberos.keytab.

Modify WebHCat Configuration Files

    Use the following instructions to modify the WebHCat config files:
  • Extract the WebHCat configuration files to a temporary directory.

    The files are located in the configuration_files/webhcat directory where you decompressed the companion files.

  • Modify the configuration files.

    In the temporary directory, locate the following files and modify the properties based on your environment.

    Search for TODO in the files for the properties to replace.

    • Edit the WebHCat companion files.

      • Edit the webhcat-site.xml and modify the following properties:

        <property>
             <name>templeton.hive.properties</name>
             <value>hive.metastore.local=false,hive.metastore.uris=thrift://TODO-METASTORE-HOSTNAME:9083,hive.metastore.sasl.enabled=yes,hive.metastore.execute.setugi=true,hive.metastore.warehouse.dir=/apps/hive/warehouse</value>
             <description>Properties to set when running Hive.</description>
        </property>
        
        <property>
             <name>templeton.zookeeper.hosts</name>
             <value>$zookeeper1.full.hostname:2181,$zookeeper2.full.hostname:2181,..</value>
             <description>ZooKeeper servers, as comma separated HOST:PORT pairs.</description>
        </property>
      • In core-site.xml, make sure the following properties are also set to allow WebHcat to impersonate groups and hosts:

        <property>
            <name>hadoop.proxyuser.hcat.groups</name>
            <value>*</value>
         </property>
        
         <property>
             <name>hadoop.proxyuser.hcat.hosts</name>
             <value>*</value>
         </property> 

        where:

        • hadoop.proxyuser.hcat.groups is a comma-separated list of the Unix groups whose users may be impersonated by the hcat user.

        • hadoop.proxyuser.hcat.hosts is a comma-separated list of the hosts that are allowed to submit requests as the hcat user.

  • Set up the updated WebHCat configuration files.

    • Delete any existing WebHCat configuration files:

      rm -rf  $WEBHCAT_CONF_DIR/*
    • Copy all of the modified config files to $WEBHCAT_CONF_DIR and set appropriate permissions:

      chown -R $WEBHCAT_USER:$HADOOP_GROUP $WEBHCAT_CONF_DIR
      chmod -R 755 $WEBHCAT_CONF_DIR

      where:

      • $WEBHCAT_CONF_DIR is the directory to store the WebHCat configuration files. For example, /etc/hcatalog/conf/webhcat.

      • $WEBHCAT_USER is the user owning the WebHCat services. For example, hcat.

      • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Set Up HDFS User and Prepare WebHCat Directories

  • Set up the WebHCat user.

    Log in as $WEBHCAT_USER:

    hdfs dfs -mkdir /user/$WEBHCAT_USER
    hdfs dfs -chown -R $WEBHCAT_USER:$WEBHCAT_USER /user/$WEBHCAT_USER
    hdfs dfs -mkdir /apps/webhcat

  • Prepare WebHCat directories on HDFS.

    hdfs dfs -copyFromLocal /usr/share/PHD-webhcat/pig.tar.gz /apps/webhcat/
    hdfs dfs -copyFromLocal /usr/share/PHD-webhcat/hive.tar.gz /apps/webhcat/
    hdfs dfs -copyFromLocal /usr/phd/current/hadoop-mapreduce/hadoop-streaming*.jar /apps/webhcat/

  • Set appropriate permissions for the HDFS user and the webhcat directory.

    hdfs dfs -chown -R $WEBHCAT_USER:users /apps/webhcat
    hdfs dfs -chmod -R 755 /apps/webhcat

    where:

    • $WEBHCAT_USER is the user owning the WebHCat services. For example, hcat.

Validate the Installation

  • Start the WebHCat server. Log in as $WEBHCAT_USER:

    su -l hcat -c '/usr/phd/current/hive-webhcat/sbin/webhcat_server.sh --config /etc/hive-webhcat/conf/ start'

  • From the browser, type:

    http://$WebHCat.server.full.hostname:50111/templeton/v1/status

    You should see the following output:

    {"status":"ok","version":"v1"}

Installing Apache Oozie

This section describes installing and testing Apache Oozie, a server based workflow engine optimized for running workflows that execute Hadoop jobs.

Install the Oozie RPMs

  • On the Oozie server, install the necessary RPMs.

    • For RHEL/CentOS Linux:

      yum install oozie oozie-client

    • For SLES:

      zypper install oozie oozie-client

  • Optional - Enable the Oozie Web Console

    • Create a lib extension directory.

      cd /usr/phd/current/oozie

    • Add the Ext library to the Oozie application.

      • For RHEL/CentOS Linux:

        yum install ext-2.2-1
        cp /usr/share/PHD-oozie/ext-2.2.zip /usr/phd/current/oozie-client/libext/

      • For SLES:

        zypper install ext-2.2-1
        cp /usr/share/PHD-oozie/ext-2.2.zip /usr/phd/current/oozie-client/libext/

    • Add LZO JAR files.

      cp /usr/phd/current/hadoop/lib/hadooplzo-*.jar libext/

      To locate hadooplzo-*.jar, recall the product version you installed. For example, if you installed 3.0.0.0-249, the JAR is under /grid/0/phd/3.0.0.0-249/hadoop/lib/ and can be copied as follows:

      cp /grid/0/phd/3.0.0.0-249/hadoop/lib/hadooplzo-0.6.0.3.0.0.0-249.jar /usr/phd/current/oozie-client/libext/

  • Optional: Add database connector JAR files.

      For MySQL:
    • Copy your MySQL driver JAR to the libext directory.

      cp mysql-connector-java.jar /usr/phd/current/oozie-client/libext/

      For Oracle:
    • Copy your Oracle driver JAR to the libext directory.

      cp ojdbc6.jar /usr/phd/current/oozie-client/libext/

  • Make the following changes in /etc/oozie/conf/oozie-env.sh: Change:

    export CATALINA_BASE=${CATALINA_BASE:-/usr/phd/3.0.0.0-<$version>/oozie-server}

    To:

    export CATALINA_BASE=${CATALINA_BASE:-/usr/phd/current/oozie-client/oozie-server}

    Where <$version> is the build number of the release.
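    As an alternative to editing the file by hand, a sed one-liner can make the same change; this is only a sketch and assumes oozie-env.sh is in the default location:

    # Sketch: point CATALINA_BASE at the versionless oozie-client path.
    sed -i 's|^export CATALINA_BASE=.*|export CATALINA_BASE=${CATALINA_BASE:-/usr/phd/current/oozie-client/oozie-server}|' /etc/oozie/conf/oozie-env.sh
    grep CATALINA_BASE /etc/oozie/conf/oozie-env.sh    # verify the change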

  • Create a WAR file so that Oozie is ready to run:

    cd $OOZIE_HOME   # for example, /usr/phd/current/oozie-client
    bin/oozie-setup.sh prepare-war

Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below.

    If any of these directories already exist, delete and recreate them. Use the following instructions to set up the Oozie directories:
  • We strongly suggest that you edit and source the bash script files included in the companion files (downloaded in Download Companion Files).

    Alternatively, you can copy the contents into your ~/.bash_profile to set up these environment variables.

  • Run the following commands on your Oozie server:

    mkdir -p $OOZIE_DATA; chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_DATA; chmod -R 755 $OOZIE_DATA;

    mkdir -p $OOZIE_LOG_DIR; chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_LOG_DIR; chmod -R 755 $OOZIE_LOG_DIR;

    mkdir -p $OOZIE_PID_DIR; chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_PID_DIR; chmod -R 755 $OOZIE_PID_DIR;

    mkdir -p $OOZIE_TMP_DIR; chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_TMP_DIR; chmod -R 755 $OOZIE_TMP_DIR;

    mkdir -p /etc/oozie/conf/action-conf; chown -R $OOZIE_USER:$HADOOP_GROUP /etc/oozie/conf/action-conf; chmod -R 755 /etc/oozie/conf/action-conf;

    where:

    • $OOZIE_DATA is the directory to store the Oozie data. For example, /var/db/oozie.

    • $OOZIE_LOG_DIR is the directory to store the Oozie logs. For example, /var/log/oozie.

    • $OOZIE_PID_DIR is the directory to store the Oozie process ID. For example, /var/run/oozie.

    • $OOZIE_TMP_DIR is the directory to store the Oozie temporary files. For example, /var/tmp/oozie.

    • $OOZIE_USER is the user owning the Oozie services. For example, oozie.

    • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

Set Up the Oozie Configuration Files

    Complete the following instructions to set up Oozie configuration files:
  • Extract the Oozie configuration files to a temporary directory.

    The files are located in the configuration_files/oozie directory where you decompressed the companion files.
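    For example, assuming the companion files were decompressed under /tmp/companion-files (an assumed location), the templates can be staged as follows:

    # Sketch: stage the Oozie configuration templates in a temporary working directory.
    mkdir -p /tmp/oozie-conf
    cp -r /tmp/companion-files/configuration_files/oozie/* /tmp/oozie-conf/
    ls /tmp/oozie-conf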

  • Edit oozie-log4j.properties to add the following property:

    log4j.appender.oozie.layout.ConversionPattern=%d{ISO8601} %5p %c{1}:%L - SERVER[${oozie.instance.id}] %m%n

    where ${oozie.instance.id} is determined automatically by Oozie.

  • Modify the configuration files, based on your environment and Oozie database type.

For Derby:

In the temporary directory, locate the following file and modify its properties. Search for TODO in the file to find the properties to replace.

Edit oozie-site.xml and modify the following properties:

<property>
    <name>oozie.base.url</name>
    <value>http://$oozie.full.hostname:11000/oozie</value>
    <description>Enter your Oozie server hostname.</description>
</property>

<property>
    <name>oozie.service.StoreService.jdbc.url</name>
    <value>jdbc:derby:$OOZIE_DATA_DIR/$soozie.db.schema.name-db;create=true</value>
</property>

<property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>$OOZIE_DBUSER</value>
</property>

<property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>$OOZIE_DBPASSWD</value>
</property>

<property>
    <name>oozie.service.WorkflowAppService.system.libpath</name>
    <value>/user/$OOZIE_USER/share/lib</value>
</property>

For MySQL:

  • Install and start MySQL 5.x.

    Note: See Metastore Database Requirements for supported versions, and Installing and Configuring MySQL 5.x for installation instructions.

  • Create the Oozie database and Oozie MySQL user. For example, using the MySQL mysql command-line tool:

    $ mysql -u root -p
    Enter password: ******
    mysql> create database oozie;
    Query OK, 1 row affected (0.03 sec)
    mysql> grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';
    Query OK, 0 rows affected (0.03 sec)
    mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';
    Query OK, 0 rows affected (0.03 sec)
    mysql> exit
    Bye

  • Configure Oozie to use MySQL.

    ...
    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://localhost:3306/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>
    ...

  • Add the MySQL JDBC driver JAR to Oozie.

    Copy or symlink the MySQL JDBC driver JAR into the /var/lib/oozie/ directory.

For Postgres:

See Metastore Database Requirements for supported versions and Installing and Configuring PostgreSQL for instructions.

  • Create the Oozie user and Oozie database. For example, using the Postgres psql command-line tool:

    $ psql -U postgres
    Password for user postgres: *****
    postgres=# CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD 'oozie' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
    CREATE ROLE
    postgres=# CREATE DATABASE "oozie" WITH OWNER = oozie ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF8' LC_CTYPE = 'en_US.UTF8' CONNECTION LIMIT = -1;
    CREATE DATABASE
    postgres=# \q

  • Configure Postgres to accept network connections for user oozie. Edit the Postgres data/pg_hba.conf file as follows:

    host oozie oozie 0.0.0.0/0 md5

  • Reload the Postgres configuration.

    $ sudo -u postgres pg_ctl reload -s -D /opt/PostgreSQL/8.4/data

  • Configure Oozie to use Postgres.

    Edit the oozie-site.xml file as follows:

    ...
    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:postgresql://localhost:5432/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>
    ...

For Oracle:

See Metastore Database Requirements for supported versions. See your third-party documentation for installation instructions.

  • Install and start Oracle 11g.

  • Create the Oozie Oracle user.

    For example, using the Oracle sqlplus command-line tool:

    $ sqlplus system@localhost
    Enter password: ******
    SQL> create user oozie identified by oozie default tablespace users temporary tablespace temp;
    User created.
    SQL> grant all privileges to oozie;
    Grant succeeded.
    SQL> exit
    $

  • Configure Oozie to use Oracle.

    Create an Oracle database schema:

    Set oozie.service.JPAService.create.db.schema to true and oozie.db.schema.name=oozie.

    Edit the oozie-site.xml file as follows:

    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>oracle.jdbc.driver.OracleDriver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:oracle:thin:@localhost:1521:oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>

  • Add the Oracle JDBC driver JAR to Oozie. Copy or symlink the Oracle JDBC driver JAR in the /var/lib/oozie/ directory.

    ln -s ojdbc6.jar /usr/phd/current/oozie-server/lib

  • Edit oozie-env.sh and modify the following properties to match the directories created:

    export JAVA_HOME=/usr/java/default
    export OOZIE_CONFIG=${OOZIE_CONFIG:-/usr/phd/3.0.0.0-249/oozie/conf}
    export OOZIE_DATA=${OOZIE_DATA:-/var/db/oozie}
    export OOZIE_LOG=${OOZIE_LOG:-/var/log/oozie}
    export CATALINA_BASE=${CATALINA_BASE:-/usr/phd/3.0.0.0-249/oozie}
    export CATALINA_TMPDIR=${CATALINA_TMPDIR:-/var/tmp/oozie}
    export CATALINA_PID=${CATALINA_PID:-/var/run/oozie/oozie.pid}
    export OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
    export OOZIE_CLIENT_OPTS="${OOZIE_CLIENT_OPTS} -Doozie.connection.retry.count=5"
    export CATALINA_OPTS="${CATALINA_OPTS} -Xmx2048m -XX:MaxPermSize=256m"
    export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

  • Copy the Configuration Files.

    On your Oozie server create the config directory, copy the configuration files, and set the permissions:

    rm -r $OOZIE_CONF_DIR; mkdir -p $OOZIE_CONF_DIR;

  • Copy all the config files to the $OOZIE_CONF_DIR directory (a sketch of this copy appears after this list).

  • Set appropriate permissions.

    chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_CONF_DIR/../ ; chmod -R 755 $OOZIE_CONF_DIR/../ ;

    where:

    • $OOZIE_CONF_DIR is the directory to store Oozie configuration files. For example, /etc/oozie/conf.

    • $OOZIE_DATA is the directory to store the Oozie data. For example, /var/db/oozie.

    • $OOZIE_LOG_DIR is the directory to store the Oozie logs. For example, /var/log/oozie.

    • $OOZIE_PID_DIR is the directory to store the Oozie process ID. For example, /var/run/oozie.

    • $OOZIE_TMP_DIR is the directory to store the Oozie temporary files. For example, /var/tmp/oozie.

    • $OOZIE_USER is the user owning the Oozie services. For example, oozie.

    • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
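    A minimal sketch of the copy referenced above, assuming the edited files are still in the temporary staging directory (/tmp/oozie-conf is an assumption):

    # Sketch: copy the edited Oozie configuration files into place and reapply ownership.
    cp /tmp/oozie-conf/* $OOZIE_CONF_DIR/
    chown -R $OOZIE_USER:$HADOOP_GROUP $OOZIE_CONF_DIR/../
    chmod -R 755 $OOZIE_CONF_DIR/../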

Configure your Database for Oozie

  • For Derby:

    No database configuration is required.

  • For MySQL:

    echo "create database if not exists oozie;" | mysql -u rootecho "grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';" | mysql -u root echo "grant all privileges on oozie.* to 'oozie'@`hostname -f` identified by 'oozie';" | mysql -u root

  • For Postgres:

    echo "CREATE DATABASE oozie;" | psql -U postgresecho "CREATE USER oozie WITH PASSWORD 'oozie';" | psql -U postgres echo "GRANT ALL PRIVILEGES ON DATABASE oozie TO oozie;" | psql -U postgres echo "CREATE DATABASE oozie;" | psql -U postgresecho "CREATE USER oozie WITH PASSWORD 'oozie';" | psql -U postgres echo "GRANT ALL PRIVILEGES ON DATABASE oozie TO oozie;" | psql -U postgres

  • For Oracle:

    bash -l -c 'echo "create user oozie identified by oozie;" | sqlplus system/root@`hostname -f`'
    bash -l -c 'echo "GRANT SELECT_CATALOG_ROLE TO oozie;" | sqlplus system/root@`hostname -f`'
    bash -l -c 'echo "GRANT CONNECT, RESOURCE TO oozie;" | sqlplus system/root@`hostname -f`'

Validate the Installation

Use these steps to validate your installation.

  • If you are using a non-Derby database, copy your database connector JAR file into /usr/lib/oozie/libext.
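    For example, for a MySQL metastore the copy might look like the following (the JAR path is an assumption; use the driver JAR you downloaded):

    # Sketch: make the JDBC driver available to the Oozie setup script.
    mkdir -p /usr/lib/oozie/libext
    cp /usr/share/java/mysql-connector-java.jar /usr/lib/oozie/libext/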

  • Run the setup script to prepare the Oozie Server:

    cd /usr/phd/current/oozie/
    bin/oozie-setup.sh prepare-war
    chmod 777 -R /var/log/oozie
    ln -s /etc/oozie/conf/action-conf /etc/oozie/conf.dist/action-conf

  • Create the Oozie DB schema:

    cd /usr/phd/current/oozie/
    bin/ooziedb.sh create -sqlfile oozie.sql -run

    The output should include: Validate DB Connection

  • Start the Oozie server:

    su -l oozie -c "/usr/phd/current/oozie-server/bin/oozied.sh start"

    where oozie is the $OOZIE_USER.

  • Confirm that you can browse to the Oozie server:

    http://{oozie.full.hostname}:11000/oozie

  • Access the Oozie Server with the Oozie client:

    oozie admin -oozie http://$oozie.full.hostname:11000/oozie -status

    You should see the following output:

    System mode: NORMAL

Next Steps

For example workflow templates, download the companion files and use the oozie_workflows folder.

For more information about Apache Oozie, see http://oozie.apache.org/docs/4.0.0/ .

Installing Ranger

Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, auditing, and data protection.

This chapter describes how to install Apache Ranger and the Ranger plug-ins in a Linux Hadoop environment, and how to administer the plug-ins over Ambari.

For information about how to use Ranger, see the Ranger User Guide.

Installation Prerequisites

Before beginning Ranger installation, make sure the following software is already installed:

  • JDK 7 or above (available from the Oracle Java Download site)

  • Operating Systems supported:

    64-bit CentOS 6

    64-bit Red Hat Enterprise Linux (RHEL) 6

    64-bit SUSE Linux Enterprise Server (SLES) 11 SP3

  • Databases supported:

    MySQL v. 5.6 or above

    If the database server is not installed on the same host, the Ranger services must have network access to the database server host.

Manual Installation

This section describes how to manually install the Ranger Policy Manager, Ranger UserSync, and the Ranger plug-ins.

Installing Policy Manager

This section describes how to perform the following administrative tasks:

  • Install the Ranger Policy Manager

  • Start the Policy Manager service

  • Prepare for HDFS, HBase, Hive, or Knox plug-in installation

  • Make sure the PHD 3.0 repository is configured.

  • Find the Ranger Policy Admin software:

    yum search ranger
  • Install the Ranger Policy Admin software:

    yum install ranger_3_0_0_0_249-admin
  • At the Ranger Policy Administration installation directory, update the install.properties file as appropriate for PHD 3.0:

    • Go to the installation directory:

      cd /usr/phd/3.0.0.0-<version>/ranger-admin/

    • Edit the following install.properties entries using a text editor (for example, vi or emacs):

      Configuration Property Name

      Default/Example Value

      Required?

      Ranger Policy Database

      DB_FLAVOR Specifies the type of database used for audit logging (MYSQL)

      MYSQL (default)

      Y

      SQL_CONNECTOR_JAR Path to SQL connector JAR. DB driver location for Mysql.

      /usr/share/java/mysql-connector-java.jar (default)

      Y

      db_root_user database username who has privileges for creating database schemas and users

      root (default)

      Y

      db_root_password database password for the "db_root_user"

      rootPassW0Rd

      Y

      db_host Hostname of the ranger policy database server

      localhost

      Y

      db_name Ranger Policy database name

      ranger (default)

      Y

      db_user db username used for performing all policy mgmt operation from policy admin tool

      rangeradmin (default)

      Y

      db_password database password for the "db_user"

      RangerAdminPassW0Rd

      Y

      Ranger Audit Database

      audit_db_name Ranger audit database name - this can be a different database on the same database server mentioned above

      ranger_audit (default)

      Y

      audit_db_user Ranger audit database user

      rangerlogger (default)

      Y

      audit_db_password database password for the "audit_db_user"

      RangerLoggerPassW0Rd

      Y

      Policy Admin Tool Config

      policymgr_external_url URL used within Policy Admin tool when a link to its own page is generated in the Policy Admin Tool website

      http://localhost:6080 (default), http://myexternalhost.example.com:6080

      N

      policymgr_http_enabled Enables/disables HTTP protocol for downloading policies by Ranger plugins

      true (default)

      Y

      unix_user UNIX user who runs the Policy Admin Tool process

      ranger (default)

      Y

      unix_group UNIX group associated with the UNIX user who runs the Policy Admin Tool process

      ranger (default)

      Y

      Policy Admin Tool Authentication

      authentication_method Authentication method used to log in to the Policy Admin Tool. NONE: only users created within the Policy Admin Tool may log in. UNIX: allows UNIX user ID authentication using the UNIX Authentication Service (see below). LDAP: allows corporate LDAP authentication (see below). ACTIVE_DIRECTORY: allows authentication using an Active Directory.

      none (default)

      Y

      UNIX Authentication Service

      remoteLoginEnabled Flag to enable/disable remote Login via Unix Authentication Mode

      true (default)

      Y, if UNIX authentication_method is selected

      authServiceHostName Server name (or IP address) where the ranger-usersync module is running (along with the UNIX Authentication Service)

      localhost (default) myunixhost.domain.com

      Y, if UNIX authentication_method is selected

      authServicePort Port number on which the ranger-usersync module runs the UNIX Authentication Service

      5151 (default)

      Y, if UNIX authentication_method is selected

      LDAP Authentication

      xa_ldap_url URL for the LDAP service

      ldap://192.0.2.0:389

      Y, if LDAP authentication_method is selected

      xa_ldap_userDNpattern LDAP DN Pattern used to uniquely locate the login user

      uid={0},ou=users,dc=xasecure,dc=net

      Y, if LDAP authentication_method is selected

      xa_ldap_groupSearchBase LDAP Base node location to get all groups associated with login user

      ou=groups,dc=xasecure,dc=net

      Y, if LDAP authentication_method is selected

      xa_ldap_groupSearchFilter LDAP search filter used to retrieve groups for the login user

      (member=uid={0},ou=users,dc=xasecure,dc=net)

      Y, if LDAP authentication_method is selected

      xa_ldap_groupRoleAttribute Attribute used to retrieve the group names from the group search filters

      cn

      Y, if LDAP authentication_method is selected

      Active Directory Authentication

      xa_ldap_ad_domain Active Directory Domain Name used for AD login

      example.com

      Y, if ACTIVE_DIRECTORY authentication_method is selected

      xa_ldap_ad_url Active Directory LDAP URL for authentication of user

      ldap://ad.example.com:389

      Y, if ACTIVE_DIRECTORY authentication_method is selected

  • Ensure the JAVA_HOME environment variable has been set.

    If the JAVA_HOME environment variable has not yet been set, enter:

    export JAVA_HOME=<path of installed jdk version folder>

  • Install the Ranger Policy Administration service:

    cd /usr/phd/<version>/ranger-admin
    ./setup.sh
  • Start the Ranger Policy Administration service by entering the following command:

    service ranger-admin start
  • To verify that the service started, open the external URL in a browser; for example: http://<host_address>:6080/
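    If the host has no browser, a quick command-line check works as well (replace <host_address> with the Ranger Admin host):

    # Sketch: confirm that the Policy Admin web application answers on its port.
    curl -s -o /dev/null -w "%{http_code}\n" "http://<host_address>:6080/"
    # An HTTP 200 (or a redirect to the login page) indicates the service is up.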

Installing UserSync

To install Ranger UserSync and start the service, do the following:

  • Find the Ranger UserSync software:

    yum search usersync

    or:

    yum list | grep usersync

  • Install Ranger UserSync:

    yum install ranger_3_0_0_0_249-usersync.x86_64

  • At the Ranger UserSync installation directory, update install.properties file as appropriate for PHD 3.0:

    Configuration Property Name

    Default/Example Value

    Required?

    Policy Admin Tool

    POLICY_MGR_URL URL for policy admin

    http://policymanager.example.com:6080

    Y

    User Group Source Information

    SYNC_SOURCE Specifies where the user/group information is extracted from to be put into the Ranger database. unix: get user information from the /etc/passwd file and group information from the /etc/group file. ldap: get user information from an LDAP service (see below for more information).

    unix

    N

    SYNC_INTERVAL Specifies the interval (in minutes) between synchronization cycles. Note that the second sync cycle will NOT start until the first sync cycle is COMPLETE.

    5

    N

    UNIX user/group Synchronization

    MIN_UNIX_USER_ID_TO_SYNC User IDs below this value will not be synchronized to the Ranger user database

    300 (Unix default), 1000 (LDAP default)

    Mandatory if SYNC_SOURCE is selected as unix

    LDAP user/group synchronization

    SYNC_LDAP_URL URL of source ldap

    ldap://ldap.example.com:389

    Mandatory if SYNC_SOURCE is selected as ldap

    SYNC_LDAP_BIND_DN ldap bind dn used to connect to ldap and query for users and groups

    cn=admin,ou=users,dc=hadoop,dc=apache,dc=org

    Mandatory if SYNC_SOURCE is selected as ldap

    SYNC_LDAP_BIND_PASSWORD ldap bind password for the bind dn specified above

    LdapAdminPassW0Rd

    Mandatory if SYNC_SOURCE is selected as ldap

    CRED_KEYSTORE_FILENAME Location of the file where the encrypted password is kept

    /usr/lib/xausersync/.jceks/xausersync.jceks (default) /etc/ranger/usersync/.jceks/xausersync.jceks

    Mandatory if SYNC_SOURCE is selected as ldap

    SYNC_LDAP_USER_SEARCH_BASE search base for users

    ou=users,dc=hadoop,dc=apache,dc=org

    Mandatory if SYNC_SOURCE is selected as ldap

    SYNC_LDAP_USER_SEARCH_SCOPE search scope for the users, only base, one and sub are supported values

    sub (default)

    N

    SYNC_LDAP_USER_OBJECT_CLASS objectclass to identify user entries

    person (default)

    N (defaults to person)

    SYNC_LDAP_USER_SEARCH_FILTER optional additional filter constraining the users selected for syncing

    (dept=eng)

    N (defaults to an empty string)

    SYNC_LDAP_USER_NAME_ATTRIBUTE attribute from user entry that would be treated as user name

    cn (default)

    N (defaults to cn)

    SYNC_LDAP_USER_GROUP_NAME_ATTRIBUTE attribute from user entry whose values would be treated as group values to be pushed into Policy Manager database. You could provide multiple attribute names separated by comma

    memberof,ismemberof (default)

    N (defaults to memberof, ismemberof)

    User Synchronization

    unix_user Unix User who runs the ranger-usersync process

    ranger (default)

    Y

    unix_group Unix group associated with Unix user who runs the ranger-usersync process

    ranger (default)

    Y

    SYNC_LDAP_USERNAME_CASE_CONVERSION Convert all user names to lower/upper case. none: no conversion is done; the name is kept as it is in the SYNC_SOURCE. lower: convert to lower case when saving it to the Ranger db. upper: convert to upper case when saving it to the Ranger db.

    lower (default)

    N (defaults to lower)

    SYNC_LDAP_GROUPNAME_CASE_CONVERSION Convert all group names to lower/upper case. none: no conversion is done; the name is kept as it is in the SYNC_SOURCE. lower: convert to lower case when saving it to the Ranger db. upper: convert to upper case when saving it to the Ranger db.

    lower (default)

    N (defaults to lower)

    logdir Location of the log directory where the usersync logs are stored

    logs (default)

    Y

  • Set the Policy Manager URL to http://<ranger-admin-host>:6080

  • Check the JAVA_HOME environment variable:

    If JAVA_HOME has not yet been set, enter:

    export JAVA_HOME=<path of installed jdk version folder>

  • Install the Ranger UserSync service:

    cd /usr/phd/<version>/ranger-usersync ./setup.sh

  • Start the Ranger UserSync service:

    service ranger-usersync start

  • To verify that the service was successfully started, wait 6 hours for LDAP and AD to synchronize, then do the following:

    • Go to http://<ranger-admin-host>:6080

    • Click the Users/Group tab. See if users and groups are synchronized.

    • Add a UNIX/LDAP/AD user, then check for the presence of that user in the Ranger Admin tab.

Installing Ranger Plug-ins

All Ranger plug-ins can be installed manually. Once installed, they can be administered over Ambari.

HDFS

The Ranger HDFS plugin helps to centralize HDFS authorization policies.

This section describes how to create an HDFS repository in Policy Manager, and install the HDFS plug-in. Once these steps are completed, the HDFS plug-in can be managed using Ambari as described later in this chapter.

  • Create an HDFS repository in the Ranger Policy Manager. To create an HDFS repository, refer to the Repository Manager section of the Ranger User Guide.

    Complete the HDFS Create Repository screen in the Policy Manager, as described in the Ranger User Guide.

    Make a note of the name you gave to this repository; you will need to use it again during HDFS plug-in setup.

  • At all servers where NameNode is installed, install the HDFS plug-in by following the steps listed below:

    • Go to the home directory of the HDFS plug-in:

      /usr/phd/<version>/ranger-hdfs-plugin
    • Edit the following HDFS-related install.properties using a text editor (e.g., emacs, vi, etc).

      Configuration Property Name

      Default/Example Value

      Required?

      Policy Admin Tool

      POLICY_MGR_URL URL for policy admin

      http://policymanager.example.com:6080

      Y

      REPOSITORY_NAME The repository name used in Policy Admin Tool for defining policies

      hadoopdev

      Y

      Audit Database

      SQL_CONNECTOR_JAR Path to SQL connector JAR. DB driver location for Mysql.

      /usr/share/java/mysql-connector-java.jar (default)

      Y

      XAAUDIT.DB.IS_ENABLED Flag to enable/disable database audit logging. If database audit logging is turned off, access control events are not logged to the database.

      FALSE (default) TRUE

      Y

      XAAUDIT.DB.FLAVOUR Specifies the type of database used for audit logging (MYSQL)

      MYSQL (default)

      Y

      XAAUDIT.DB.HOSTNAME Hostname of the audit database server

      localhost

      Y

      XAAUDIT.DB.DATABASE_NAME Audit database name

      ranger_audit

      Y

      XAAUDIT.DB.USER_NAME Username used for performing audit log inserts (should be same username used in the ranger-admin installation process)

      rangerlogger

      Y

      XAAUDIT.DB.PASSWORD database password associated with the above database user - for db audit logging

      rangerlogger

      Y

      HDFS Audit

      XAAUDIT.HDFS.IS_ENABLED Flag to enable/disable HDFS audit logging. If HDFS audit logging is turned off, access control events are not logged to HDFS.

      Y

      XAAUDIT.HDFS.DESTINATION_DIRECTORY HDFS directory where the audit log will be stored

      hdfs://__REPLACE__NAME_NODE_HOST:8020/ (format) hdfs://namenode.mycompany.com:8020/ranger/audit/%app-type%/%time:yyyyMMdd%

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY Local directory where the audit log will be saved for intermediate storage

      __REPLACE__LOG_DIR/%app-type%/audit (format) /var/log/%app-type%/audit

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY Local directory where the audit log will be archived after it is moved to hdfs

      __REPLACE__LOG_DIR/%app-type%/audit/archive (format) /var/log/%app-type%/audit/archive

      Y

      XAAUDIT.HDFS.DESTINTATION_FILE hdfs audit file name (format)

      %hostname%-audit.log (default)

      Y

      XAAUDIT.HDFS.DESTINTATION_FLUSH_INTERVAL_SECONDS hdfs audit log file writes are flushed to HDFS at regular flush interval

      900

      Y

      XAAUDIT.HDFS.DESTINTATION_ROLLOVER_INTERVAL_SECONDS hdfs audit log file is rotated to write to a new file at a rollover interval specified here

      86400

      Y

      XAAUDIT.HDFS.DESTINTATION_OPEN_RETRY_INTERVAL_SECONDS If the HDFS audit log open() call fails, it is retried at this interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FILE Local filename used to store in audit log (format)

      %time:yyyyMMdd-HHmm.ss%.log (default)

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FLUSH_INTERVAL_SECONDS Local audit log file writes are flushed to filesystem at regular flush interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_ROLLOVER_INTERVAL_SECONDS Local audit log file is rotated to write to a new file at a rollover interval specified here

      600

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_MAX_FILE_COUNT The maximum number of local audit log files that will be kept in the archive directory

      10

      Y

      SSL Information (https connectivity to Policy Admin Tool)

      SSL_KEYSTORE_FILE_PATH Java Keystore Path where SSL key for the plugin is stored. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/hadoop/conf/ranger-plugin-keystore.jks (default)

      Only if SSL is enabled

      SSL_KEYSTORE_PASSWORD Password associated with SSL Keystore. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      none (default)

      Only if SSL is enabled

      SSL_TRUSTSTORE_FILE_PATH Java Keystore Path where the trusted certificates are stored for verifying SSL connection to Policy Admin Tool. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/hadoop/conf/ranger-plugin-truststore.jks (default)

      Only if SSL is enabled

      SSL_TRUSTORE_PASSWORD Password associated with Truststore file. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      none (default)

      Only if SSL is enabled

  • Enable the HDFS plug-in by entering the following commands:

    cd /usr/phd/<version>/ranger-hdfs-plugin
    ./enable-hdfs-plugin.sh
  • To confirm that installation and configuration are complete, go to the Audit Tab of the Ranger Admin Console and check Agents. You should see HDFS listed there.

HBase

The Ranger HBase Plug-in integrates with HBase to enforce authorization policies.

This section describes how to perform the following tasks:

  • Install the Ranger Policy Manager

  • Start the Policy Manager service

  • Create an HBase repository in Policy Manager

  • Install the HBase plug-in

    Once these steps are completed, the HBase plug-in can be administered over Ambari as described later in this chapter.

  • Create an HBase repository in the Ranger Policy Manager. To create the HBase repository, refer to the Repository Manager section in the Ranger User Guide for the steps on how to create a repository.

    Complete the HBase Create Repository screen in the Policy Manager, as described in the Ranger User Guide.

    Make a note of the name you gave to this repository; you will use it again during HBase plug-in setup.

  • At all servers where the HBase Master and HBase RegionServer are installed, install the HBase plug-in by following the steps listed below:

    • Go to the home directory of the HBase plug-in:

      /usr/phd/<version>/ranger-hbase-plugin
    • Edit the following HBase-related install.properties using a text editor (e.g. emacs, vi, etc.):

      Configuration Property Name

      Default/Example Value

      Required?

      Policy Admin Tool

      POLICY_MGR_URL URL for policy admin

      http://policymanager.example.com:6080

      Y

      REPOSITORY_NAME The repository name used in Policy Admin Tool for defining policies

      hbasedev

      Y

      Audit Database

      SQL_CONNECTOR_JAR Path to SQL connector JAR. DB driver location for Mysql.

      /usr/share/java/mysql-connector-java.jar (default)

      Y

      XAAUDIT.DB.IS_ENABLED Flag to enable/disable database audit logging. If database audit logging is turned off, access control events are not logged to the database.

      FALSE (default)

      Y

      XAAUDIT.DB.FLAVOUR Specifies the type of database used for audit logging (MYSQL)

      MYSQL (default)

      Y

      XAAUDIT.DB.HOSTNAME Hostname of the audit database server

      localhost

      Y

      XAAUDIT.DB.DATABASE_NAME Audit database name

      ranger_audit

      Y

      XAAUDIT.DB.USER_NAME Username used for performing audit log inserts (should be same username used in the ranger-admin installation process)

      rangerlogger

      Y

      XAAUDIT.DB.PASSWORD Database password associated with the above database user - for db audit logging

      rangerlogger

      Y

      HDFS Audit

      XAAUDIT.HDFS.IS_ENABLED Flag to enable/disable HDFS audit logging. If HDFS audit logging is turned off, access control events are not logged to HDFS.

      TRUE

      Y

      XAAUDIT.HDFS.DESTINATION_DIRECTORY HDFS directory where the audit log will be stored

      hdfs://__REPLACE__NAME_NODE_HOST:8020/ranger/audit/%app-type%/%time:yyyyMMdd% (format) hdfs://namenode.mycompany.com:8020/ranger/audit/%app-type%/%time:yyyyMMdd%

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY Local directory where the audit log will be saved for intermediate storage

      __REPLACE__LOG_DIR/%app-type%/audit (format) /var/tmp/%app-type%/audit

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY Local directory where the audit log will be archived after it is moved to hdfs

      __REPLACE__LOG_DIR/%app-type%/audit/archive (format) /var/tmp/%app-type%/audit/archive

      Y

      XAAUDIT.HDFS.DESTINTATION_FILE HDFS audit file name (format)

      %hostname%-audit.log (default)

      Y

      XAAUDIT.HDFS.DESTINTATION_FLUSH_INTERVAL_SECONDS HDFS audit log file writes are flushed to HDFS at regular flush interval

      900

      Y

      XAAUDIT.HDFS.DESTINTATION_ROLLOVER_INTERVAL_SECONDS HDFS audit log file is rotated to write to a new file at a rollover interval specified here

      86400

      Y

      XAAUDIT.HDFS.DESTINTATION_OPEN_RETRY_INTERVAL_SECONDS If the HDFS audit log open() call fails, it is retried at this interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FILE Local filename used to store in audit log (format)

      %time:yyyyMMdd-HHmm.ss%.log (default)

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FLUSH_INTERVAL_SECONDS Interval that local audit log file writes are flushed to filesystem

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_ROLLOVER_INTERVAL_SECONDS Interval that local audit log file is rolled over (rotated to write to a new file)

      600

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_MAX_FILE_COUNT The maximum number of local audit log files that will be kept in the archive directory

      10

      Y

      SSL_KEYSTORE_FILE_PATH Java Keystore Path where SSL key for the plugin is stored. Used only if SSL is enabled between Policy Admin Tool and Plugin. If SSL is not enabled, leave the default value as it is (should not be set as EMPTY).

      /etc/hbase/conf/ranger-plugin-keystore.jks (default)

      Y, if SSL is enabled

      SSL_KEYSTORE_PASSWORD Password associated with SSL Keystore. Used only if SSL is enabled between Policy Admin Tool and Plugin. If SSL is not Enabled, leave the default value as it is (should not be set as EMPTY).

      myKeyFilePassword (default)

      Y, if SSL is enabled

      SSL_TRUSTSTORE_FILE_PATH Java Keystore Path where the trusted certificates are stored for verifying SSL connection to Policy Admin Tool. Used only if SSL is enabled between Policy Admin Tool and Plugin. If SSL is not enabled, leave the default value as it is (should not be set as EMPTY).

      /etc/hbase/conf/ranger-plugin-truststore.jks (default)

      Y, if SSL is enabled

      SSL_TRUSTORE_PASSWORD Password associated with Truststore file. Used only if SSL is enabled between Policy Admin Tool and Plugin. If SSL is not Enabled, leave the default value as it is (should not be set as EMPTY).

      changeit (default)

      Y, if SSL is enabled

      HBase GRANT/REVOKE Commands

      UPDATE_XAPOLICIES_ON_GRANT_REVOKE Provide ability for XAAgent to update the policies based on the GRANT/REVOKE commands from the HBase client

      TRUE (default)

      Y

  • Enable the HBase plug-in by entering the following commands:

    cd /usr/phd/<version>/ranger-hbase-plugin
    ./enable-hbase-plugin.sh
  • Re-start HBase.

  • To confirm that installation and configuration are complete, go to the Audit Tab of the Ranger Admin Console and check Agents. You should see HBase listed there.

Hive

The Ranger Hive plugin integrates with Hive to enforce authorization policies.

This section describes how to perform the following administrative tasks. Once these steps are completed, the Hive plug-in can be administered over Ambari as described later in this chapter.

  • Install the Ranger Policy Manager

  • Start the Policy Manager service

  • Create a Hive repository in Policy Manager

  • Install the Hive plug-in

  • Create a Hive repository in the Ranger Policy Manager. To create the Hive repository, follow the steps described in the Repository Manager section of the Ranger User Guide.

    Complete the Hive Create Repository screen in the Policy Manager, as described in the Ranger User Guide.

    Make a note of the name you gave to this repository; you will need to use it again during Hive plug-in setup.

  • At the server where HiveServer2 is installed, install the Hive plug-in:

    • Go to the home directory of the Hive plug-in:

      cd /usr/phd/<version>/ranger-hive-plugin
    • Edit the following Hive-related install.properties using a text editor (e.g., emacs, vi, etc).

      Configuration Property Name

      Default/Example Value

      Required?

      Policy Admin Tool

      POLICY_MGR_URL URL for policy admin

      http://policymanager.example.com:6080

      Y

      REPOSITORY_NAME The repository name used in Policy Admin Tool for defining policies

      hivedev

      Y

      Audit Database

      SQL_CONNECTOR_JAR Path to SQL connector JAR. DB driver location for Mysql.

      /usr/share/java/mysql-connector-java.jar (default)

      Y

      XAAUDIT.DB.IS_ENABLED Flag to enable/disable database audit logging. If database audit logging is turned off, access control events are not logged to the database.

      FALSE (default) TRUE

      Y

      XAAUDIT.DB.FLAVOUR Specifies the type of database used for audit logging (MYSQL)

      MYSQL (default)

      Y

      XAAUDIT.DB.HOSTNAME Hostname of the audit database server

      localhost

      Y

      XAAUDIT.DB.DATABASE_NAME Audit database name

      ranger_audit

      Y

      XAAUDIT.DB.USER_NAME Username used for performing audit log inserts (should be same username used in the ranger-admin installation process)

      rangerlogger

      Y

      XAAUDIT.DB.PASSWORD database password associated with the above database user - for db audit logging

      rangerlogger

      Y

      HDFS Audit

      XAAUDIT.HDFS.IS_ENABLED Flag to enable/disable HDFS audit logging. If HDFS audit logging is turned off, access control events are not logged to HDFS.

      Y

      XAAUDIT.HDFS.DESTINATION_DIRECTORY HDFS directory where the audit log will be stored

      hdfs://__REPLACE__NAME_NODE_HOST:8020/ranger/audit/%app-type%/%time:yyyyMMdd% (format) hdfs://namenode.mycompany.com:8020/ranger/audit/%app-type%/%time:yyyyMMdd%

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY Local directory where the audit log will be saved for intermediate storage

      __REPLACE__LOG_DIR/%app-type%/audit (format) /var/tmp/%app-type%/audit

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY Local directory where the audit log will be archived after it is moved to hdfs

      __REPLACE__LOG_DIR/%app-type%/audit (format) /var/tmp/%app-type%/audit/archive

      Y

      XAAUDIT.HDFS.DESTINTATION_FILE hdfs audit file name (format)

      %hostname%-audit.log (default)

      Y

      XAAUDIT.HDFS.DESTINTATION_FLUSH_INTERVAL_SECONDS hdfs audit log file writes are flushed to HDFS at regular flush interval

      900

      Y

      XAAUDIT.HDFS.DESTINTATION_ROLLOVER_INTERVAL_SECONDS hdfs audit log file is rotated to write to a new file at a rollover interval specified here

      86400

      Y

      XAAUDIT.HDFS.DESTINTATION_OPEN_RETRY_INTERVAL_SECONDS If the HDFS audit log open() call fails, it is retried at this interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FILE Local filename used to store in audit log (format)

      %time:yyyyMMdd-HHmm.ss%.log (default)

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FLUSH_INTERVAL_SECONDS Local audit log file writes are flushed to filesystem at regular flush interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_ROLLOVER_INTERVAL_SECONDS Local audit log file is rotated to write to a new file at a rollover interval specified here

      600

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_MAX_FILE_COUNT The maximum number of local audit log files that will be kept in the archive directory

      10

      Y

      SSL Information (https connectivity to Policy Admin Tool)

      SSL_KEYSTORE_FILE_PATH Java Keystore Path where SSL key for the plugin is stored. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/hive/conf/ranger-plugin-keystore.jks (default)

      If SSL is enabled

      SSL_KEYSTORE_PASSWORD Password associated with SSL Keystore. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      none (default)

      If SSL is enabled

      SSL_TRUSTSTORE_FILE_PATH Java Keystore Path where the trusted certificates are stored for verifying SSL connection to Policy Admin Tool. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/hive/conf/ranger-plugin-truststore.jks (default)

      If SSL is enabled

      SSL_TRUSTORE_PASSWORD Password associated with Truststore file. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      none (default)

      If SSL is enabled

      Hive GRANT/REVOKE Command Handling

      UPDATE_XAPOLICIES_ON_GRANT_REVOKE Provide ability for XAAgent to update the policies based on the grant/revoke commands from the Hive beeline client

      TRUE (default)

      Y

  • Enable the Hive plug-in by entering the following commands:

    cd /usr/phd/<version>/ranger-hive-plugin
    ./enable-hive-plugin.sh
  • Re-start Hive.

  • To confirm that installation and configuration are complete, go to the Audit Tab of the Ranger Admin Console and check Agents. You should see Hive listed there.

Knox

The Ranger Knox plugin integrates with Knox to enforce authorization policies.

This section describes how to perform the following administrative tasks:

  • Install the Ranger Policy Manager

  • Start the Policy Manager service

  • Create a Knox repository in Policy Manager

  • Install the Knox plug-in

These steps assume that Knox has already been installed, as described in the Installing Knox section of this document. Once these steps are completed, the Knox plug-in can be administered over Ambari as described later in this chapter.

  • Create a Knox repository in the Ranger Policy Manager. To create the Knox repository, refer to the Repository Manager section in the Ranger User Guide for the steps on how to create a repository

    Complete the Knox Create Repository screen in the Policy Manager, as described in the Ranger User Guide. Set the URL as https://knox_host:8443/gateway/admin/api/v1/topologies, where knox_host is the fully qualified domain name of your Knox host machine.

    Make a note of the name you gave to this repository; you will need to use it again during Knox plug-in setup.

  • At all servers where Knox Gateway is installed, install the Knox plug-in:

    • Go to the home directory of the Knox plug-in:

      /usr/phd/<version>/ranger-knox-plugin
    • Edit the following Knox-related install.properties using a text editor (e.g., emacs, vi, etc):

      Configuration Property Name

      Default/Example Value

      Mandatory?

      Policy Admin Tool

       

       

      POLICY_MGR_URL URL for policy admin

      http://policymanager.example.com:6080

      Y

      REPOSITORY_NAME The repository name used in Policy Admin Tool for defining policies

      knoxdev

      Y

      Knox Component Installation

       

       

      KNOX_HOME Home directory where Knox software is installed

      /usr/phd/current/knox

      Y

      Audit Database

       

       

      SQL_CONNECTOR_JAR Path to SQL connector JAR. DB driver location for Mysql

      /usr/share/java/mysql-connector-java.jar

      Y

      XAAUDIT.DB.IS_ENABLED Flag to enable/disable database audit logging. If database audit logging is turned off, access control events are not logged to the database.

      true

      Y

      XAAUDIT.DB.FLAVOUR Specifies the type of database used for audit logging (MYSQL)

      MYSQL

      Y

      XAAUDIT.DB.HOSTNAME Hostname of the audit database server

      localhost

      Y

      XAAUDIT.DB.DATABASE_NAME Audit database name

      ranger_audit

      Y

      XAAUDIT.DB.USER_NAME Username used for performing audit log inserts (should be same username used in the ranger-admin installation process)

      rangerlogger

      Y

      XAAUDIT.DB.PASSWORD database password associated with the above database user - for db audit logging

      rangerlogger

      Y

      HDFS Audit

       

       

      XAAUDIT.HDFS.IS_ENABLED Flag to enable/disable HDFS audit logging. If HDFS audit logging is turned off, access control events are not logged to HDFS.

       

      Y

      XAAUDIT.HDFS.DESTINATION_DIRECTORY HDFS directory where the audit log will be stored

      hdfs://namenode.mycompany.com:8020/ranger/audit/%app-type%/%time:yyyyMMdd%

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_DIRECTORY Local directory where the audit log will be saved for intermediate storage

      /var/tmp/%app-type%/audit

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_DIRECTORY Local directory where the audit log will be archived after it is moved to hdfs

      /var/tmp/%app-type%/audit/archive

      Y

      XAAUDIT.HDFS.DESTINTATION_FILE hdfs audit file name (format)

      %hostname%-audit.log

      Y

      XAAUDIT.HDFS.DESTINTATION_FLUSH_INTERVAL_SECONDS hdfs audit log file writes are flushed to HDFS at regular flush interval

      900

      Y

      XAAUDIT.HDFS.DESTINTATION_ROLLOVER_INTERVAL_SECONDS hdfs audit log file is rotated to write to a new file at a rollover interval specified here

      86400

      Y

      XAAUDIT.HDFS.DESTINTATION_OPEN_RETRY_INTERVAL_SECONDS If the HDFS audit log open() call fails, it is retried at this interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FILE Local filename used to store in audit log (format)

      %time:yyyyMMdd-HHmm.ss%.log

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_FLUSH_INTERVAL_SECONDS Local audit log file writes are flushed to filesystem at regular flush interval

      60

      Y

      XAAUDIT.HDFS.LOCAL_BUFFER_ROLLOVER_INTERVAL_SECONDS Local audit log file is rotated to write to a new file at a rollover interval specified here

      600

      Y

      XAAUDIT.HDFS.LOCAL_ARCHIVE_MAX_FILE_COUNT The maximum number of local audit log files that will be kept in the archive directory

      10

      Y

      SSL (https connectivity to Policy Admin Tool)

       

       

      SSL_KEYSTORE_FILE_PATH Java Keystore Path where SSL key for the plugin is stored. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/knox/conf/ranger-plugin-keystore.jks

      If SSL is enabled

      SSL_KEYSTORE_PASSWORD Password associated with SSL Keystore. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      myKeyFilePassword

      If SSL is enabled

      SSL_TRUSTSTORE_FILE_PATH Java Keystore Path where the trusted certificates are stored for verifying SSL connection to Policy Admin Tool. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      /etc/knox/conf/ranger-plugin-truststore.jks

      If SSL is enabled

      SSL_TRUSTORE_PASSWORD Password associated with Truststore file. Is used only if SSL is enabled between Policy Admin Tool and Plugin; If SSL is not Enabled, leave the default value as it is - do not set as EMPTY if SSL not used

      changeit

      If SSL is enabled

  • Enable the Knox plug-in by entering the following commands:

    cd /usr/phd/<version>/ranger-knox-plugin
    ./enable-knox-plugin.sh
  • Re-start Knox Gateway.

  • To confirm that installation and configuration are complete, go to the Audit Tab of the Ranger Admin Console and check Agents. You should see Knox listed there.

Verifying the Installation

To verify that installation was successful, perform any or all of the following checks:

  • Check whether the database RANGER_ADMIN_DB_NAME is present in the MySQL server running on RANGER_ADMIN_DB_HOST (a query sketch appears after this list)

  • Check whether the Database RANGER_AUDIT_DB_NAME is present in the MySQL server running on RANGER_AUDIT_DB_HOST

  • Check whether the “ranger-admin” service is installed in services.msc (Windows only)

  • Check whether the “ranger-usersync” service is installed in services.msc (Windows only)

  • If you plan to use the Ranger Administration Console with the UserSync feature, check whether both services start

  • Go to the Ranger Administration Console host URL and make sure you can log in using the default user credentials
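  • For the two database checks above, a quick query against the MySQL server is sufficient; the database names shown are the install.properties defaults (ranger and ranger_audit) and may differ in your environment:

    # Sketch: list the Ranger policy and audit databases on the MySQL host.
    mysql -u root -p -e "SHOW DATABASES LIKE 'ranger%';"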

Administering Ranger Plug-ins Over Ambari 1.7

This section describes how to configure the following Ranger plugins on top of the Ambari-installed cluster components:

  • HDFS

  • HBase

  • Hive

  • Knox

This section assumes you have already manually installed these plug-ins, as described in Installing Ranger Plug-ins under Manual Installation.

In Ambari 1.7, when you restart HDFS, HBase, Hive, or Knox for the first time via Ambari, their configuration files are overwritten, which causes the Ranger plug-ins for those services to fail. The workaround is to update the configuration of each component via Ambari, and then restart the components via Ambari.

Most of the actions in this section occur in the Ambari console, at the Config tabs for HDFS, HBase, Hive, and Knox.

Ambari console Config tabs, with HBase selected

HDFS

Starting HDFS via Ambari overrides certain parameters in hdfs-site.xml. To reinstate the lost Ranger parameters so that the HDFS plug-in can function, perform the following steps:

  • In the Ambari console, go to the HDFS Config tab.

  • At Advanced hdfs-site configuration, under Custom hdfs-site, set dfs.permissions.enabled to true.

  • Restart HDFS via Ambari. The changes you just made are effective immediately, and the Ranger HDFS plug-in is enabled.

HBase

Starting HBase via Ambari overrides certain parameters in hbase-site.xml. To reinstate the lost Ranger parameters so that the HBase plug-in can function, perform the following steps:

  • In the Ambari console, navigate to the HBase Config tab.

  • At Advanced hbase-site configuration:

    • Change parameter value for “hbase.security.authorization” to true.

    • Change the parameter value for “hbase.coprocessor.master.classes” to com.xasecure.authorization.hbase.XaSecureAuthorizationCoprocessor.

    • Change the parameter value for “hbase.coprocessor.region.classes” to com.xasecure.authorization.hbase.XaSecureAuthorizationCoprocessor.

  • If this is a secured cluster, go to Custom hbase-site and perform the following steps:

    • Add a new parameter value for “hbase.rpc.protection” and set it to PRIVACY.

    • Add a new parameter value for “hbase.rpc.engine” and set it to org.apache.hadoop.hbase.ipc.SecureRpcEngine.

  • Restart HBase via Ambari. The changes you just made are effective immediately, and the Ranger HBase plug-in is enabled.

Hive

Starting Hive via Ambari overrides certain parameters in hive-site.xml. To reinstate the lost Ranger parameters so that the Hive plug-in can function, perform the following steps:

  • Make sure the repository is set to Enabled.

  • Set the jdbc.url to jdbc:hive2://localhost:10000/default;auth=noSasl.

Knox

No special changes are needed to the Knox plug-in configuration when accessed through Ambari.

Installing Hue

Hue provides a Web application interface for Apache Hadoop. It supports a file browser, JobTracker interface, Hive, Pig, Oozie, HBase, and more.

Prerequisites

Complete the following prerequisites before deploying Hue.

  • Verify that you have a host that supports Hue:

    • 64-bit Red Hat Enterprise Linux (RHEL) 5 or 6

    • 64-bit CentOS 6

    • 64-bit SUSE Linux Enterprise Server (SLES) 11 SP3

    • Verify that you have a browser that supports Hue:

      • Linux (RHEL, CentOS, SLES): Firefox latest stable release, Google Chrome latest stable release

      • Windows (Vista, 7): Firefox latest stable release, Google Chrome latest stable release, Internet Explorer 9, Safari latest stable release

      • Mac OS X (10.6 or later): Firefox latest stable release, Google Chrome latest stable release, Safari latest stable release

    • Verify that you have Python 2.6.6 or higher installed.

    • Stop all the services in your cluster. For more information see the instructions provided in the Reference Guide.

    • Install and run the PHD 3.0 Hadoop cluster.

      The following table outlines dependencies on the PHD components:

      • HDFS - Required: Yes. Applications: Core, Filebrowser. Notes: HDFS access through WebHDFS or HttpFS.

      • YARN - Required: Yes. Applications: Core, Filebrowser. Notes: Transitive dependency via Hive or Oozie.

      • Oozie - Required: No. Applications: JobDesigner, Oozie. Notes: Oozie access through the REST API.

      • Hive - Required: No. Applications: Hive, HCatalog. Notes: Beeswax server uses the Hive client libraries.

      • WebHCat - Required: No. Applications: HCatalog, Pig. Notes: HCatalog and Pig use the WebHCat REST API.

    • Choose a Hue Server host machine in your cluster to deploy your Hue Server.

      You can deploy Hue on any host within your cluster. If your corporate firewall policies allow, you can also use a remote host machine as your Hue server. For evaluation or small cluster sizes, use the master install machine for PHD as your Hue server.

      The Hive client configuration (hive-site.xml file) needs to be populated on the host to run Hue. In the beeswax section of hue.ini, configure hive_conf_dir to point to the location of the Hive configuration:

      hive_conf_dir=/etc/hive/conf

    • Configure the firewall.

      • Verify that the host machines within your cluster can connect to each other over TCP.

      • The machines outside your cluster must be able to open TCP port 8000 on the Hue Server (or the configured Hue web HTTP port) to interact with the system.

Configure PHD

If you are using an Ambari-managed cluster, use Ambari to update the service configurations (core-site.xml, mapred-site.xml, webhcat-site.xml, and oozie-site.xml). Do not edit the configuration files directly; use Ambari to start and stop the services.

Before modifying the HDFS configuration files, stop the NameNode:

su -l $HDFS_USER -c "/usr/phd/current/hadoop-client/sbin/hadoop-daemon.sh stop namenode"

  • Modify the hdfs-site.xml file.

    On the NameNode, Secondary NameNode, and all the DataNodes, add the following property to the $HADOOP_CONF_DIR/hdfs-site.xml file, where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

    <property>
         <name>dfs.webhdfs.enabled</name>
         <value>true</value>
    </property>

  • Modify the core-site.xml file.

    On the NameNode, Secondary NameNode, and all the DataNodes, add the following properties to the $HADOOP_CONF_DIR/core-site.xml file, where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

    
    <property>
         <name>hadoop.proxyuser.hue.hosts</name>
         <value>*</value>
    </property>
    
    <property>
         <name>hadoop.proxyuser.hue.groups</name>
         <value>*</value>
    </property>
    
    <property>
         <name>hadoop.proxyuser.hcat.groups</name>
         <value>*</value>
    </property>
    
    <property>
         <name>hadoop.proxyuser.hcat.hosts</name>
         <value>*</value>
    </property>

  • Modify the webhcat-site.xml file. On the WebHCat Server host, add the following properties to the $WEBHCAT_CONF_DIR/webhcat-site.xml file, where $WEBHCAT_CONF_DIR is the directory for storing WebHCat configuration files. For example, /etc/webhcat/conf.

    vi $WEBHCAT_CONF_DIR/webhcat-site.xml 

    <property>
         <name>webhcat.proxyuser.hue.hosts</name>
         <value>*</value>
    </property>

    <property>
         <name>webhcat.proxyuser.hue.groups</name>
         <value>*</value>
    </property>

  • Modify the oozie-site.xml file. On the Oozie Server host, add the following properties to the $OOZIE_CONF_DIR/oozie-site.xml file, where $OOZIE_CONF_DIR is the directory for storing Oozie configuration files. For example, /etc/oozie/conf.

    vi $OOZIE_CONF_DIR/oozie-site.xml 

    <property>
         <name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
         <value>*</value>
    </property>

    <property>
         <name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
         <value>*</value>
    </property>

  • Modify the hive-site.xml file. On the HiveServer2 host, add the following properties to the $HIVE_CONF_DIR/hive-site.xml, where $HIVE_CONF_DIR is the directory for storing Hive configuration files. For example, /etc/hive/conf.

    vi $HIVE_CONF_DIR/hive-site.xml

    <property>
         <name>hive.server2.enable.impersonation</name>
         <value>true</value>
    </property>

Install Hue

    Execute the following command on all Hue Server host machines:
  • For RHEL/CentOS Linux:

    yum install hue

  • For SLES:

    zypper install hue

Configure Hue

Use the following commands to explore the configuration options for Hue.

  • To list all available configuration options:

    /usr/phd/current/hue/build/env/bin/hue config_help | less

  • To use multiple files to store your configuration:

    Hue loads and merges all of the files with extension .ini located in the /etc/hue/conf directory.
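
    For example (a minimal sketch; the file name and the single override shown are illustrative only), you could keep local overrides in a separate file such as /etc/hue/conf/zz-local.ini:

    [desktop]
    # Override only the HTTP port; every other setting still comes from hue.ini
    http_port=8888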

Configure Web Server

Use the following instructions to configure the Web server:

These configuration variables are under the [desktop] section in the hue.ini configuration file.

  • Specify the Hue HTTP Address.

    Use the following options to change the IP address and port of the existing Web Server for Hue (by default, CherryPy).

    # Webserver listens on this address and port
    http_host=0.0.0.0
    http_port=8000

    The default setting is port 8000 on all configured IP addresses.

  • Specify the Secret Key.

    To ensure that your session cookies are secure, enter a series of random characters (30 to 60 characters is recommended) as shown below:

    secret_key=jFE93j;2[290-eiw.KEiwN2s3['d;/.q[eIW^y#e=+Iei*@Mn<qW5o

  • Configure authentication.

    By default, the first user who logs in to Hue can choose any username and password and is granted administrator privileges. This user can then create other user and administrator accounts. User information is stored in the Django database of the Django backend.

  • (Optional.) Configure Hue for SSL.

    Install pyOpenSSL in order to configure Hue to serve over HTTPS. To install pyOpenSSL, from the root of your Hue installation path, complete the following instructions:

    • Execute the following command on the Hue Server:

      ./build/env/bin/easy_install pyOpenSSL

    • Configure Hue to use your private key. Add the following to hue.ini file:

      ssl_certificate=$PATH_To_CERTIFICATE
      ssl_private_key=$PATH_To_KEY
      ssl_cipher_list="DEFAULT:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2" (default)

Configure Hadoop

Use the following instructions to configure Hadoop:

These configuration variables are under the [hadoop] section in the hue.ini configuration file.

  • Configure an HDFS Cluster. Hue currently supports only one HDFS cluster. Ensure that you define the HDFS cluster under the [hadoop] [[hdfs_clusters]] [[[default]]] sub-section. Use the following variables to configure the HDFS cluster (an example hue.ini excerpt covering both the HDFS and YARN settings appears after this list):

    fs_defaultfs This is equivalent to fs.defaultFS (fs.default.name) in the Hadoop configuration. For example, hdfs://fqdn-namenode.example.com:8020.

    webhdfs_url The WebHDFS URL. The default value is the HTTP port on the NameNode. For example, http://fqdn-example.com:50070/webhdfs/v1.

  • Configure a YARN (MR2) Cluster. Hue currently supports only one YARN cluster. Ensure that you define the YARN cluster under the [hadoop] [[yarn_clusters]] [[[default]]] sub-section. Use the following variables to configure the YARN cluster:

    submit_to Set this property to true so that Hue submits jobs to this YARN cluster. Note that JobBrowser will not be able to show MR2 jobs.

    resourcemanager_api_url The URL of the ResourceManager API. For example, http://fqdn-resourcemanager.example.com:8088.

    proxy_api_url The URL of the ProxyServer API. For example, http://fqdn-resourcemanager.example.com:8088.

    history_server_api_url The URL of the HistoryServer API. For example, http://fqdn-historyserver.example.com:19888.

    node_manager_api_url The URL of the NodeManager API. For example, http://fqdn-resourcemanager.example.com:8042.
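
    Putting these together, a hue.ini [hadoop] excerpt using the example values above would look like the following (substitute your own hosts and ports):

    [hadoop]
      [[hdfs_clusters]]
        [[[default]]]
          # Equivalent to fs.defaultFS in the Hadoop configuration
          fs_defaultfs=hdfs://fqdn-namenode.example.com:8020
          # WebHDFS URL (the NameNode HTTP port)
          webhdfs_url=http://fqdn-example.com:50070/webhdfs/v1

      [[yarn_clusters]]
        [[[default]]]
          # Hue submits jobs to this YARN cluster
          submit_to=true
          resourcemanager_api_url=http://fqdn-resourcemanager.example.com:8088
          proxy_api_url=http://fqdn-resourcemanager.example.com:8088
          history_server_api_url=http://fqdn-historyserver.example.com:19888
          node_manager_api_url=http://fqdn-resourcemanager.example.com:8042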

Configure Beeswax

In the [beeswax] section of the configuration file, you can specify the following:

hive_server_host The host where the Hive server Thrift daemon is running. If Kerberos security is enabled, use the fully-qualified domain name (FQDN).

hive_server_port The port on which the HiveServer2 Thrift server runs. For example: 10000

hive_conf_dir Hive configuration directory, where hive-site.xml is located. For example: /etc/hive/conf

server_conn_timeout Timeout in seconds for Thrift calls to HiveServer2. For example: 120
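
Put together, a [beeswax] section using the values described above might look like the following sketch (the HiveServer2 host name is a placeholder):

    [beeswax]
      # FQDN of the host running the HiveServer2 Thrift service
      hive_server_host=fqdn-hiveserver2.example.com
      hive_server_port=10000
      hive_conf_dir=/etc/hive/conf
      server_conn_timeout=120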

Optional - Configure Hue to communicate with HiveServer2 over SSL

(Optional) Use the following changes to hue.ini to configure Hue to communicate with HiveServer2 over SSL:

[[ssl]]
# SSL communication enabled for this server.
enabled=false
# Path to Certificate Authority certificates.
cacerts=/etc/hue/cacerts.pem
# Path to the public certificate file.
cert=/etc/hue/cert.pem
# Choose whether Hue should validate certificates received from the server.
validate=true

Configure JobDesigner and Oozie

In the [liboozie] section of the configuration file, specify:

oozie_url

The URL of the Oozie service as specified by the OOZIE_URL environment variable for Oozie.
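
For example (the Oozie host name is a placeholder; 11000 is the default Oozie HTTP port):

    [liboozie]
      # Must match the OOZIE_URL value used by the Oozie service
      oozie_url=http://fqdn-oozie.example.com:11000/oozie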

Configure hive-site.xml

Modify the hive-site.xml file. On the HiveServer2 host, add the following properties to the $HIVE_CONF_DIR/hive-site.xml file, where $HIVE_CONF_DIR is the directory for storing Hive configuration files. For example, /etc/hive/conf.

vi $HIVE_CONF_DIR/hive-site.xml

<property>
    <name>hive.server2.enable.impersonation</name>
    <value>true</value>
</property>

Configure WebHCat

In the [hcatalog] section of the hue.ini configuration file, update the following property:

templeton_url The hostname or IP of the WebHCat server. For example: http://hostname:50111/templeton/v1/
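
For example, the [hcatalog] section might look like the following (the host name is a placeholder):

    [hcatalog]
      templeton_url=http://fqdn-webhcat.example.com:50111/templeton/v1/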

Start Hue

As a root user, execute the following command on the Hue Server:

/etc/init.d/hue start

This command starts several subprocesses corresponding to the different Hue components.

To stop Hue, execute the following command:

/etc/init.d/hue stop

To restart Hue, execute the following command:

/etc/init.d/hue restart

Validate Configuration

For any invalid configurations, Hue displays a red alert icon on the top navigation bar.

To view the current configuration of your Hue Server, select About > Configuration or navigate to http://hue-server.example.com:8000/dump_config.

Configuring Hue for an External Database

Using Hue with Oracle

    To set up Hue to use an Oracle database:
  • Create a new user in Oracle and grant privileges to manage this database using the Oracle database admin utility:

    
    # sqlplus sys/root as sysdba
    CREATE USER $HUEUSER IDENTIFIED BY $HUEPASSWORD default tablespace "USERS" temporary tablespace "TEMP";
    GRANT CREATE TABLE, CREATE SEQUENCE, CREATE PROCEDURE, CREATE TRIGGER, CREATE SESSION, UNLIMITED TABLESPACE TO $HUEUSER;

    Where $HUEUSER is the Hue user name and $HUEPASSWORD is the Hue user password.

  • Open the /etc/hue/conf/hue.ini file and edit the [[database]] section (modify for your Oracle setup).

    
    [[database]]
    engine=oracle
    host=$DATABASEIPADDRESSORHOSTNAME
    port=$PORT
    user=$HUEUSER
    password=$HUEPASSWORD
    name=$DBNAME

  • Install the Oracle instant clients and configure cx_Oracle.

    • Download and extract the instantclient-basic-linux and instantclient-sdk-linux Oracle clients from Oracle's download site.

    • Set your ORACLE_HOME environment variable to point to the newly downloaded client libraries.

    • Set your LD_LIBRARY_PATH environment variable to include ORACLE_HOME.

    • Create a symbolic link for the library expected by cx_Oracle (see the consolidated example at the end of this procedure):

      ln -sf libclntsh.so.11.1 libclntsh.so

  • Install or upgrade the Django Python library south:

    pip install south --upgrade

  • Synchronize Hue with the external database to create the schema and load the data:

    /usr/phd/current/hue/build/env/bin/hue syncdb --noinput

  • Start Hue.

    /etc/init.d/hue start
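
The following shell sketch consolidates the Oracle instant client environment setup from the steps above; the extraction directory /opt/oracle/instantclient_11_2 is only an assumed example path:

    # Assumed directory where both instant client archives were extracted
    export ORACLE_HOME=/opt/oracle/instantclient_11_2
    export LD_LIBRARY_PATH=$ORACLE_HOME:$LD_LIBRARY_PATH

    # Create the library name that cx_Oracle expects
    cd $ORACLE_HOME
    ln -sf libclntsh.so.11.1 libclntsh.so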

Using Hue with MySQL

    To set up Hue to use a MySQL database:
  • Create a new user in MySQL and grant it privileges to manage the database, using the MySQL database admin utility:

    # mysql -u root -p

    CREATE USER $HUEUSER IDENTIFIED BY '$HUEPASSWORD';
    GRANT ALL PRIVILEGES on *.* to '$HUEUSER'@'localhost' WITH GRANT OPTION;
    GRANT ALL on $HUEUSER.* to '$HUEUSER'@'localhost' IDENTIFIED BY '$HUEPASSWORD';
    FLUSH PRIVILEGES;

    where $HUEUSER is the Hue user name and $HUEPASSWORD is the Hue user password.

  • Create the MySQL database for Hue:

    # mysql -u root -p
    CREATE DATABASE $DBNAME;

  • Open the /etc/hue/conf/hue.ini file and edit the [[database]] section (modify for your MySQL setup).

    
    [[database]]
    engine=mysql
    host=$DATABASEIPADDRESSORHOSTNAME
    port=$PORT
    user=$HUEUSER
    password=$HUEPASSWORD
    name=$DBNAME

  • Synchronize Hue with the external database to create the schema and load the data:

    /usr/phd/current/hue/build/env/bin/hue syncdb --noinput

  • Start Hue:

    /etc/init.d/hue start

Using Hue with PostgreSQL

    To set up Hue to use a PostgreSQL database:
  • Create a database in PostgreSQL using the PostgreSQL database admin utility:

    sudo -u postgres psql
    CREATE DATABASE $DBNAME;

  • Exit the database admin utility.

    \q <enter>

  • Create the Hue user:

    sudo -u postgres psql -d $DBNAME
    CREATE USER $HUEUSER WITH PASSWORD '$HUEPASSWORD';

    where $HUEUSER is the Hue user name and $HUEPASSWORD is the Hue user password.

  • Open the /etc/hue/conf/hue.ini file and edit the [[database]] section (modify for your PostgreSQL setup).

    [[database]]
    engine=postgresql_psycopg2
    host=$DATABASEIPADDRESSORHOSTNAME
    port=$PORT
    user=$HUEUSER
    password=$HUEPASSWORD
    name=$DBNAME

  • Install the PostgreSQL database adapter for Python (psycopg2). For RHEL/CentOS Linux:

    yum install python-devel -y
    yum install postgresql-devel -y
    cd /usr/lib/hue
    source build/env/bin/activate
    pip install psycopg2 

  • Synchronize Hue with the external database to create the schema and load the data:

    /usr/phd/current/hue/build/env/bin/hue syncdb --noinput

  • Start Hue:

    /etc/init.d/hue start

Installing and Configuring Apache Spark

This section describes how to install and configure Apache Spark for PHD.

For more information about Spark on PHD (including how to install Spark using Ambari), see the Apache Spark Quick Start Guide.

Spark Prerequisites

Before installing Spark, make sure your cluster meets the following prerequisites:

Item | Prerequisite
Cluster Stack Version | PHD 3.0 or later stack
Components | Spark requires HDFS and YARN.

Installation

To install Spark, run the following commands as root:

  • For RHEL or CentOS:

    yum install spark
    yum install spark-python

  • For SLES:

    zypper install spark
    zypper install spark-python

When you install Spark, two directories will be created:

  • /usr/phd/current/spark-client for submitting Spark jobs

  • /usr/phd/current/spark-history for launching Spark master processes, such as the Spark history server

Configure Spark

To configure Spark, edit the following configuration files on all nodes that will run Spark jobs. These configuration files reside in the Spark client conf directory (/usr/phd/current/spark-client/conf) on each node.

  • java-opts

  • If you plan to use Hive with Spark, hive-site.xml

  • spark-env.sh

  • spark-defaults.conf

java-opts

Create a /etc/spark/conf/java-opts file for the Spark client. Add the following line to the file:

-Dphd.version=3.0.0.0-249 -Dstack.name=phd -Dstack.version=3.0.0.0-249

hive-site.xml

If you plan to use Hive with Spark, create a hive-site.xml file in the Spark client /conf directory. (Note: if you installed the Spark tech preview you can skip this step.)

In this file, add the hive.metastore.uris property and specify the Hive metastore as its value:

<property>
    <name>hive.metastore.uris</name>
    <value>thrift://c6401.ambari.apache.org:9083</value>
</property>

spark-env.sh

Create a spark-env.sh file in the Spark client /conf directory, and make sure the file has the following entries:


# Location where log files are stored (default: ${SPARK_HOME}/logs)
# This can be any directory where the spark user has R/W access
export SPARK_LOG_DIR=/var/log/spark

# Location of the pid file (default: /tmp)
# This can be any directory where the spark user has R/W access
export SPARK_PID_DIR=/var/run/spark

if [ -d "/etc/tez/conf/" ]; then
  export TEZ_CONF_DIR=/etc/tez/conf
else
  export TEZ_CONF_DIR=
fi

These settings are required for starting Spark services (for example, the History Server and the Thrift server). The user who starts Spark services needs read and write permissions to the log file and PID directory. By default these files are in the $SPARK_HOME directory, which is typically owned by root in an RPM installation.

We also recommend that you set HADOOP_CONF_DIR to the appropriate directory; for example:

export HADOOP_CONF_DIR=/etc/hadoop/conf

This will minimize the amount of work you need to do to set up environment variables before running Spark applications.

spark-defaults.conf

Edit the spark-defaults.conf file in the Spark client /conf directory. Make sure the following values are specified, including hostname and port. (Note: if you installed the tech preview, these will already be in the file.)


spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.driver.extraJavaOptions -Dphd.version=3.0.0.0-249 -Dstack.version=3.0.0.0-249 -Dstack.name=phd
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.am.extraJavaOptions -Dphd.version=3.0.0.0-249 -Dstack.version=3.0.0.0-249 -Dstack.name=phd

Create a Spark user

To use the Spark History Service, to run Hive queries as the spark user, or to run Spark jobs, the associated user must have sufficient HDFS access. One way of ensuring this is to add the user to the hdfs group.

The following example creates a spark user:

  • Create the spark user on all nodes. Add it to the hdfs group.

    useradd spark
    usermod -a -G hdfs spark

  • Create the spark user directory under /user/spark:

    sudo su $HDFS_USER
    hdfs dfs -mkdir -p /user/spark
    hdfs dfs -chown spark:spark /user/spark
    hdfs dfs -chmod -R 755 /user/spark

(Optional) Start Spark History Server

To validate Spark using the Spark History Service, start the service using the command:

/usr/phd/3.0.0.0-249/spark/sbin/start-history-server.sh

Validate Spark

To validate the Spark installation, run the following Spark jobs:

  • Spark Pi example

  • Spark WordCount example

For detailed instructions, see the Apache Spark Quick Start Guide.
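
For example, one common way to run the Spark Pi example on YARN is with spark-submit, as sketched below; the examples jar path and the resource settings are assumptions and may differ in your installation:

    cd /usr/phd/current/spark-client
    su spark -c "./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn-cluster --num-executors 1 --executor-memory 512m \
        ./lib/spark-examples*.jar 10"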

Installing Knox

Apache Knox Gateway (Apache Knox) is the Web/REST API Gateway solution for Hadoop. It provides a single access point for all of Hadoop resources over REST. It also enables the integration of enterprise identity management solutions and numerous perimeter security features for REST/HTTP access to Hadoop.

Install the Knox RPMs on the Knox server

    To install the Knox RPM, run the following command as root:
  • RHEL/CentOS Linux:

    sudo yum install knox

  • For SLES:

    zypper install knox

    The installation creates the following:
  • knox user in /etc/passwd

  • Knox installation directory: /usr/phd/current/knox, which is referred to as $gateway_home.

  • Knox configuration directory: /etc/knox/conf

  • Knox log directory: /var/log/knox

Set up and Validate the Knox Gateway Installation

Setting up and validating the Knox Gateway installation requires a fully operational Hadoop Cluster that can be accessed from the gateway. This section explains how to get the gateway up and running and then test access to your existing cluster with the minimal configuration.

Use the setup in this section for initial gateway testing. For detailed configuration instructions, see the Knox Gateway Administrator Guide.

    To set up the gateway and test access:
  • Set the master secret:

    su -l knox -c "$gateway_home/bin/gateway.sh setup"

    You are prompted for the master secret; enter the password at the prompts.

  • Start the gateway:

    su -l knox -c /usr/phd/current/knox-server/bin/gateway.sh start
    
    Starting Gateway succeeded with PID 1871.

    The gateway starts and the PID is stored in /var/run/knox.

  • Start the demo LDAP service that contains the guest user account for testing:

    su -l knox -c "/usr/phd/current/knox-server/bin/ldap.sh start"
    
    Starting LDAP succeeded with PID 1965.
    
    
    In a production environment, we recommend using Active Directory or OpenLDAP for authentication. For detailed instructions on configuring Knox Gateway, see Configuring Authentication in the Knox Gateway Administrator Guide.

  • Verify that the gateway and LDAP service are running:

    su -l knox -c "$gateway_home/bin/gateway.sh status"
    Gateway is running with PID 1871.
    su -l knox -c "$gateway_home/bin/ldap.sh status"
    LDAP is running with PID 1965.

  • Confirm access from gateway host to WebHDFS Service host using telnet:

    telnet $webhdfs_host $webhdfs_port

  • Update the WebHDFS host information:

    • Open the $gateway_home/conf/topologies/sandbox.xml file in an editor such as vi.

    • Find the service definition for WebHDFS and update it as follows:

      <service>
          <role>WEBHDFS</role>
          <url>http://$webhdfs_host:$webhdfs_port/webhdfs</url>
      </service>

      where $webhdfs_host and $webhdfs_port (default port is 50070) match your environment.

    • (Optional) Comment out the Sandbox specific hostmap information:

      <!-- REMOVE SANDBOX HOSTMAP PROVIDER<provider>
      <role>hostmap</role>
      <name>static</name>
      <enabled>false</enabled>
      <param><name>localhost</name><value>sandbox,sandbox.pivotal.io</value></param>
      </provider>-->

  • (Optional) Rename the Sandbox Topology Descriptor file to match the name of your cluster:

    mv $gateway_home/conf/topologies/sandbox.xml \
    $gateway_home/conf/topologies/cluster-name.xml

    The gateway is now configured to allow access to WebHDFS.

  • On an external client that has curl, enter the following command:

    curl -k -ssl3 -u guest:guest-password -X GET "https://$gateway_host:8443/gateway/sandbox/webhdfs/v1/?op=LISTSTATUS"

    where sandbox is the name of the cluster topology descriptor file that you created for testing. If you renamed it, then replace sandbox in the command above.

    $gateway_host is the Knox Gateway hostname. The status is returned.

Installing Ganglia (Deprecated)

This section describes installing and testing Ganglia, a system for monitoring and capturing metrics from services and components of the Hadoop cluster.

Install the Ganglia RPMs

On the host that you have chosen for the Ganglia server, install the server RPMs.

  • For RHEL/CentOS Linux:

    yum install ganglia-gmond-3.5.0-99 ganglia-gmetad-3.5.0-99 ganglia-web-3.5.7-99

  • For SLES:

    zypper install ganglia-gmond-3.5.0-99 ganglia-gmetad-3.5.0-99 ganglia-web-3.5.7-99

On each host in the cluster, install the client RPMs:

  • For RHEL/CentOS Linux:

    yum install ganglia-gmond-3.5.0-99

  • For SLES:

    zypper install ganglia-gmond-3.5.0-99

Install the Configuration Files

There are several configuration files that need to be set up for Ganglia.

Extract the Ganglia Configuration Files

From the file you downloaded in Download Companion Files, open the configuration_files.zip and copy the files in the ganglia folder to a temporary directory. The ganglia folder contains two sub-folders, objects and scripts.

Copy the Configuration Files

    On each host in the cluster:
  • Grant execute permissions on the following scripts:

    • /usr/libexec/phd/ganglia/setupGanglia.sh

    • /usr/libexec/phd/ganglia/startRrdcached.sh

  • Change permissions on the RRD base directory to grant access to nobody:

    chown -R nobody:nobody $RRDCACHED_BASE_DIR
    chmod -R 755 $RRDCACHED_BASE_DIR

  • Create the directory for the objects folder:

    mkdir -p /usr/libexec/phd/ganglia

  • Copy the object files:

    cp <tmp-directory>/ganglia/objects/*.* /usr/libexec/phd/ganglia/

  • Copy the Ganglia monitoring init script to init.d:

    cp <tmp-directory>/ganglia/scripts/phd-gmond /etc/init.d

  • On the Ganglia Server Host, copy the entire contents of the scripts folder to init.d:

    cp -R <tmp-directory>/ganglia/scripts/* /etc/init.d/

Set Up Ganglia Hosts

  • On the Ganglia server, to configure the gmond collector:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDHistoryServer -m
    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDNameNode -m
    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDSlaves -m
    /usr/libexec/phd/ganglia/setupGanglia.sh -t

  • If HBase is installed, on the HBase Master:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDHBaseMaster -m

  • On the NameNode and SecondaryNameNode servers, to configure the gmond emitters:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDNameNode

  • On the ResourceManager server, to configure the gmond emitters:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDResourceManager

  • On all hosts, to configure the gmond emitters:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDSlaves

  • If HBase is installed, on the HBase Master, to configure the gmond emitter:

    /usr/libexec/phd/ganglia/setupGanglia.sh -c PHDHBaseMaster

Set Up Configurations

  • On the Ganglia server, use a text editor to open the following master configuration files:

    /etc/ganglia/phd/PHDNameNode/conf.d/gmond.master.conf
    /etc/ganglia/phd/PHDHistoryServer/conf.d/gmond.master.conf
    /etc/ganglia/phd/PHDResourceManager/conf.d/gmond.slave.conf
    /etc/ganglia/phd/PHDSlaves/conf.d/gmond.master.conf

    And if HBase is installed:

    /etc/ganglia/phd/PHDHBaseMaster/conf.d/gmond.master.conf

  • Confirm that the “bind” property in each of these files is set to the Ganglia server hostname.

  • On the Ganglia server, use a text editor to open the gmetad configuration file:

    /etc/ganglia/phd/gmetad.conf

  • Confirm the "data_source" properties are set to the Ganglia server hostname. For example:

    data_source "PHDSlaves" my.ganglia.server.hostname:8660 data_source "PHDNameNode" my.ganglia.server.hostname:8661 data_source "PHDResourceManager" my.ganglia.server.hostname:8664 data_source "PHDHistoryServer" my.ganglia.server.hostname:8666

    And if HBase is installed:

    data_source "PHDHBaseMaster" my.ganglia.server.hostname:8663

  • On all hosts except the Ganglia server, use a text editor to open the slave configuration files:

    /etc/ganglia/phd/PHDNameNode/conf.d/gmond.slave.conf
    /etc/ganglia/phd/PHDHistoryServer/conf.d/gmond.slave.conf
    /etc/ganglia/phd/PHDResourceManager/conf.d/gmond.slave.conf
    /etc/ganglia/phd/PHDSlaves/conf.d/gmond.slave.conf

    And if HBase is installed:

    /etc/ganglia/phd/PHDHBaseMaster/conf.d/gmond.slave.conf

  • Confirm that the host property is set to the Ganglia Server hostname.

Set Up Hadoop Metrics

    On each host in the cluster:
  • Stop the Hadoop services.

  • Change to the Hadoop configuration directory.

    cd $HADOOP_CONF_DIR

  • Copy the Ganglia metrics properties file into place.

    mv hadoop-metrics2.properties-GANGLIA hadoop-metrics2.properties

  • Edit the metrics properties file and set the Ganglia server hostname.

    namenode.sink.ganglia.servers=my.ganglia.server.hostname:8661
    datanode.sink.ganglia.servers=my.ganglia.server.hostname:8660
    resourcemanager.sink.ganglia.servers=my.ganglia.server.hostname:8664
    nodemanager.sink.ganglia.servers=my.ganglia.server.hostname:8660
    historyserver.sink.ganglia.servers=my.ganglia.server.hostname:8666
    maptask.sink.ganglia.servers=my.ganglia.server.hostname:8660
    reducetask.sink.ganglia.servers=my.ganglia.server.hostname:8660

  • Restart the Hadoop services.

Validate the Installation

Use these steps to validate your installation.

Start the Ganglia Server

On the Ganglia server:

service httpd restart
/etc/init.d/phd-gmetad start

Start Ganglia Monitoring on All Hosts

On all hosts:

/etc/init.d/phd-gmond start

Confirm that Ganglia is Running

Browse to the Ganglia server:

http://{ganglia.server}/ganglia

Installing Nagios (Deprecated)

This section describes installing and testing Nagios, a system that monitors Hadoop cluster components and issues alerts on warning and critical conditions.

Install the Nagios RPMs

On the host you have chosen for the Nagios server, install the RPMs:

  • For RHEL and CentOS:

    
    yum -y install net-snmp net-snmp-utils php-pecl-json
    yum -y install wget httpd php net-snmp-perl perl-Net-SNMP fping nagios nagios-plugins nagios-www

  • For SLES:

    
    zypper -n --no-gpg-checks install net-snmp
    zypper -n --no-gpg-checks install wget apache2 php php-curl perl-SNMP perl-Net-SNMP fping nagios nagios-plugins nagios-www

Install the Configuration Files

There are several configuration files that must be set up for Nagios.

Extract the Nagios Configuration Files

From the file you downloaded in Download Companion Files, open the configuration_files.zip and copy the files in the nagios folder to a temporary directory. The nagios folder contains two sub-folders, objects and plugins.

Create the Nagios Directories

  • Create the following Nagios directories:

    mkdir -p /var/nagios /var/nagios/rw /var/log/nagios /var/log/nagios/spool/checkresults /var/run/nagios

  • Change ownership on those directories to the Nagios user:

    chown -R nagios:nagios /var/nagios /var/nagios/rw /var/log/nagios /var/log/nagios/spool/checkresults /var/run/nagios

Copy the Configuration Files

  • Copy the contents of the objects folder into place:

    cp <tmp-directory>/nagios/objects/*.* /etc/nagios/objects/

  • Copy the contents of the plugins folder into place:

    cp <tmp-directory>/nagios/plugins/*.* /usr/lib64/nagios/plugins/

Set the Nagios Admin Password

  • Choose a Nagios administrator password, for example, “admin”.

  • Set the password. Use the following command:

    htpasswd -c -b /etc/nagios/htpasswd.users nagiosadmin admin

Set the Nagios Admin Email Contact Address

  • Open /etc/nagios/objects/contacts.cfg with a text editor.

  • Change the nagios@localhost value to the admin email address so it can receive alerts.

Register the Hadoop Configuration Files

  • Open /etc/nagios/nagios.cfg with a text editor.

  • In the section OBJECT CONFIGURATION FILE(S), add the following:

    
    # Definitions for hadoop servers
    cfg_file=/etc/nagios/objects/hadoop-commands.cfg
    cfg_file=/etc/nagios/objects/hadoop-hosts.cfg
    cfg_file=/etc/nagios/objects/hadoop-hostgroups.cfg
    cfg_file=/etc/nagios/objects/hadoop-services.cfg
    cfg_file=/etc/nagios/objects/hadoop-servicegroups.cfg

  • Change the command-file directive to /var/nagios/rw/nagios.cmd:

    command_file=/var/nagios/rw/nagios.cmd

Set Hosts

  • Open /etc/nagios/objects/hadoop-hosts.cfg with a text editor.

  • Create a "define host { … }" entry for each host in your cluster using the following format:

    
    define host {
        alias @HOST@
        host_name @HOST@
        use linux-server
        address @HOST@
        check_interval 0.25
        retry_interval 0.25
        max_check_attempts 4
        notifications_enabled 1
        first_notification_delay 0 # Send notification soon after
                                   # change in the hard state
        notification_interval 0    # Send the notification once
        notification_options   d,u,r
      }

  • Replace "@HOST@" with the hostname.

Set Host Groups

  • Open /etc/nagios/objects/hadoop-hostgroups.cfg with a text editor.

  • Create host groups based on all the hosts and services you have installed in your cluster. Each host group entry should follow this format (a filled-in example appears after the tables below):

    define hostgroup {
           hostgroup_name @NAME@
           alias          @ALIAS@
           members        @MEMBERS@
      }

    The parameters (such as @NAME@) are defined in the following table.

    Parameter | Description
    @NAME@ | The host group name
    @ALIAS@ | The host group alias
    @MEMBERS@ | A comma-separated list of hosts in the group

  • The following table lists the core and monitoring host groups:

    Service | Component | Name | Alias | Members
    All servers in the cluster | - | all-servers | All Servers | List all servers in the cluster
    HDFS | NameNode | namenode | namenode | The NameNode host
    HDFS | SecondaryNameNode | snamenode | snamenode | The Secondary NameNode host
    MapReduce | JobTracker | jobtracker | jobtracker | The Job Tracker host
    HDFS, MapReduce | Slaves | slaves | slaves | List all hosts running DataNode and TaskTrackers
    Nagios | - | nagios-server | nagios-server | The Nagios server host
    Ganglia | - | ganglia-server | ganglia-server | The Ganglia server host

  • The following table lists the ecosystem project host groups:

    Service | Component | Name | Alias | Members
    HBase | Master | hbasemaster | hbasemaster | List the master server
    HBase | Region | region-servers | region-servers | List all region servers
    ZooKeeper | - | zookeeper-servers | zookeeper-servers | List all ZooKeeper servers
    Oozie | - | oozie-server | oozie-server | The Oozie server
    Hive | - | hiveserver | hiveserver | The Hive metastore server
    WebHCat | - | webhcat-server | webhcat-server | The WebHCat server
    Templeton | - | templeton-server | templeton-server | The Templeton server
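
    As a filled-in example of the host group format shown earlier, entries for the namenode and slaves groups might look like this (the host names are hypothetical):

    define hostgroup {
           hostgroup_name namenode
           alias          namenode
           members        namenode.example.com
      }

    define hostgroup {
           hostgroup_name slaves
           alias          slaves
           members        datanode1.example.com,datanode2.example.com
      }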

Set Services

  • Open /etc/nagios/objects/hadoop-services.cfg with a text editor.

    This file contains service definitions for the following services: Ganglia, HBase (Master and Region), ZooKeeper, Hive, Templeton and Oozie

  • Remove any service definitions for services you have not installed.

  • Replace the parameters @NAGIOS_BIN@ and @STATUS_DAT@ based on the operating system.

    • For RHEL and CentOS:

      @STATUS_DAT@ = /var/nagios/status.dat
      @NAGIOS_BIN@ = /usr/bin/nagios

    • For SLES:

      @STATUS_DAT@ = /var/lib/nagios/status.dat
      @NAGIOS_BIN@ = /usr/sbin/nagios

  • If you have installed Hive or Oozie services, replace the parameter @JAVA_HOME@ with the path to the Java home. For example, /usr/java/default.

Set Status

  • Open /etc/nagios/objects/hadoop-commands.cfg with a text editor.

  • Replace the @STATUS_DAT@ parameter with the location of the Nagios status file. File location depends on your operating system.

    • For RHEL and CentOS:

      /var/nagios/status.dat

    • For SLES:

      /var/lib/nagios/status.dat

Add Templeton Status and Check TCP Wrapper Commands

  • Open /etc/nagios/objects/hadoop-commands.cfg with a text editor.

  • Add the following commands:

    define command{
            command_name    check_templeton_status
            command_line    $USER1$/check_wrapper.sh $USER1$/check_templeton_status.sh $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$
           }
    
    define command{
            command_name    check_tcp_wrapper
            command_line    $USER1$/check_wrapper.sh $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
           }

Validate the Installation

Follow these steps to validate your installation.

  • Validate the Nagios installation:

    nagios -v /etc/nagios/nagios.cfg

  • Start the Nagios server and httpd:

    /etc/init.d/nagios start
    /etc/init.d/httpd start

  • Confirm that the Nagios server is running:

    /etc/init.d/nagios status

    This should return:

    nagios (pid #) is running...

  • To test Nagios Services, run the following command:

    /usr/lib64/nagios/plugins/check_hdfs_capacity.php -h namenode_hostname -p 50070 -w 80% -c 90%

    This should return:

    OK: DFSUsedGB:<some#>, DFSTotalGB:<some#>

  • To test Nagios Access, browse to the Nagios server.

    http://<nagios.server>/nagios

    Log in using the Nagios admin username (nagiosadmin) and password (see Set the Nagios Admin Password). Click on hosts to check that all hosts in the cluster are listed. Click on services to check that all of the Hadoop services are listed for each host.

  • Test Nagios alerts.

    • Log in to one of your cluster DataNodes.

    • Stop the TaskTracker service:

      su -l mapred -c "/usr/phd/current/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf stop tasktracker"

    • Validate that you received an alert at the admin email address, and that you have critical state showing on the console.

    • Start the TaskTracker service.

      su -l mapred -c "/usr/phd/current/hadoop/bin/hadoop-daemon.sh --config /etc/hadoop/conf start tasktracker"

    • Validate that you received an alert at the admin email address, and that critical state is cleared on the console.

Setting Up Security for Manual Installs

This section provides information on enabling security for a manually installed version of PHD.

Preparing Kerberos

This subsection provides information on setting up Kerberos for a PHD installation.

Kerberos Overview

To create secure communication among its various components, PHD uses Kerberos. Kerberos is a third-party authentication mechanism, in which users and the services that users wish to access rely on the Kerberos server to authenticate each to the other. This mechanism also supports encrypting all traffic between the user and the service.

    The Kerberos server itself is known as the Key Distribution Center, or KDC. At a high level, it has three parts:
  • A database of the users and services (known as principals) that it knows about and their respective Kerberos passwords

  • An authentication server (AS) which performs the initial authentication and issues a Ticket Granting Ticket (TGT)

  • A Ticket Granting Server (TGS) that issues subsequent service tickets based on the initial TGT.

A user principal requests authentication from the AS. The AS returns a TGT that is encrypted using the user principal's Kerberos password, which is known only to the user principal and the AS. The user principal decrypts the TGT locally using its Kerberos password, and from that point forward, until the ticket expires, the user principal can use the TGT to get service tickets from the TGS.

Because a service principal cannot provide a password each time to decrypt the TGT, it uses a special file, called a keytab, which contains its authentication credentials.

The service tickets allow the principal to access various services. The set of hosts, users, and services over which the Kerberos server has control is called a realm.

Installing and Configuring the KDC

To use Kerberos with PHD, either use an existing KDC or install a new one for PHD only. The following gives a very high-level description of the installation process. For more information, see the RHEL, CentOS, or SLES documentation.

  • Install the KDC server:

    • On RHEL, CentOS Linux, run:

      yum install krb5-server krb5-libs krb5-auth-dialog krb5-workstation

    • On SLES, run:

      zypper install krb5 krb5-server krb5-client

    When the server is installed, you must edit the two main configuration files.
  • Update the KDC configuration by replacing EXAMPLE.COM with your domain and kerberos.example.com with the FQDN of the KDC host. By default, the configuration files are located as follows:

    • On RHEL, CentOS Linux:

      • /etc/krb5.conf

      • /var/kerberos/krb5kdc/kdc.conf

    • On SLES:

      • /etc/krb5.conf

      • /var/lib/kerberos/krb5kdc/kdc.conf

  • Copy the updated krb5.conf to every cluster node.
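
    For reference, a minimal krb5.conf realm configuration after these replacements might look like the following, where EXAMPLE.COM and kerberos.example.com stand in for your own realm and KDC host:

    [libdefaults]
      default_realm = EXAMPLE.COM

    [realms]
      EXAMPLE.COM = {
        kdc = kerberos.example.com
        admin_server = kerberos.example.com
      }

    [domain_realm]
      .example.com = EXAMPLE.COM
      example.com = EXAMPLE.COM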

Creating the Database and Setting Up the First Administrator

  • Use the utility kdb5_util to create the Kerberos database:

    • On RHEL, CentOS Linux:

      /usr/sbin/kdb5_util create -s

    • On SLES:

      kdb5_util create -s

  • Set up the KDC Access Control List (ACL):

    • On RHEL, CentOS Linux, add administrators to /var/kerberos/krb5kdc/kadm5.acl.

    • On SLES, add administrators to /var/lib/kerberos/krb5kdc/kadm5.acl.

  • Restart kadmin for the change to take effect.

  • Create the first user principal. This must be done at a terminal window on the KDC machine itself, while you are logged in as root. Notice the .local extension: normal kadmin usage requires that a principal with appropriate access already exist, but the kadmin.local command can be used even if no principals exist.

    /usr/sbin/kadmin.local -q "addprinc $username/admin"

    Now this user can create additional principals either on the KDC machine or through the network. The following instructions assume you are using the KDC machine.

  • On the KDC, start Kerberos:

    • On RHEL, CentOS Linux:

      /sbin/service krb5kdc start
      /sbin/service kadmin start

    • On SLES:

      rckrb5kdc start
      rckadmind start

Creating Service Principals and Keytab Files for PHD

Each service in PHD must have its own principal. As services do not log in with a password to acquire their tickets, their principal's authentication credentials are stored in a keytab file, which is extracted from the Kerberos database and stored locally with the service principal.

First you must create the principal, using mandatory naming conventions. Then you must create the keytab file with that principal's information, and copy the file to the keytab directory on the appropriate service host.

  • Create a service principal using the kadmin utility:

    kadmin: addprinc -randkey $principal_name/$service-host-FQDN@$hadoop.realm

    You must have a principal with administrative permissions to use this command. The -randkey option is used to generate the password.

    In the example, each service principal's name has the fully qualified domain name of the host on which it is running appended to it. This provides a unique principal name for services that run on multiple hosts, like DataNodes and TaskTrackers. The addition of the hostname serves to distinguish, for example, a request from DataNode A from a request from DataNode B. This is important for two reasons:

    • If the Kerberos credentials for one DataNode are compromised, it does not automatically lead to all DataNodes being compromised

    • If multiple DataNodes have exactly the same principal and are simultaneously connecting to the NameNode, and if the Kerberos authenticator being sent happens to have same timestamp, then the authentication would be rejected as a replay request.

      The $principal_name part of the name must match the values in the following table.

      The NameNode, Secondary NameNode, and Oozie require two principals each.

      If you are configuring High Availability (HA) for a Quorum-based NameNode, you must also generate a principal (jn/$FQDN) and keytab (jn.service.keytab) for each JournalNode. JournalNode also requires the keytab for its HTTP service. If the JournalNode is deployed on the same host as a NameNode, the same keytab file (spnego.service.keytab) can be used for both. In addition, HA requires two NameNodes. Both the active and standby NameNodes require their own principal and keytab files. The service principals of the two NameNodes can share the same name, specified with the dfs.namenode.kerberos.principal property in hdfs-site.xml, but the NameNodes still have different fully qualified domain names.

      Service Principals

      Service | Component | Mandatory Principal Name
      HDFS | NameNode | nn/$FQDN
      HDFS | NameNode HTTP | HTTP/$FQDN
      HDFS | SecondaryNameNode | nn/$FQDN
      HDFS | SecondaryNameNode HTTP | HTTP/$FQDN
      HDFS | DataNode | dn/$FQDN
      MR2 | History Server | jhs/$FQDN
      MR2 | History Server HTTP | HTTP/$FQDN
      YARN | ResourceManager | rm/$FQDN
      YARN | NodeManager | nm/$FQDN
      Oozie | Oozie Server | oozie/$FQDN
      Oozie | Oozie HTTP | HTTP/$FQDN
      Hive | Hive Metastore, HiveServer2 | hive/$FQDN
      Hive | WebHCat | HTTP/$FQDN
      HBase | MasterServer | hbase/$FQDN
      HBase | RegionServer | hbase/$FQDN
      Hue | Hue Interface | hue/$FQDN
      ZooKeeper | ZooKeeper | zookeeper/$FQDN
      Nagios Server | Nagios | nagios/$FQDN
      JournalNode Server[2] | JournalNode | jn/$FQDN
      Gateway | Knox | knox/$FQDN

      For example: To create the principal for a DataNode service, issue this command:

      kadmin: addprinc -randkey dn/$datanode-host@$hadoop.realm

  • Extract the related keytab file and place it in the keytab directory (by default /etc/krb5.keytab) of the respective components:

    kadmin: xst -k $keytab_file_name $principal_name/fully.qualified.domain.name

    You must use the mandatory names for the $keytab_file_name variable shown in the following table.

    Service Keytab File Names

    Component | Principal Name | Mandatory Keytab File Name
    NameNode | nn/$FQDN | nn.service.keytab
    NameNode HTTP | HTTP/$FQDN | spnego.service.keytab
    SecondaryNameNode | nn/$FQDN | nn.service.keytab
    SecondaryNameNode HTTP | HTTP/$FQDN | spnego.service.keytab
    DataNode | dn/$FQDN | dn.service.keytab
    MR2 History Server | jhs/$FQDN | nm.service.keytab
    MR2 History Server HTTP | HTTP/$FQDN | spnego.service.keytab
    YARN | rm/$FQDN | rm.service.keytab
    YARN | nm/$FQDN | nm.service.keytab
    Oozie Server | oozie/$FQDN | oozie.service.keytab
    Oozie HTTP | HTTP/$FQDN | spnego.service.keytab
    Hive Metastore, HiveServer2 | hive/$FQDN | hive.service.keytab
    WebHCat | HTTP/$FQDN | spnego.service.keytab
    HBase Master Server | hbase/$FQDN | hbase.service.keytab
    HBase RegionServer | hbase/$FQDN | hbase.service.keytab
    Hue | hue/$FQDN | hue.service.keytab
    ZooKeeper | zookeeper/$FQDN | zk.service.keytab
    Nagios Server | nagios/$FQDN | nagios.service.keytab
    Journal Server[3] | jn/$FQDN | jn.service.keytab
    Knox Gateway[4] | knox/$FQDN | knox.service.keytab

    For example: To create the keytab files for the NameNode, issue these commands:

    kadmin: xst -k nn.service.keytab nn/$namenode-host
    kadmin: xst -k spnego.service.keytab HTTP/$namenode-host

    When you have created the keytab files, copy them to the keytab directory of the respective service hosts.

  • Verify that the correct keytab files and principals are associated with the correct service using the klist command. For example, on the NameNode:

    klist -k -t /etc/security/nn.service.keytab

    Do this on each respective service in your cluster.

Configuring PHD

This section provides information on configuring PHD for Kerberos.

Configuration Overview

    Configuring PHD for Kerberos has two parts:
  • Creating a mapping between service principals and UNIX usernames.

    Hadoop uses group memberships of users at various places, such as to determine group ownership for files or for access control.

    A user is mapped to the groups it belongs to using an implementation of the GroupMappingServiceProvider interface. The implementation is pluggable and is configured in core-site.xml.

    By default Hadoop uses ShellBasedUnixGroupsMapping, which is an implementation of GroupMappingServiceProvider. It fetches the group membership for a username by executing a UNIX shell command. In secure clusters, since the usernames are actually Kerberos principals, ShellBasedUnixGroupsMapping will work only if the Kerberos principals map to valid UNIX usernames. Hadoop provides a feature that lets administrators specify mapping rules to map a Kerberos principal to a local UNIX username.

  • Adding information to three main service configuration files.

    There are several entries, not present by default, that must be added to the three main service configuration files to enable security on PHD.

Creating Mappings Between Principals and UNIX Usernames

PHD uses a rule-based system to create mappings between service principals and their related UNIX usernames. The rules are specified in the core-site.xml configuration file as the value to the optional key hadoop.security.auth_to_local.

The default rule is simply named DEFAULT. It translates all principals in your default domain to their first component. For example, myusername@APACHE.ORG and myusername/admin@APACHE.ORG both become myusername, assuming your default domain is APACHE.ORG.

Creating Rules

    To accommodate more complex translations, you can create a hierarchical set of rules to add to the default. Each rule is divided into three parts: base, filter, and substitution.
  • The Base

    The base begins with the number of components in the principal name (excluding the realm), followed by a colon, and the pattern for building the username from the sections of the principal name. In the pattern section $0 translates to the realm, $1 translates to the first component and $2 to the second component.

    For example:

    [1:$1@$0] translates myusername@APACHE.ORG to myusername@APACHE.ORG
    [2:$1] translates myusername/admin@APACHE.ORG to myusername
    [2:$1%$2] translates myusername/admin@APACHE.ORG to myusername%admin

  • The Filter

    The filter consists of a regular expression (regex) in parentheses. It must match the generated string for the rule to apply.

    For example:

    (.*%admin) matches any string that ends in %admin
    (.*@SOME.DOMAIN) matches any string that ends in @SOME.DOMAIN

  • The Substitution

    The substitution is a sed rule that translates a regex into a fixed string. For example:

    s/@ACME\.COM// removes the first instance of @ACME.COM
    s/@[A-Z]*\.COM// removes the first instance of @ followed by a name followed by COM
    s/X/Y/g replaces all of the X's in the name with Y

Examples

  • If your default realm was APACHE.ORG, but you also wanted to take all principals from ACME.COM that had a single component (for example, joe@ACME.COM), the following rule would do this:

    RULE:[1:$1@$0](.*@ACME.COM)s/@.*//
    DEFAULT

  • To translate names with a second component, you could use these rules:

    RULE:[1:$1@$0](.*@ACME.COM)s/@.*//
    RULE:[2:$1@$0](.*@ACME.COM)s/@.*//
    DEFAULT

  • To treat all principals from APACHE.ORG with the extension /admin as admin, your rules would look like this:

    RULE:[2:$1%$2@$0](.*%admin@APACHE.ORG)s/.*/admin/
    DEFAULT

Adding Security Information to Configuration Files

To enable security on PHD, you must add optional information to various configuration files.

Before you begin, set JSVC_HOME in hadoop-env.sh.

  • For RHEL/CentOS Linux:

    export JSVC_HOME=/usr/libexec/bigtop-utils

  • For SLES:

    export JSVC_HOME=/usr/phd/current/bigtop-utils

core-site.xml

To the core-site.xml file on every host in your cluster, you must add the following information:

General core-site.xml, Knox, and Hue

hadoop.security.authentication
  Value: kerberos
  Description: Set the authentication type for the cluster. Valid values are: simple or kerberos.

hadoop.rpc.protection
  Value: authentication; integrity; privacy
  Description: This is an [OPTIONAL] setting. If not set, defaults to authentication.
  authentication = authentication only; the client and server mutually authenticate during connection setup.
  integrity = authentication and integrity; guarantees the integrity of data exchanged between client and server as well as authentication.
  privacy = authentication, integrity, and confidentiality; guarantees that data exchanged between client and server is encrypted and is not readable by a "man in the middle".

hadoop.security.authorization
  Value: true
  Description: Enable authorization for different protocols.

hadoop.security.auth_to_local
  Value: the mapping rules, for example:
    RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/mapred/
    RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
    RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/hbase/
    RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/hbase/
    DEFAULT
  Description: The mapping from Kerberos principal names to local OS user names. See Creating Mappings Between Principals and UNIX Usernames for more information.

Following is the XML for these entries:


 <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
        <description> Set the authentication for the cluster.
        Valid values are: simple or kerberos.</description>
</property>

<property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
        <description>Enable authorization for different protocols.</description>
</property>

<property>
        <name>hadoop.security.auth_to_local</name>
        <value>
        RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/mapred/
        RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
        RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/hbase/
        RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/hbase/
        DEFAULT
        </value>
        <description>The mapping from kerberos principal names
        to local OS user names.</description>
</property>

When using the Knox Gateway, add the following to the core-site.xml file on the master node hosts in your cluster:

hadoop.proxyuser.knox.groups
  Value: users
  Description: Grants proxy privileges for the knox user.

hadoop.proxyuser.knox.hosts
  Value: $knox_host_FQDN
  Description: Identifies the Knox Gateway host.

When using Hue, add the following to the core-site.xml file on the master node hosts in your cluster:

hue.kerberos.principal.shortname
  Value: hue
  Description: Group to which all the hue users belong. Use the wild card character to select multiple groups, for example cli*.

hadoop.proxyuser.hue.groups
  Value: *
  Description: Group to which all the hue users belong. Use the wild card character to select multiple groups, for example cli*.

hadoop.proxyuser.hue.hosts
  Value: *

hadoop.proxyuser.knox.hosts
  Value: $hue_host_FQDN
  Description: Identifies the Knox Gateway host.

Following is the XML for both Knox and Hue settings:


<property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
        <description>Set the authentication for the cluster.
        Valid values are: simple or kerberos.</description>
</property>

<property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
        <description>Enable authorization for different protocols.
        </description>
</property>

<property>
        <name>hadoop.security.auth_to_local</name>
        <value>
        RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/mapred/
        RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
        RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/hbase/
        RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/hbase/
        DEFAULT</value> <description>The mapping from kerberos principal names
        to local OS user names.</description>
</property>

<property>
    <name>hadoop.proxyuser.knox.groups</name>
    <value>users</value>
</property>

<property>
    <name>hadoop.proxyuser.knox.hosts</name>
    <value>Knox.EXAMPLE.COM</value>
</property>               

hdfs-site.xml

To the hdfs-site.xml file on every host in your cluster, you must add the following information:

dfs.permissions.enabled
  Value: true
  Description: If true, permission checking in HDFS is enabled. If false, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner, or group of files or directories.

dfs.permissions.supergroup
  Value: hdfs
  Description: The name of the group of super-users.

dfs.block.access.token.enable
  Value: true
  Description: If true, access tokens are used as capabilities for accessing DataNodes. If false, no access tokens are checked on accessing DataNodes.

dfs.namenode.kerberos.principal
  Value: nn/_HOST@EXAMPLE.COM
  Description: Kerberos principal name for the NameNode.

dfs.secondary.namenode.kerberos.principal
  Value: nn/_HOST@EXAMPLE.COM
  Description: Kerberos principal name for the secondary NameNode.

dfs.web.authentication.kerberos.principal
  Value: HTTP/_HOST@EXAMPLE.COM
  Description: The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. The HTTP Kerberos principal MUST start with 'HTTP/' per the Kerberos HTTP SPNEGO specification.

dfs.web.authentication.kerberos.keytab
  Value: /etc/security/keytabs/spnego.service.keytab
  Description: The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.

dfs.datanode.kerberos.principal
  Value: dn/_HOST@EXAMPLE.COM
  Description: The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real host name.

dfs.namenode.keytab.file
  Value: /etc/security/keytabs/nn.service.keytab
  Description: Combined keytab file containing the NameNode service and host principals.

dfs.secondary.namenode.keytab.file
  Value: /etc/security/keytabs/nn.service.keytab
  Description: Combined keytab file containing the NameNode service and host principals.

dfs.datanode.keytab.file
  Value: /etc/security/keytabs/dn.service.keytab
  Description: The filename of the keytab file for the DataNode.

dfs.https.port
  Value: 50470
  Description: The HTTPS port to which the NameNode binds.

dfs.namenode.https-address
  Value (example): ip-10-111-59-170.ec2.internal:50470
  Description: The HTTPS address to which the NameNode binds.

dfs.datanode.data.dir.perm
  Value: 750
  Description: The permissions that must be set on the dfs.data.dir directories. The DataNode will not come up if all existing dfs.data.dir directories do not have this setting. If the directories do not exist, they will be created with this permission.

dfs.cluster.administrators
  Value: hdfs
  Description: ACL for who can view the default servlets in HDFS.

dfs.namenode.kerberos.internal.spnego.principal
  Value: ${dfs.web.authentication.kerberos.principal}

dfs.secondary.namenode.kerberos.internal.spnego.principal
  Value: ${dfs.web.authentication.kerberos.principal}

Following is the XML for these entries:


<property>
        <name>dfs.permissions</name>
        <value>true</value>
        <description> If "true", enable permission checking in
        HDFS. If "false", permission checking is turned
        off, but all other behavior is
        unchanged. Switching from one parameter value to the other does
        not change the mode, owner or group of files or
        directories. </description>
</property>

<property>
        <name>dfs.permissions.supergroup</name>
        <value>hdfs</value>
        <description>The name of the group of
        super-users.</description>
</property>

<property>
        <name>dfs.namenode.handler.count</name>
        <value>100</value>
        <description>Added to grow Queue size so that more
        client connections are allowed</description>
</property>

<property>
        <name>ipc.server.max.response.size</name>
        <value>5242880</value>
</property>

<property>
        <name>dfs.block.access.token.enable</name>
        <value>true</value>
        <description> If "true", access tokens are used as capabilities
        for accessing datanodes. If "false", no access tokens are checked on
        accessing datanodes. </description>
</property>

<property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>nn/_HOST@EXAMPLE.COM</value>
        <description> Kerberos principal name for the
        NameNode </description>
</property>

<property>
        <name>dfs.secondary.namenode.kerberos.principal</name>
        <value>nn/_HOST@EXAMPLE.COM</value>
        <description>Kerberos principal name for the secondary NameNode.
        </description>
</property>

<property>
        <!--cluster variant -->
        <name>dfs.secondary.http.address</name>
        <value>ip-10-72-235-178.ec2.internal:50090</value>
        <description>Address of secondary namenode web server</description>
</property>

<property>
        <name>dfs.secondary.https.port</name>
        <value>50490</value>
        <description>The https port where secondary-namenode
        binds</description>
</property>

<property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>HTTP/_HOST@EXAMPLE.COM</value>
        <description> The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
        The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos HTTP
        SPNEGO specification.
        </description>
</property>

<property>
        <name>dfs.web.authentication.kerberos.keytab</name>
        <value>/etc/security/keytabs/spnego.service.keytab</value>
        <description>The Kerberos keytab file with the credentials for the HTTP
        Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
        </description>
</property>

<property>
        <name>dfs.datanode.kerberos.principal</name>
        <value>dn/_HOST@EXAMPLE.COM</value>
        <description>
        The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real host name.
        </description>
</property>

<property>
        <name>dfs.namenode.keytab.file</name>
        <value>/etc/security/keytabs/nn.service.keytab</value>
        <description>
        Combined keytab file containing the namenode service and host
        principals.
        </description>
</property>

<property>
        <name>dfs.secondary.namenode.keytab.file</name>
        <value>/etc/security/keytabs/nn.service.keytab</value>
        <description>
        Combined keytab file containing the namenode service and host
        principals.
        </description>
</property>

<property>
        <name>dfs.datanode.keytab.file</name>
        <value>/etc/security/keytabs/dn.service.keytab</value>
        <description>
        The filename of the keytab file for the DataNode.
        </description>
</property>

<property>
        <name>dfs.https.port</name>
        <value>50470</value>
        <description>The https port where namenode
        binds</description>
</property>

<property>
        <name>dfs.namenode.https-address</name>
        <value>ip-10-111-59-170.ec2.internal:50470</value>
        <description>The https address where namenode binds</description>
</property>

<property>
        <name>dfs.datanode.data.dir.perm</name>
        <value>750</value>
        <description>The permissions that should be there on
        dfs.data.dir directories. The datanode will not come up if the
        permissions are different on existing dfs.data.dir directories. If
        the directories don't exist, they will be created with this
        permission.</description>
</property>

<property>
        <name>dfs.access.time.precision</name>
        <value>0</value>
        <description>The access time for an HDFS file is precise up to this
        value. The default value is 1 hour. Setting a value of 0
        disables access times for HDFS.
        </description>
</property>

<property>
        <name>dfs.cluster.administrators</name>
        <value> hdfs</value>
        <description>ACL for who all can view the default
        servlets in the HDFS</description>
</property>

<property>
        <name>ipc.server.read.threadpool.size</name>
        <value>5</value>
        <description></description>
</property>

<property>
        <name>dfs.namenode.kerberos.internal.spnego.principal</name>
        <value>${dfs.web.authentication.kerberos.principal}</value>
</property>

<property>
        <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
        <value>${dfs.web.authentication.kerberos.principal}</value>
</property> 

In addition, you must set the user on all secure DataNodes:


export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/grid/0/var/run/hadoop/$HADOOP_SECURE_DN_USER
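
These variables are typically exported from hadoop-env.sh on each DataNode so that they take effect every time the secure DataNode process starts; the file location in the comment below is an assumption and may differ in your installation:

# Assumed location: /etc/hadoop/conf/hadoop-env.sh (adjust for your layout)
export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/grid/0/var/run/hadoop/$HADOOP_SECURE_DN_USER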

yarn-site.xml

You must add the following information to the yarn-site.xml file on every host in your cluster:

  • yarn.resourcemanager.principal = yarn/localhost@EXAMPLE.COM
    The Kerberos principal for the ResourceManager.

  • yarn.resourcemanager.keytab = /etc/krb5.keytab
    The keytab for the ResourceManager.

  • yarn.nodemanager.principal = yarn/localhost@EXAMPLE.COM
    The Kerberos principal for the NodeManager.

  • yarn.nodemanager.keytab = /etc/krb5.keytab
    The keytab for the NodeManager.

  • yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
    The class that executes (launches) the containers.

  • yarn.nodemanager.linux-container-executor.path = hadoop-3.0.0-SNAPSHOT/bin/container-executor
    The path to the Linux container executor.

  • yarn.nodemanager.linux-container-executor.group = hadoop
    A special group (for example, hadoop) with execute permission on the container executor, of which the NodeManager Unix user is a member and no ordinary application user is. If any application user belongs to this special group, security is compromised. Specify this special group name for this configuration property.

  • yarn.timeline-service.principal = yarn/localhost@EXAMPLE.COM
    The Kerberos principal for the Timeline Server.

  • yarn.timeline-service.keytab = /etc/krb5.keytab
    The Kerberos keytab for the Timeline Server.

  • yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled = true
    Flag that enables overriding the default Kerberos authentication filter with the RM authentication filter, allowing authentication with delegation tokens (falling back to Kerberos if the tokens are missing). Applies only when the HTTP authentication type is kerberos.

  • yarn.timeline-service.http-authentication.type = kerberos
    Defines the authentication used for the Timeline Server HTTP endpoint. Supported values are: simple | kerberos | $AUTHENTICATION_HANDLER_CLASSNAME

  • yarn.timeline-service.http-authentication.kerberos.principal = HTTP/localhost@EXAMPLE.COM
    The Kerberos principal to be used for the Timeline Server HTTP endpoint.

  • yarn.timeline-service.http-authentication.kerberos.keytab = /etc/krb5.keytab
    The Kerberos keytab to be used for the Timeline Server HTTP endpoint.

Following is the XML for these entries:




<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn/localhost@EXAMPLE.COM</value>
</property>

<property>
  <name>yarn.resourcemanager.keytab</name>
  <value>/etc/krb5.keytab</value>
</property>

<property>
  <name>yarn.nodemanager.principal</name>
  <value>yarn/localhost@EXAMPLE.COM</value>
</property>

<property>
  <name>yarn.nodemanager.keytab</name>
  <value>/etc/krb5.keytab</value>
</property>

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>

<property>
  <name>yarn.nodemanager.linux-container-executor.path</name>
  <value>hadoop-3.0.0-SNAPSHOT/bin/container-executor</value>
</property>

<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>

<property>
  <name>yarn.timeline-service.principal</name>
  <value>yarn/localhost@EXAMPLE.COM</value>
</property>

<property>
  <name>yarn.timeline-service.keytab</name>
  <value>/etc/krb5.keytab</value>
</property>

<property>
  <name>yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.timeline-service.http-authentication.type</name>
  <value>kerberos</value>
</property>

<property>
  <name>yarn.timeline-service.http-authentication.kerberos.principal</name>
  <value>HTTP/localhost@EXAMPLE.COM</value>
</property>

<property>
  <name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
  <value>/etc/krb5.keytab</value>
</property>

mapred-site.xml

You must add the following information to the mapred-site.xml file on every host in your cluster:

  • mapreduce.jobhistory.keytab = /etc/security/keytabs/jhs.service.keytab
    Kerberos keytab file for the MapReduce JobHistory Server.

  • mapreduce.jobhistory.principal = jhs/_HOST@TODO-KERBEROS-DOMAIN
    Kerberos principal name for the MapReduce JobHistory Server.

  • mapreduce.jobhistory.webapp.address = TODO-JOBHISTORYNODE-HOSTNAME:19888
    MapReduce JobHistory Server Web UI host:port.

  • mapreduce.jobhistory.webapp.https.address = TODO-JOBHISTORYNODE-HOSTNAME:19889
    MapReduce JobHistory Server HTTPS Web UI host:port.

  • mapreduce.jobhistory.webapp.spnego-keytab-file = /etc/security/keytabs/spnego.service.keytab
    Kerberos keytab file for the SPNEGO service.

  • mapreduce.jobhistory.webapp.spnego-principal = HTTP/_HOST@TODO-KERBEROS-DOMAIN
    Kerberos principal name for the SPNEGO service.

Following is the XML for these entries:


   <property>
      <name>mapreduce.jobhistory.keytab</name>
      <value>/etc/security/keytabs/jhs.service.keytab</value>
    </property>

    <property>
      <name>mapreduce.jobhistory.principal</name>
      <value>jhs/_HOST@TODO-KERBEROS-DOMAIN</value>
    </property>

    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>TODO-JOBHISTORYNODE-HOSTNAME:19888</value>
    </property>

    <property>
      <name>mapreduce.jobhistory.webapp.https.address</name>
      <value>TODO-JOBHISTORYNODE-HOSTNAME:19889</value>
    </property>

    <property>
      <name>mapreduce.jobhistory.webapp.spnego-keytab-file</name>
      <value>/etc/security/keytabs/spnego.service.keytab</value>
    </property>

    <property>
      <name>mapreduce.jobhistory.webapp.spnego-principal</name>
      <value>HTTP/_HOST@TODO-KERBEROS-DOMAIN</value>
    </property> 

hbase-site.xml

For HBase to run on a secured cluster, HBase must be able to authenticate itself to HDFS. Add the following information to the hbase-site.xml file on your HBase server. There are no default values; the following are only examples:

  • hbase.master.keytab.file = /etc/security/keytabs/hm.service.keytab
    The keytab for the HMaster service principal.

  • hbase.master.kerberos.principal = hm/_HOST@EXAMPLE.COM
    The Kerberos principal name that should be used to run the HMaster process. If _HOST is used as the hostname portion, it is replaced with the actual hostname of the running instance.

  • hbase.regionserver.keytab.file = /etc/security/keytabs/rs.service.keytab
    The keytab for the HRegionServer service principal.

  • hbase.regionserver.kerberos.principal = rs/_HOST@EXAMPLE.COM
    The Kerberos principal name that should be used to run the HRegionServer process. If _HOST is used as the hostname portion, it is replaced with the actual hostname of the running instance.

  • hbase.superuser = hbase
    Comma-separated list of users or groups that are allowed full privileges, regardless of stored ACLs, across the cluster. Only used when HBase security is enabled.

  • hbase.coprocessor.region.classes (left empty in the example below)
    Comma-separated list of Coprocessors that are loaded by default on all tables. For any override coprocessor method, these classes are called in order. After implementing your own Coprocessor, put it in HBase's classpath and add the fully qualified class name here. A coprocessor can also be loaded on demand by setting HTableDescriptor.

  • hbase.coprocessor.master.classes (left empty in the example below)
    Comma-separated list of org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are loaded by default on the active HMaster process. For any implemented coprocessor methods, the listed classes are called in order. After implementing your own MasterObserver, put it in HBase's classpath and add the fully qualified class name here.

Following is the XML for these entries:


<property>
        <name>hbase.master.keytab.file</name>
        <value>/etc/security/keytabs/hm.service.keytab</value>
        <description>Full path to the kerberos keytab file to use for logging
        in the configured HMaster server principal.
        </description>
</property>

<property>
        <name>hbase.master.kerberos.principal</name>
        <value>hm/_HOST@EXAMPLE.COM</value>
        <description>Ex. "hbase/_HOST@EXAMPLE.COM".
        The kerberos principal name that
        should be used to run the HMaster process.  The
        principal name should be in
        the form: user/hostname@DOMAIN.  If "_HOST" is used
        as the hostname portion, it will be replaced with the actual hostname of the running
        instance.
        </description>
</property>

<property>
        <name>hbase.regionserver.keytab.file</name>
        <value>/etc/security/keytabs/rs.service.keytab</value>
        <description>Full path to the kerberos keytab file to use for logging
        in the configured HRegionServer server principal.
        </description>
</property>

<property>
        <name>hbase.regionserver.kerberos.principal</name>
        <value>rs/_HOST@EXAMPLE.COM</value>
        <description>Ex. "hbase/_HOST@EXAMPLE.COM".
        The kerberos principal name that
        should be used to run the HRegionServer process. The
        principal name should be in the form:
        user/hostname@DOMAIN.  If _HOST
        is used as the hostname portion, it will be replaced
        with the actual hostname of the running
        instance.  An entry for this principal must exist
        in the file specified in hbase.regionserver.keytab.file
        </description>
</property>

<!--Additional configuration specific to HBase security -->

<property>
        <name>hbase.superuser</name>
        <value>hbase</value>
        <description>List of users or groups (comma-separated), who are
        allowed full privileges, regardless of stored ACLs, across the cluster. Only
        used when HBase security is enabled.
        </description>
</property>

<property>
        <name>hbase.coprocessor.region.classes</name>
        <value></value>
        <description>A comma-separated list of Coprocessors that are loaded
        by default on all tables. For any override coprocessor method, these
        classes will be called in order. After implementing your own Coprocessor,
        just put it in HBase's classpath and add the fully qualified class name
        here. A coprocessor can also be loaded on demand by setting HTableDescriptor.
        </description>
</property>

<property>
        <name>hbase.coprocessor.master.classes</name>
        <value></value>
        <description>A comma-separated list of
        org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that
        are loaded by default on the active HMaster process. For any implemented
        coprocessor methods, the listed classes will be called in order.
        After implementing your own MasterObserver, just put it in HBase's
        classpath and add the fully qualified class name here.
        </description>
</property> 

hive-site.xml

Hive Metastore supports Kerberos authentication for Thrift clients only. HiveServer does not support Kerberos authentication for any clients.

To the hive-site.xml file on every host in your cluster, add the following information:

  • hive.metastore.sasl.enabled = true
    If true, the Metastore Thrift interface is secured with SASL and clients must authenticate with Kerberos.

  • hive.metastore.kerberos.keytab.file = /etc/security/keytabs/hive.service.keytab
    The keytab for the Metastore Thrift service principal.

  • hive.metastore.kerberos.principal = hive/_HOST@EXAMPLE.COM
    The service principal for the Metastore Thrift server. If _HOST is used as the hostname portion, it is replaced with the actual hostname of the running instance.

  • hive.metastore.cache.pinobjtypes = Table,Database,Type,FieldSchema,Order
    Comma-separated list of Metastore object types that should be pinned in the cache.

Following is the XML for these entries:


<property>
        <name>hive.metastore.sasl.enabled</name>
        <value>true</value>
        <description>If true, the metastore thrift interface will be secured with
        SASL.
        Clients must authenticate with Kerberos.</description>
</property>

<property>
        <name>hive.metastore.kerberos.keytab.file</name>
        <value>/etc/security/keytabs/hive.service.keytab</value>
        <description>The path to the Kerberos Keytab file containing the
        metastore thrift server's service principal.</description>
</property>

<property>
        <name>hive.metastore.kerberos.principal</name>
        <value>hive/_HOST@EXAMPLE.COM</value>
        <description>The service principal for the metastore thrift server. The
        special string _HOST will be replaced automatically with the correct
        hostname.</description>
</property>

<property>
        <name>hive.metastore.cache.pinobjtypes</name>
        <value>Table,Database,Type,FieldSchema,Order</value>
        <description>List of comma separated metastore object types that should be pinned in
        the cache</description>
</property>

oozie-site.xml

To the oozie-site.xml file, add the following information:

  • oozie.service.AuthorizationService.security.enabled = true
    Specifies whether security (user name/admin role) is enabled. If it is disabled, any user can manage the Oozie system and any job.

  • oozie.service.HadoopAccessorService.kerberos.enabled = true
    Indicates whether Oozie is configured to use Kerberos.

  • local.realm = EXAMPLE.COM
    Kerberos realm used by Oozie and Hadoop. local.realm is used so the value stays aligned with the Hadoop configuration.

  • oozie.service.HadoopAccessorService.keytab.file = /etc/security/keytabs/oozie.service.keytab
    The keytab for the Oozie service principal.

  • oozie.service.HadoopAccessorService.kerberos.principal = oozie/_HOST@EXAMPLE.COM
    Kerberos principal for the Oozie service.

  • oozie.authentication.type = kerberos
    The authentication type used for the Oozie HTTP endpoint.

  • oozie.authentication.kerberos.principal = HTTP/_HOST@EXAMPLE.COM
    The Kerberos principal used for the Oozie HTTP endpoint. The principal must start with HTTP/ per the Kerberos HTTP SPNEGO specification.

  • oozie.authentication.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab
    Location of the keytab file with the credentials for the HTTP principal.

  • oozie.service.HadoopAccessorService.nameNode.whitelist
    Whitelisted NameNodes for the Oozie service (left empty in this example).

  • oozie.authentication.kerberos.name.rules = RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/mapred/ RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/ RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/hbase/ RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/hbase/ DEFAULT
    The mapping from Kerberos principal names to local OS user names. See Creating Mappings Between Principals and UNIX Usernames for more information.

  • oozie.service.ProxyUserService.proxyuser.knox.groups = users
    Grants proxy privileges to the knox user. Required only when using a Knox Gateway.

  • oozie.service.ProxyUserService.proxyuser.knox.hosts = $knox_host_FQDN
    Identifies the Knox Gateway. Required only when using a Knox Gateway.
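
For reference, the example values above translate into oozie-site.xml entries along the following lines (a sketch only; the NameNode whitelist and the Knox proxy-user properties are omitted and should be added if you need them):

<property>
        <name>oozie.service.AuthorizationService.security.enabled</name>
        <value>true</value>
</property>

<property>
        <name>oozie.service.HadoopAccessorService.kerberos.enabled</name>
        <value>true</value>
</property>

<property>
        <name>local.realm</name>
        <value>EXAMPLE.COM</value>
</property>

<property>
        <name>oozie.service.HadoopAccessorService.keytab.file</name>
        <value>/etc/security/keytabs/oozie.service.keytab</value>
</property>

<property>
        <name>oozie.service.HadoopAccessorService.kerberos.principal</name>
        <value>oozie/_HOST@EXAMPLE.COM</value>
</property>

<property>
        <name>oozie.authentication.type</name>
        <value>kerberos</value>
</property>

<property>
        <name>oozie.authentication.kerberos.principal</name>
        <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>

<property>
        <name>oozie.authentication.kerberos.keytab</name>
        <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<property>
        <name>oozie.authentication.kerberos.name.rules</name>
        <value>RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/mapred/
        RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/hdfs/
        RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/hbase/
        RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/hbase/
        DEFAULT</value>
</property>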

webhcat-site.xml

To the webhcat-site.xml file, add the following information:

  • templeton.kerberos.principal = HTTP/_HOST@EXAMPLE.COM

  • templeton.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab

  • templeton.kerberos.secret = secret

  • hadoop.proxyuser.knox.groups = users
    Grants proxy privileges to the knox user. Required only when using a Knox Gateway.

  • hadoop.proxyuser.knox.hosts = $knox_host_FQDN
    Identifies the Knox Gateway. Required only when using a Knox Gateway.
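
Similarly, a sketch of the corresponding webhcat-site.xml entries built from the example values above (the two hadoop.proxyuser.knox.* properties are required only when using a Knox Gateway):

<property>
        <name>templeton.kerberos.principal</name>
        <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>

<property>
        <name>templeton.kerberos.keytab</name>
        <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<property>
        <name>templeton.kerberos.secret</name>
        <value>secret</value>
</property>

<property>
        <name>hadoop.proxyuser.knox.groups</name>
        <value>users</value>
</property>

<property>
        <name>hadoop.proxyuser.knox.hosts</name>
        <value>$knox_host_FQDN</value>
</property>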

limits.conf

Adjust the Maximum Number of Open Files and Processes

In a secure cluster, if the DataNodes are started as the root user, JSVC uses setuid to drop privileges to the hdfs user. However, the ulimit values are inherited from the root user, and the default ulimit values assigned to the root user for the maximum number of open files and processes may be too low for a secure cluster. This can result in a “Too Many Open Files” exception when the DataNodes are started.

Therefore, when configuring a secure cluster you should increase the following root ulimit values:

  • nofile - the maximum number of open files. Recommended value: 32768

  • nproc - the maximum number of processes. Recommended value: 65536

To set system-wide ulimits to these values, log in as root and add the following lines to the /etc/security/limits.conf file on every host in your cluster:

*               -    nofile            32768
*               -    nproc             65536

To set only the root user ulimits to these values, log in as root and add the following lines to the /etc/security/limits.conf file.

root               -    nofile            32768
root               -    nproc             65536

You can use the ulimit -a command to view the current settings:

[root@node-1 /]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 14874
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 14874
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

You can also use the ulimit command to dynamically set these limits until the next reboot. This method sets a temporary value that will revert to the settings in the /etc/security/limits.conf file after the next reboot, but it is useful for experimenting with limit settings. For example:

[root@node-1 /]# ulimit -n 32768

The updated value can then be displayed:

[root@node-1 /]# ulimit -n
32768

Configuring Hue

To enable Hue to work with a PHD cluster configured for Kerberos, make the following changes to Hue and Kerberos.

  • Configure Kerberos as described in Setting Up Security for Manual Installs.

  • Create a principal for the Hue Server.

    addprinc -randkey hue/$FQDN@EXAMPLE.COM

    where $FQDN is the hostname of the Hue Server and EXAMPLE.COM is the Hadoop realm.

  • Generate a keytab for the Hue principal.

    xst -k hue.service.keytab hue/$FQDN@EXAMPLE.COM

  • Place the keytab file on the Hue Server at /etc/security/keytabs/hue.service.keytab, then set its ownership and permissions:

    chown hue:hadoop /etc/security/keytabs/hue.service.keytab
    chmod 600 /etc/security/keytabs/hue.service.keytab

  • Confirm the keytab is accessible by testing kinit.

    su - hue
    kinit -k -t /etc/security/keytabs/hue.service.keytab hue/$FQDN@EXAMPLE.COM

  • Add the following to the [kerberos] section in the /etc/hue/conf/hue.ini configuration file.

    [[kerberos]]
    # Path to Hue's Kerberos keytab file
    hue_keytab=/etc/security/keytabs/hue.service.keytab
    # Kerberos principal name for Hue
    hue_principal=hue/$FQDN@EXAMPLE.COM

  • Set the path to the kinit based on the OS.

    # Path to kinit
    # For RHEL/CentOS 6.x, kinit_path is /usr/bin/kinit
    kinit_path=/usr/kerberos/bin/kinit

  • Set security_enabled=true for every component in hue.ini.

    [[hdfs_clusters]], [[yarn_clusters]], [[liboozie]], [[hcatalog]]
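
    For example, a minimal sketch of where the flag goes, using the section names listed above (the surrounding settings in each section are omitted here and the exact nesting depends on your existing hue.ini and Hue version):

    [[hdfs_clusters]]
      security_enabled=true

    [[yarn_clusters]]
      security_enabled=true

    [[liboozie]]
      security_enabled=true

    [[hcatalog]]
      security_enabled=true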

  • Save the hue.ini file.

  • Restart Hue:

    # /etc/init.d/hue start

Setting up One-Way Trust with Active Directory

In environments where users from Active Directory (AD) need to access Hadoop services, set up a one-way trust between the Hadoop Kerberos realm and the AD domain.

Configure Kerberos Hadoop Realm on the AD DC

Configure the Hadoop realm on the AD DC server and set up the one-way trust.

  • Add the Hadoop Kerberos realm and KDC host to the DC:

    ksetup /addkdc $hadoop.realm $KDC-host

  • Establish one-way trust between the AD domain and the Hadoop realm:

    netdom trust $hadoop.realm /Domain:$AD.domain /add /realm /passwordt:$trust_password

  • (Optional) If Windows clients within the AD domain need to access Hadoop Services, and the domain does not have a search route to find the services in Hadoop realm, run the following command to create a hostmap for Hadoop service host:

    ksetup /addhosttorealmmap $hadoop-service-host $hadoop.realm

  • (Optional) Define the encryption type:

    ksetup /SetEncTypeAttr $hadoop.realm $encryption_type

    Set encryption types based on your security requirements. Mismatched encryption types cause problems.

Configure the AD Domain on the KDC and Hadoop Cluster Hosts

Add the AD domain as a realm to the krb5.conf on the Hadoop cluster hosts. Optionally configure encryption types and UDP preferences.

  • Open the krb5.conf file with a text editor and make the following changes:

    • To the [libdefaults] section, add the following properties.

      • Set the Hadoop realm as default:

        [libdefaults]
        default_realm = $hadoop.realm

      • Set the encryption type:

        [libdefaults]
        default_tkt_enctypes = $encryption_types
        default_tgs_enctypes = $encryption_types
        permitted_enctypes = $encryption_types

        where the $encryption_types match the type supported by your environment. For example:

        default_tkt_enctypes = aes256-cts aes128-cts rc4-hmac arcfour-hmac-md5 des-cbc-md5 des-cbc-crc
        default_tgs_enctypes = aes256-cts aes128-cts rc4-hmac arcfour-hmac-md5 des-cbc-md5 des-cbc-crc
        permitted_enctypes = aes256-cts aes128-cts rc4-hmac arcfour-hmac-md5 des-cbc-md5 des-cbc-crc

      • If TCP is open on the KDC and AD server, force Kerberos to prefer TCP over UDP:

        [libdefaults]
        udp_preference_limit = 1

    • Add a realm for the AD domain:

      [realms]
      $AD.DOMAIN = {
        kdc = $AD-host-FQDN
        admin_server = $AD-host-FQDN
        default_domain = $AD.domain
      }

    • Save the krb5.conf changes on all Hadoop cluster hosts.

  • Add the trust principal for the AD domain to the Hadoop MIT KDC:

    kadmin
    kadmin:addprinc krbtgt/$hadoop.realm@$AD.domain

    This command prompts you for the trust password. Use the same trust password ($trust_password) that you supplied to the netdom command in the earlier step.
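
Taken together, the AD-related additions to krb5.conf on a cluster host might look like the following sketch. All values are placeholders from the steps above; your existing Hadoop realm definitions are not repeated here, and udp_preference_limit applies only if TCP is open on the KDC and AD server:

[libdefaults]
    default_realm = $hadoop.realm
    default_tkt_enctypes = $encryption_types
    default_tgs_enctypes = $encryption_types
    permitted_enctypes = $encryption_types
    udp_preference_limit = 1

[realms]
    $AD.DOMAIN = {
        kdc = $AD-host-FQDN
        admin_server = $AD-host-FQDN
        default_domain = $AD.domain
    }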

Uninstalling PHD

    Use the following instructions to uninstall PHD:
  • Stop all of the installed PHD services.

  • If Knox is installed, run the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove knox\*

    • For SLES:

      zypper remove knox\*

  • If Ranger is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove ranger\*

    • For SLES:

      zypper remove ranger\*

  • If Hive is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove hive\*

    • For SLES:

      zypper remove hive\*

  • If HBase is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove hbase\*

    • For SLES:

      zypper remove hbase\*

  • If Tez is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove tez\*

    • For SLES:

      zypper remove tez\*

  • If ZooKeeper is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove zookeeper\*

    • For SLES:

      zypper remove zookeeper\*

  • If Oozie is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove oozie\*

    • For SLES:

      zypper remove oozie\*

  • If Pig is installed, execute the following command on all the cluster nodes:

    • For RHEL/CentOS Linux:

      yum remove pig\*

    • For SLES:

      zypper remove pig\*

  • If compression libraries are installed, execute the following commands on all the cluster nodes:

    yum remove snappy\*
    yum remove hadooplzo\*

  • If Knox is installed, execute the following command on the Knox Gateway host:

    • For RHEL/CentOS Linux:

      yum remove knox\*

    • For SLES:

      zypper remove knox\*

  • Uninstall Hadoop. Run the following command on all the cluster nodes:

    yum remove hadoop\*

  • Uninstall ExtJS libraries and MySQL connector. Run the following command on all the cluster nodes:

    yum remove extjs-2.2-1 mysql-connector-java-5.0.8-1\*