Using Apache Hive

Pivotal HD deploys Apache Hive for your Hadoop cluster.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to impose structure on the data, and the capability to query and analyze large data sets stored in Hadoop files.

Hive defines a simple SQL query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
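For illustration only, the following is a minimal hedged HiveQL sketch; the weblogs table and its columns are hypothetical and not part of any PHD sample data:

CREATE TABLE weblogs (host STRING, status INT);
SELECT status, COUNT(*) AS hits FROM weblogs GROUP BY status;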

Hive Documentation

Documentation for Hive can be found in the Apache Hive wiki and javadocs.

New Feature: Temporary Tables

Hive 0.14 introduces support for temporary tables. A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query. Rather than manually deleting tables needed only as intermediate data, Hive automatically deletes all temporary tables at the end of the Hive session in which they are created. The data in these tables is stored in the user's scratch directory rather than in the Hive warehouse directory. The scratch directory effectively acts as the user's data sandbox, located by default in /tmp/hive-<username>.

Hive users create temporary tables using the TEMPORARY keyword:

CREATE TEMPORARY TABLE tmp1 (c1 string);
CREATE TEMPORARY TABLE tmp2 AS ...
CREATE TEMPORARY TABLE tmp3 LIKE ...

Multiple Hive users can create multiple Hive temporary tables with the same name because each table resides in a separate session.
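For example, a hedged sketch of staging intermediate results in a temporary table; the pageviews and tmp_recent names are hypothetical, and tmp_recent is dropped automatically when the session ends:

CREATE TEMPORARY TABLE tmp_recent AS SELECT userid, link FROM pageviews WHERE datestamp = '2014-09-23';
SELECT link, COUNT(*) FROM tmp_recent GROUP BY link;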

Temporary tables support most table options, but not all. The following features are not supported:

  • Partition columns

  • Indexes

New Feature: Cost-based SQL Optimization

Hive 0.13.0 introduced cost-based optimization, or CBO, of SQL queries. CBO uses statistics about Hive tables, table partitions, and columns within a table to produce good query execution plans. More efficient query plans better utilize cluster resources and improve query latency. CBO is most useful for complex queries containing multiple JOIN statements and for queries on very large tables.

CBO currently generates the following statistics:

  • Table-level: uncompressed size of the table, number of rows, and number of files

  • Column-level: number of distinct values, number of NULL values, minimum value, and maximum value

CBO requires column-level statistics to generate the best query execution plans. Later, when viewing these statistics from the command line, you can choose to also include table-level statistics generated by the hive.stats.autogather configuration property. However, CBO does not use these table-level statistics to generate query execution plans.

Enabling Cost-based SQL Optimization

Pivotal recommends that administrators always enable CBO. Set and verify the following configuration parameters in hive-site.xml to enable cost-based optimization of SQL queries:

CBO Configuration Parameters

  • hive.cbo.enable: Enables cost-based query optimization. Default value: false.

  • hive.stats.autogather: Enables automated gathering of table-level statistics for newly created tables and table partitions, such as tables created with the INSERT OVERWRITE statement. This parameter does not produce column-level statistics, such as those generated by CBO. If disabled, administrators must manually generate these table-level statistics with the ANALYZE TABLE statement. Default value: true.

The following configuration properties are not specific to CBO, but setting them to true will also improve the performance of queries that generate statistics:

  • hive.stats.fetch.column.stats: Instructs Hive to collect column-level statistics. Default value: false.

  • hive.compute.query.using.stats: Instructs Hive to use statistics when generating query plans. Default value: false.
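The properties above belong in hive-site.xml. For quick experimentation, the following hedged sketch shows a session-level equivalent issued from the Hive CLI or Beeline; it is not a substitute for the permanent configuration:

SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
SET hive.stats.fetch.column.stats=true;
SET hive.compute.query.using.stats=true;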

Generating Statistics

Use the ANALYZE TABLE command to generate statistics for tables and columns. Use the optional NOSCAN clause to improve query performance by preventing a scan of files on HDFS. This option gathers only the following statistics:

  • Number of files

  • Size of files in bytes

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [NOSCAN];

The following example generates statistics for all partitions in the employees table. The statement also uses the NOSCAN clause to improve performance:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS NOSCAN;

Generating Column-level Statistics

Use the following syntax to generate statistics for table columns:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS FOR COLUMNS [NOSCAN];

The following example generates statistics for all columns in the employees table:

ANALYZE TABLE employees PARTITION (dt) COMPUTE STATISTICS FOR COLUMNS;

Viewing Statistics

Use the DESCRIBE statement to view statistics generated by CBO. Include the EXTENDED keyword if you want to include statistics gathered when the hive.stats.fetch.column.stats and hive.compute.query.using.stats properties are enabled.

Viewing Generated Table Statistics

Use the following syntax to view table statistics:

DESCRIBE [EXTENDED] tablename;

The following example displays all statistics for the employees table:

DESCRIBE EXTENDED employees;

Viewing Generated Column Statistics

Use the following syntax to view column statistics:

DESCRIBE FORMATTED [dbname.]tablename.columnname;

The following example displays statistics for the region column in the employees table:

DESCRIBE FORMATTED employees.region;

New Feature: ORC Format Improvement

Hive 0.14.0 introduces the ability to create ORC-based tables with Hive that contain more than 1000 columns.

Streaming Data Ingestion

Limitations

Hive 0.13 and 0.14 have the following limitations to ingesting streaming data:

  • Only ORC files are supported

  • Destination tables must be bucketed

Query Vectorization

Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch consists of a column vector, which is usually an array of a primitive type. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. HIVE-4160 has the design document for vectorization and tracks the implementation of many subtasks.

Enable Vectorization in Hive

To enable vectorization, set this configuration parameter:

  • hive.vectorized.execution.enabled=true

When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.
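For example, a hedged session-level sketch; orc_table and col1 are hypothetical names, and the table must be stored as ORC for vectorized execution:

SET hive.vectorized.execution.enabled=true;
SELECT col1, COUNT(*) FROM orc_table GROUP BY col1;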

Log Information about Vectorized Execution of Queries

The Hive client will log, at the info level, whether a query's execution is being vectorized. More detailed logs are printed at the debug level.

The client logs can also be configured to show up on the console.

Supported Functionality

The current implementation supports only single-table, read-only queries. DDL and DML statements are not supported.

The supported operators are selection, filter, and group by.

Partitioned tables are supported.

These data types are supported:

  • tinyint

  • smallint

  • int

  • bigint

  • boolean

  • float

  • double

  • timestamp

  • string

  • char

  • varchar

  • binary

These expressions are supported:

  • Comparison: >, >=, <, <=, =, !=

  • Arithmetic: plus, minus, multiply, divide, modulo

  • Logical: AND, OR

  • Aggregates: sum, avg, count, min, max

Only the ORC file format is supported in the current implementation.

Unsupported Functionality

All data types, file formats, and functionality not listed in Supported Functionality above are currently unsupported.

Two unsupported features of particular interest are the logical expression NOT and the cast operator. For example, a query such as select x,y from T where a = b will not vectorize if a is integer and b is double. Although both int and double are supported, casting of one to another is not supported.

Comparing Beeline to the Hive CLI

PHD supports two Hive clients: the Hive CLI and Beeline. The primary difference between the two involves how the clients connect to Hive. The Hive CLI connects directly to HDFS and the Hive Metastore and can be used only on a host with access to those services. Beeline connects to HiveServer2 and requires access to only one .jar file: hive-jdbc-<version>-standalone.jar. Pivotal recommends using HiveServer2 and a JDBC client such as Beeline as the primary way to access Hive, because this route uses SQL standard-based authorization or Ranger-based authorization. However, some users may wish to access Hive data from other applications, such as Pig. For these use cases, use the Hive CLI and storage-based authorization.

Beeline Operating Modes and HiveServer2 Transport Modes

Beeline supports the following modes of operation:

Beeline Modes of Operation

  • Embedded: The Beeline client and the Hive installation both reside on the same host machine. No TCP connectivity is required.

  • Remote: Use remote mode to support multiple, concurrent clients executing queries against the same remote Hive installation. Remote transport mode supports authentication with LDAP and Kerberos. It also supports encryption with SSL. TCP connectivity is required.

Administrators may start HiveServer2 in one of the following transport modes:

HiveServer2 Transport Modes

  • TCP: HiveServer2 uses TCP transport for sending and receiving Thrift RPC messages.

  • HTTP: HiveServer2 uses HTTP transport for sending and receiving Thrift RPC messages.

While running in TCP transport mode, HiveServer2 supports the following authentication schemes:

Authentication Schemes with TCP Transport Mode

  • Kerberos: A network authentication protocol that uses the concept of 'tickets' to allow nodes in a network to securely identify themselves. Administrators must specify hive.server2.authentication=kerberos in the hive-site.xml configuration file to use this authentication scheme.

  • LDAP: The Lightweight Directory Access Protocol, an application-layer protocol that uses the concept of 'directory services' to share information across a network. Administrators must specify hive.server2.authentication=ldap in the hive-site.xml configuration file to use this type of authentication.

  • PAM: Pluggable Authentication Modules, or PAM, allow administrators to integrate multiple authentication schemes into a single API. Administrators must specify hive.server2.authentication=pam in the hive-site.xml configuration file to use this authentication scheme.

  • Custom: Authentication provided by a custom implementation of the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The implementing class must be available in the classpath for HiveServer2, and its name must be provided as the value of the hive.server2.custom.authentication.class property in the hive-site.xml configuration file.

  • None: The Beeline client performs no authentication with HiveServer2. Administrators must specify hive.server2.authentication=none in the hive-site.xml configuration file to use this authentication scheme.

  • NoSASL: By default, HiveServer2 in TCP transport mode uses the Java Simple Authentication and Security Layer (SASL) protocol to establish a security layer between the client and server. However, HiveServer2 also supports TCP connections that do not use the SASL protocol. Administrators must specify hive.server2.authentication=nosasl in the hive-site.xml configuration file to use this authentication scheme.

The next section describes the connection strings used to connect to HiveServer2 for all possible combinations of these modes, as well as the connection string required to connect to HiveServer2 in a secure cluster.

Connecting to Hive with Beeline

The following examples demonstrate how to use Beeline to connect to Hive for all possible variations of these modes:

Embedded Client

Use the following syntax to connect to Hive from Beeline in embedded mode:

!connect jdbc:hive2://

Remote Client with HiveServer2 TCP Transport Mode and SASL Authentication

Use the following syntax to connect to HiveServer2 in TCP mode from a remote Beeline client:

!connect jdbc:hive2://<host>:<port>/<db>

The default port for HiveServer2 in TCP mode is 10000. db is the name of the database to which you want to connect.

Remote Client with HiveServer2 TCP Transport Mode and NoSASL Authentication

Clients must explicitly specify the authentication mode in their connection string when HiveServer2 runs in NoSASL mode:

!connect jdbc:hive2://<host>:<port>/<db>;auth=noSasl hiveuser pass org.apache.hive.jdbc.HiveDriver

If users forget to include auth=noSasl in the JDBC connection string, the JDBC client API attempts to make an SASL connection to HiveServer2. This causes an open connection that eventually results in the client crashing with an Out Of Memory error.

Remote Client with HiveServer2 HTTP Transport Mode

Use the following syntax to connect to HiveServer2 in HTTP mode from a remote Beeline client:

!connect jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>

Remote Client with HiveServer2 in Secure Cluster

Use the following syntax to connect to HiveServer2 in a secure cluster from a remote Beeline client:

!connect jdbc:hive2://<host>:<port>/<db>;principal=<Server_Principal_of_HiveServer2>

Troubleshooting Hive

MySQL is the default database used by the Hive metastore. Depending on several factors, such as the version and configuration of MySQL, Hive developers may encounter an error message similar to the following:

An exception was thrown while adding/validating classes) : Specified key was too long; max key length is 767 bytes

Administrators can resolve this issue by altering the Hive metastore database to use the Latin1 character set, as shown in the following example:

mysql> ALTER DATABASE <metastore_database_name> character set latin1;

Hive JIRAs

Issue tracking for Hive bugs and improvements can be found here: Hive JIRAs.

SQL Compliance

This chapter discusses the ongoing implementation of standard SQL syntax in Hive. Although SQL in Hive does not yet entirely support the SQL-2011 standard, versions 0.13 and 0.14 provide significant improvements to the parity between SQL as used in Hive and SQL as used in traditional relational databases.

Hive 0.13 Feature: SQL Standard-based Authorization with GRANT And REVOKE SQL Statements

Hive 0.13 provides secure SQL standard-based authorization using the GRANT and REVOKE SQL statements. Hive provides three authorization models: SQL standard-based authorization, storage-based authorization, and default Hive authorization. In addition, Ranger provides centralized management of authorization for all PHD components. Use the following procedure to manually enable standard SQL authorization:

  • Set the following configuration parameters in hive-site.xml :

    Configuration Parameters for Standard SQL Authorization:

    • hive.server2.enable.doAs: false

    • hive.users.in.admin.role: Comma-separated list of users granted the administrator role.

  • Start HiveServer2 with the following command-line options:

    HiveServer2 Command-Line Options:

    • -hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly

    • -hiveconf hive.security.authorization.enabled=true

    • -hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator

    • -hiveconf hive.metastore.uris=' ' (a space inside single quotation marks)
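Once SQL standard-based authorization is enabled, privileges are managed with the GRANT and REVOKE statements. The following is a hedged sketch; the students table and the jsmith user are hypothetical:

GRANT SELECT ON TABLE students TO USER jsmith;
SHOW GRANT USER jsmith ON TABLE students;
REVOKE SELECT ON TABLE students FROM USER jsmith;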

New Feature: INSERT ... VALUES, UPDATE, and DELETE SQL Statements

Apache Hive 0.14 provides new INSERT ... VALUES, UPDATE, and DELETE SQL statements. The INSERT ... VALUES statement allows users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements allow users to modify and delete values already written to Hive. All three statements support auto-commit, meaning that each statement is a separate transaction that is automatically committed after the SQL statement is executed.

The INSERT ... VALUES, UPDATE, and DELETE statements require the following property values in the hive-site.xml configuration file:

  • hive.enforce.bucketing: true

  • hive.exec.dynamic.partition.mode: nonstrict
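For a quick test in a single session, the following hedged sketch sets the same values from the Hive client; hive-site.xml remains the recommended location:

SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;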

INSERT ... VALUES Statement

The INSERT ... VALUES statement supports adding multiple rows to a table directly from values supplied in the SQL statement. A valid INSERT ... VALUES statement must provide values for each column in the table. However, users may assign null values to columns for which they do not want to assign a value. In addition, the PARTITION clause must be included in the DML when inserting into a partitioned table.

INSERT INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] VALUES values_row [, values_row...]

In this syntax, values_row is (value [, value ...]), where value is either NULL or any SQL literal.

The following example SQL statements demonstrate several usage variations of this statement:

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3,2)) CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING) PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23') VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp) VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

UPDATE Statement

Use the UPDATE statement to modify data already written to Apache Hive. Depending on the condition specified in the optional WHERE clause, an UPDATE statement may affect every row in a table. You must have both the SELECT and UPDATE privileges to use this statement.

UPDATE tablename SET column = value [, column = value ...] [WHERE expression];

The UPDATE statement has the following limitations:

  • The expression in the WHERE clause must be an expression supported by a Hive SELECT clause.

  • Partition and bucket columns cannot be updated.

  • Query vectorization is automatically disabled for UPDATE statements. However, updated tables can still be queried using vectorization.

  • Subqueries are not allowed on the right side of the SET statement.

The following example demonstrates the correct usage of this statement:

UPDATE students SET name = null WHERE gpa <= 1.0;

DELETE Statement

Use the DELETE statement to delete data already written to Apache Hive.

DELETE FROM tablename [WHERE expression];

The DELETE statement has the following limitation: query vectorization is automatically disabled for the DELETE operation. However, tables with deleted data can still be queried using vectorization.

The following example demonstrates the correct usage of this statement:

DELETE FROM students WHERE gpa <= 1.0;

Hive 0.13 Feature: Transactions

Support for transactions in Hive 0.13 enables SQL atomicity of operations at the row level rather than at the level of a table or partition. This allows a Hive client to read from a partition at the same time that another Hive client is adding rows to the same partition. In addition, transactions provide a mechanism for streaming clients to rapidly update Hive tables and partitions. Hive transactions differ from RDBMS transactions in that each transaction has an identifier, and multiple transactions are grouped into a single transaction batch. A streaming client requests a set of transaction IDs after connecting to Hive and subsequently uses these transaction IDs one at a time during the initialization of new transaction batches. Clients write one or more records for each transaction and either commit or abort a transaction before moving to the next transaction.

ACID is an acronym for four traits of database transactions: atomicity, consistency, isolation, and durability.

  • Atomicity: An operation either succeeds completely or fails; it does not leave partial data.

  • Consistency: Once an application performs an operation, the results of that operation are visible to the application in every subsequent operation.

  • Isolation: Operations by one user do not cause unexpected side effects for other users.

  • Durability: Once an operation is complete, it is preserved in case of machine or system failure.

By default, transactions are disabled in Hive. To use ACID-based transactions, administrators must use a transaction manager that supports ACID and the ORC file format. See Configuring the Hive Transaction Manager later in this topic for instructions on configuring a transaction manager for Hive.

Understanding Compactions

Hive stores data in base files that cannot be updated by HDFS. Instead, Hive creates a set of delta files for each transaction that alters a table or partition and stores them in a separate delta directory. Occasionally, Hive compacts, or merges, the base and delta files. Hive performs all compactions in the background without affecting concurrent reads and writes of Hive clients. There are two types of compactions:

Hive Compaction Types

  • Minor: Rewrites a set of delta files to a single delta file for a bucket.

  • Major: Rewrites one or more delta files and the base file as a new base file for a bucket.

By default, Hive automatically compacts delta and base files at regular intervals. However, Hadoop administrators can configure automatic compactions, as well as perform manual compactions of base and delta files using the following configuration parameters in hive-site.xml.

Hive Transaction Configuration Parameters

  • hive.txn.manager: Specifies the class name of the transaction manager used by Hive. Set this property to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager to enable transactions. The default value is org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager, which disables transactions.

  • hive.compactor.initiator.on: Specifies whether to run the initiator and cleaner threads on this metastore instance. The default value is false. Must be set to true for exactly one instance of the Hive metastore service.

  • hive.compactor.worker.threads: Specifies the number of worker threads to run on this metastore instance. The default value is 0; set it to a value greater than 0 to enable compactions. Worker threads initialize MapReduce jobs to perform compactions. Increasing the number of worker threads decreases the time required to compact tables after they cross a threshold that triggers compactions. However, increasing the number of worker threads also increases the background load on a Hadoop cluster.

  • hive.compactor.worker.timeout: Specifies the time period, in seconds, after which a compaction job is failed and re-queued. The default value is 86400 seconds, or 24 hours.

  • hive.compactor.check.interval: Specifies the time period, in seconds, between checks to see if any partitions require compacting. The default value is 300 seconds. Decreasing this value reduces the time required to start a compaction for a table or partition. However, it also increases the background load on the NameNode, because each check requires several calls to the NameNode.

  • hive.compactor.delta.num.threshold: Specifies the number of delta directories in a partition that triggers an automatic minor compaction. The default value is 10.

  • hive.compactor.delta.pct.threshold: Specifies the size of delta files as a percentage of the corresponding base files that triggers an automatic major compaction. The default value is 0.1, which is 10 percent.

  • hive.compactor.abortedtxn.threshold: Specifies the number of aborted transactions on a single partition that triggers an automatic major compaction.

Configuring the Hive Transaction Manager

Configure the following Hive configuration properties from the table above to enable transactions:

  • hive.txn.manager

  • hive.compactor.initiator.on

  • hive.compactor.worker.threads

Check the Hive log file at /tmp/hive/hive.log for errors if you experience problems enabling Hive transactions.

Performing Manual Compactions

Hive administrators use the ALTER TABLE DDL command to queue a request to compact base and delta files for a table or partition:

ALTER TABLE tablename [PARTITION (partition_key='partition_value' [,...])] COMPACT 'compaction_type'
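For example, the following hedged sketch requests a minor compaction of a single partition; the table and partition names are hypothetical:

ALTER TABLE employees PARTITION (dt='2014-09-23') COMPACT 'minor';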

Use the SHOW COMPACTIONS DDL command to monitor the progress of the compactions:

SHOW COMPACTIONS

This command provides the following output for each compaction:

  • Database name

  • Table name

  • Partition name

  • Major or minor compaction

  • Compaction state:

    • Initiated - waiting in queue

    • Working - currently compacting

    • Ready for cleaning - compaction completed and old files scheduled for removal

  • Thread ID

  • Start time of compaction

Hive administrators can also view a list of currently open and aborted transactions with the SHOW TRANSACTIONS DDL command:

SHOW TRANSACTIONS

This command provides the following output for each transaction:

  • Transaction ID

  • Transaction state

  • Hive user who initiated the transaction

  • Host machine where transaction was initiated

New Lock Manager

Hive 0.13 introduced a new lock manager, DbLockManager, to store all transaction and related lock information in the Hive Metastore. Heartbeats are sent regularly from lock holders and transaction initiators to the Hive metastore to prevent stale locks and transactions. The lock or transaction is aborted if the Metastore does not receive a heartbeat within the amount of time specified by the hive.txn.timeout configuration property. Hive administrators use the SHOW LOCKS DDL command to view information about locks associated with transactions.
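For example, the command can be issued without arguments to list all current locks (a minimal sketch):

SHOW LOCKS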

This command provides the following output for each lock:

  • Database name

  • Table name

  • Partition, if the table is partitioned

  • Lock state:

    • Acquired - the transaction initiator holds the lock

    • Waiting - transaction initiator is waiting for the lock

    • Aborted - the lock has timed out but has not yet been cleaned

  • Lock type:

    • Exclusive - the lock may not be shared

    • Shared_read - the lock may be shared with any number of other shared_read locks

    • Shared_write - the lock may be shared by any number of other shared_read locks but not with other shared_write locks

  • Transaction ID associated with the lock, if one exists

  • Last time lock holder sent a heartbeat

  • Time the lock was acquired, if it has been acquired

  • Hive user who requested the lock

  • Host machine on which the Hive user is running a Hive client

Transaction Limitations

PHD currently has the following limitations for ACID-based transactions in Hive:

  • The BEGIN, COMMIT, and ROLLBACK SQL statements are not yet supported. All operations are automatically committed as transactions.

  • The user initiating the Hive session must have write permission for the destination partition or table.

  • Zookeeper and in-memory locks are not compatible with transactions.

  • Only ORC files are supported.

  • Destination tables must be bucketed and not sorted.

  • Only snapshot-level isolation, similar to READ COMMITTED, is supported; a query is provided with a consistent snapshot of the data during execution.

Hive 0.13 Feature: Subqueries in WHERE Clauses

Previous versions of Hive allowed subqueries only in FROM clauses of SQL statements. A subquery is a SQL expression that returns a set of rows. The subquery is evaluated, and its result set is used to evaluate the parent query, the outer query that contains the subquery. Version 0.13 of Hive expands the use of subqueries to include WHERE clauses, as shown in the following example:

SELECT state, net_payments FROM transfer_payments WHERE transfer_payments.year IN (SELECT year FROM us_census);

No configuration is required to enable execution of subqueries in Hive; the feature is available by default. However, several restrictions exist for the use of subqueries in WHERE clauses. The next section, Understanding Subqueries, describes the concepts necessary to understand these restrictions, and the following section, Restrictions on Subqueries in WHERE Clauses, explains the restrictions.

Understanding Subqueries in SQL

SQL adheres to syntax rules like any programming language. The syntax governing the use of subqueries in WHERE clauses in SQL depends on the following concepts:

  • Query Predicates and Predicate Operators

  • Aggregated and Correlated Queries

  • Conjuncts and Disjuncts

Query Predicates and Predicate Operators

A predicate in SQL is a condition that evaluates to a Boolean value. For example, the predicate in the preceding example returns true for a row of the transfer_payments table if at least one row exists in the us_census table with the same year as the transfer_payments row. The predicate starts with the first WHERE keyword.

... WHERE transfer_payments.year IN (SELECT year FROM us_census);

A SQL predicate in a subquery must also contain a predicate operator. Predicate operators specify the relationship tested in a predicate query. For example, the predicate operator in the preceding example is the IN keyword.

Aggregated and Correlated Queries

Aggregated queries combine one or more aggregate functions, such as AVG, SUM, and MAX, with the GROUP BY statement to group query results by one or more table columns. In the following example, the AVG aggregate function returns the average salary of all employees in the engineering department grouped by year.

SELECT year, AVG(salary) FROM Employees WHERE department = 'engineering' GROUP BY year;

Correlated queries contain a query predicate with the Equals To (=) operator. One side of the operator must reference at least one column from the parent query and the other side must reference at least one column from the subquery. The following query is a revised and correlated version of the query at the beginning of this section. It is a correlated query because one side of the Equals To predicate operator in the subquery references the state column in the transfer_payments table in the parent query and the other side of the operator references the state column in the us_census table.

SELECT state, net_payments FROM transfer_payments WHERE EXISTS (SELECT year FROM us_census WHERE transfer_payments.state = us_census.state);

In contrast, an uncorrelated query does not reference any columns in the parent query.

Conjuncts and Disjuncts

A conjunct is equivalent to the AND condition, while a disjunct is the equivalent of the OR condition. The following subquery contains a conjunct:

... WHERE transfer_payments.year = "2010" AND us_census.state = "california"

The following subquery contains a disjunct:

... WHERE transfer_payments.year = "2010" OR us_census.state = "california"

Restrictions on Subqueries in WHERE Clauses

Subqueries in WHERE SQL clauses have the following limitations:

  • Subqueries must appear on the right hand side of an expression.

  • Nested subqueries are not supported.

  • Only one subquery expression is allowed for a single query.

  • Subquery predicates must appear as top level conjuncts.

  • Subqueries support four logical operators in query predicates: IN, NOT IN, EXISTS, and NOT EXISTS.

  • The IN and NOT IN logical operators may select only one column in a WHERE clause subquery.

  • The EXISTS and NOT EXISTS operators must have at least one correlated predicate.

  • The left side of a subquery must qualify all references to table columns.

  • References to columns in the parent query are allowed only in the WHERE clause of the subquery.

  • Subquery predicates that reference a column in a parent query must use the Equals To (=) predicate operator.

  • Subquery predicates may not refer only to columns in the parent query.

  • Correlated subqueries with an implied GROUP BY statement may return only one row.

  • All unqualified references to columns in a subquery must resolve to tables in the subquery.

  • Correlated subqueries cannot contain windowing clauses.

Hive 0.13 Feature: Common Table Expressions

A common table expression, or CTE, is a set of query results obtained from a simple query specified within a WITH clause that immediately precedes a SELECT or INSERT keyword. A CTE exists only within the scope of a single SQL statement. One or more CTEs can be used with the following SQL statements:

  • SELECT

  • INSERT

  • CREATE TABLE AS SELECT

  • CREATE VIEW AS SELECT

The following example demonstrates the use of q1 as a CTE in a SELECT statement:

WITH q1 AS (SELECT key from src where key = '5') SELECT * from q1;

The following example demonstrates the use of q1 as a CTE in an INSERT statement:

CREATE TABLE s1 LIKE src;
WITH q1 AS (SELECT key, value FROM src WHERE key = '5') FROM q1 INSERT OVERWRITE TABLE s1 SELECT *;

The following example demonstrates the use of q1 as a CTE in a CREATE TABLE AS SELECT clause:

CREATE TABLE s2 AS WITH q1 AS (SELECT key FROM src WHERE key = '4') SELECT * FROM q1;

The following example demonstrates the use of q1 as a CTE in a CREATE VIEW AS SELECT clause:

CREATE VIEW v1 AS WITH q1 AS (SELECT key FROM src WHERE key='5') SELECT * from q1;

CTEs are available by default in Hive 0.13; Hive administrators do not need to perform any configuration to enable them.

Limitations

Common Table Expressions have the following limitations:

  • Recursive queries are not supported.

  • The WITH clause is not supported within subquery blocks.

Hive 0.13 Feature: Quoted Identifiers in Column Names

Hive 0.13 introduced the use of quoted identifiers in the names of table columns. An identifier in SQL is a sequence of alphanumeric and underscore characters surrounded by backtick characters. Quoted identifiers in Hive are case-insensitive. In the following example, `x+y` and `a?b` are valid column names for a new table.

CREATE TABLE test (`x+y` String, `a?b` String);

Quoted identifiers can be used anywhere a column name is expected, including table partitions and buckets:

CREATE TABLE partition_date-1 (key string, value string) partitioned by (`dt+x` date, region int);
CREATE TABLE bucket_test(`key?1` string, value string) clustered by (`key?1`) into 5 buckets;

Enabling Quoted Identifiers

Set the hive.support.quoted.identifiers configuration parameter to column in hive-site.xml to enable quoted identifiers in SQL column names. For Hive 0.13, the valid values are none and column.

hive.support.quoted.identifiers = column
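For example, the following hedged session-level sketch enables the setting and then queries the test table defined earlier:

SET hive.support.quoted.identifiers=column;
SELECT `x+y`, `a?b` FROM test;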

Hive 0.13 Feature: CHAR Data Type Support

Hive 0.13 supports the CHAR data type, which greatly simplifies the process of migrating data from other databases. Hive ignores trailing whitespace characters for the CHAR data type. However, there is no consensus among database vendors on the handling of trailing whitespaces. Users performing data migration to Hive are encouraged to consult the following table before performing the migration to avoid unexpected behavior with values for CHAR, VARCHAR, and STRING data types.

The following table describes how several databases treat trailing whitespaces for the CHAR, VARCHAR, and STRING data types:

Trailing Whitespace Characters on Various Databases

Data Type    Hive       Oracle     SQL Server     MySQL      Teradata
CHAR         Ignore     Ignore     Ignore         Ignore     Ignore
VARCHAR      Compare    Compare    Configurable   Ignore     Ignore
STRING       Compare    N/A        N/A            N/A        N/A
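For example, the following hedged sketch illustrates the Hive row of the table above; the literal values are chosen only for illustration:

-- CHAR comparison ignores trailing whitespace (expected result: true)
SELECT CAST('abc ' AS CHAR(5)) = CAST('abc' AS CHAR(5));
-- VARCHAR comparison includes trailing whitespace (expected result: false)
SELECT CAST('abc ' AS VARCHAR(5)) = CAST('abc' AS VARCHAR(5));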

Running Pig with the Tez Execution Engine

By default, Apache Pig runs against Apache MapReduce, but administrators and scripters can configure Pig to run against the Apache Tez execution engine to take advantage of more efficient execution and fewer reads of HDFS. Pig supports Tez in all of the following ways:

  • Command Line: Use the -x command-line option: pig -x tez

  • Pig Properties: Set the following configuration property in the conf/pig.properties file: exectype=tez

  • Java Option: Set the following Java option for Pig: PIG_OPTS="-D exectype=tez"

  • Pig Script: Use the set command: set exectype=tez;

Users and administrators can use the same methods to configure Pig to run against the default MapReduce execution engine again:

  • Command Line: Use the -x command-line option: pig -x mr

  • Pig Properties: Set the following configuration property in the conf/pig.properties file: exectype=mr

  • Java Option: Set the following Java option for Pig: PIG_OPTS="-D exectype=mr"

  • Pig Script: Use the set command: set exectype=mr;

There are some limitations to running Pig with the Tez execution engine:

  • Queries that include the ORDER BY clause may run slower than if run against the MapReduce execution engine.

  • There is currently no user interface that allows users to view the execution plan for Pig jobs running with Tez. To diagnose a failing Pig job, users must read the Application Master and container logs.

Using PHD for Metadata Services (HCatalog)

Pivotal HD deploys Apache HCatalog to manage the metadata services for your Hadoop cluster.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:

  • Providing a shared schema and data type mechanism.

  • Providing a table abstraction so that users need not be concerned with where or how their data is stored.

  • Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.

Start the HCatalog CLI with the following command:

<hadoop-install-dir>\hcatalog-0.5.0\bin\hcat.cmd

HCatalog includes two documentation sets:

  • General information about HCatalog: This documentation covers installation and user features. The next section, Using HCatalog, provides links to individual documents in the HCatalog documentation set.

  • WebHCat information: WebHCat is a web API for HCatalog and related Hadoop components. The section Using WebHCat provides links to user and reference documents, and includes a technical update about standard WebHCat parameters.

For more details on the Apache Hive project, including HCatalog and WebHCat, see Using Apache Hive and the following resources:

Using HCatalog

Using WebHCat

WebHCat provides a REST API for HCatalog and related Hadoop components.

Security for WebHCat

WebHCat currently supports two types of security:

  • Default security (without additional authentication)

  • Authentication via Kerberos

Example: HTTP GET :table

The following example demonstrates how to specify the user.name parameter in an HTTP GET request:

% curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table/my_table?user.name=ctdean'

Example: HTTP POST :table

The following example demonstrates how to specify the user.name parameter in an HTTP POST request:

% curl -s -d user.name=ctdean \
       -d rename=test_table_2 \
       'http://localhost:50111/templeton/v1/ddl/database/default/table/test_table'

Security Error

If the user.name parameter is not supplied when required, the following security error is returned:

{
  "error": "No user found.  Missing user.name parameter."
}

Using Apache HBase

Pivotal HD deploys Apache HBase for your Hadoop cluster. HBase is a key-value data store designed for petabyte scale. HBase supports low latency reads, writes, and updates.

New Feature: Cell-level ACLs

HBase 0.98 introduces cell-level access control lists for HBase tables.

New Feature: Column Family Encryption

HBase 0.98 introduces column family encryption.

Using PHD for Workflow and Scheduling (Oozie)

Pivotal HD deploys Apache Oozie for your Hadoop cluster.

Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, HDFS operations, and sub-workflows. Oozie supports coordinator jobs, which are sequences of workflow jobs that are created at a given frequency and start when all of the required input data is available. A command-line client and a browser interface allow you to manage and administer Oozie jobs locally or remotely.

Access the Oozie web UI at the following URL: http://{your.ambari.server.hostname}:11000/oozie after installing a PHD 3.x cluster using Ambari 1.5.x.