With Apache Hive

Introduction

Ampool has developed a new storage handler for Hive that stores data in an Ampool MTable rather than HDFS. Ampool, a highly concurrent in-memory store, can speed up response times for many queries and complete queries that might otherwise fail due to a lack of resources. It also allows for greater concurrency, and data access by multiple jobs is less sensitive to I/O bandwidth issues.

How to Use

Un-tar the package and you should have the following:

ampool-hive-<version>
├── dependencies
│   ├── ampool-client-dependencies-<version>.jar
│   ├── ampool-core-<version>.jar
│   ├── fastutil-7.0.9.jar
│   ├── javax.transaction-api-1.2.jar
│   ├── log4j-api-2.5.jar
│   ├── log4j-core-2.5.jar
│   └── shiro-core-1.2.4.jar
├── lib
│   └── ampool-hive-<version>.jar
└── README.txt

The lib directory contains the jar file for this connector, and the dependencies directory contains all dependent libraries.

To use the Ampool-Hive integration, you will need a running Ampool cluster. The details for setting up an Ampool cluster can be found in the Ampool documentation. Once the cluster is set up, you will need the locator host-name and port for the steps below.

Ampool Properties

Table Creation Properties

The following properties can be used with the Ampool storage handler for Hive tables. They can be specified via tblproperties when creating tables from the Hive CLI. Some properties (like monarch.locator.host) are client-level properties: they are used only once, when the client connection is first created, and the same client is used for all further interactions. Such properties are ignored, even if specified, for all subsequent commands.

| Property | Details | Default Value |
|----------|---------|---------------|
| monarch.locator.host | The locator host-name to connect to. This is a client-level property, used only when the client connection is first created. | localhost |
| monarch.locator.port | The locator port to connect to. This is a client-level property, used only when the client connection is first created. | 10334 |
| monarch.read.timeout | The client read timeout in milliseconds. This is a client-level property, used only when the client connection is first created. | 10000 |
| monarch.redundancy | The number of redundant copies of the data. | 0 |
| monarch.buckets | The number of buckets/partitions. | 113 |
| monarch.table.type | The table type to be used for reading/writing from Ampool. The value should be one of: immutable, unordered, ordered. | immutable |
| monarch.enable.persistence | The mode of writing data to persistent storage. The value should be either async or sync. Applicable for unordered and ordered tables. | None |
| monarch.partitioning.column | The partitioning column (the hash-code of the column value) to be used for data distribution. Applicable only for immutable tables. | None |
| monarch.max.versions | The maximum number of versions to be stored in a table. Applicable only for ordered tables. | None |
| monarch.write.key.column | The index of the column to be used as the key when writing data into mutable Ampool tables. Applicable for unordered and ordered tables. | None |
| monarch.block.size | The number of rows to be stored together as a single block in Ampool immutable tables. Applicable only for immutable tables. | 1000 |
| monarch.block.format | The format used to store the data of a single block when it reaches the specified block size. The value should be one of: AMP_BYTES, AMP_SNAPPY, ORC_BYTES. Applicable only for immutable tables. | AMP_BYTES |
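
For instance, here is a sketch of a CREATE TABLE that combines several of these properties; the table name and property values are illustrative:

CREATE TABLE ampool_ordered_table(id INT, time STRING, count BIGINT)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334",
   "monarch.table.type"="ordered", "monarch.redundancy"="1",
   "monarch.max.versions"="5", "monarch.enable.persistence"="async");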

The following block formats are supported for immutable tables:

  • AMP_BYTES: The rows are serialized using Ampool-specific encoding and stored individually. This may consume more memory, but there is no additional overhead during scan to retrieve a row.
  • AMP_SNAPPY: The rows are serialized using Ampool-specific encoding and compressed using Snappy compression upon reaching the specified block size. This helps reduce memory usage, but all rows have to be decompressed during scan.
  • ORC_BYTES: All rows from a block are converted to the ORC columnar format upon reaching the specified block size. Each block then contains binary ORC data representing the rows, which is interpreted during scan.
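
For example, an immutable table that stores ORC-formatted blocks of 2000 rows could be created as follows; the table name is illustrative:

CREATE TABLE ampool_orc_table(id INT, time STRING, count BIGINT)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334",
   "monarch.table.type"="immutable", "monarch.block.size"="2000",
   "monarch.block.format"="ORC_BYTES");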

Query Properties

Apart from the above properties (available at table creation time), the following properties can be used from the Hive CLI when querying data from Ampool tables.

| Property | Details | Default Value |
|----------|---------|---------------|
| monarch.read.max.versions | The maximum number of versions to be retrieved during the query. | 1 |
| monarch.read.filter.latest | Use only the latest version when executing filters, if any, during the query. | true |

For example, you can set a property from Beeline before executing a query:

set monarch.read.max.versions=10;
select count(*) from ampool_table;

The above query will retrieve up to 10 versions of each row, if present, from ampool_table.

Note: The query properties are applicable only for tables that have more than one version; at the moment, that means only ordered tables.
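
For instance, to evaluate a filter against all retrieved versions instead of only the latest one, you could combine the two properties as sketched below; ampool_ordered_table is a hypothetical ordered table:

set monarch.read.max.versions=10;
set monarch.read.filter.latest=false;
select * from ampool_ordered_table where id > 100;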

The Column Types

Ampool MTable supports all Hive types. The column types from a Hive table are mapped to the respective MTable column types. All the basic types, as well as the following complex types, are supported:

  • Array
  • Struct
  • Map
  • Union
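
For example, a table combining these complex types could be declared as follows; the table and column names are hypothetical:

CREATE TABLE ampool_complex_table(
   id INT,
   tags ARRAY<STRING>,
   location STRUCT<city:STRING, zip:INT>,
   attributes MAP<STRING,STRING>,
   misc UNIONTYPE<INT,STRING>)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334");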

Using Beeline or the JDBC API with HiveServer2

(HIVE_HOME is the Hive installation directory; e.g., for an HDP installation: /usr/hdp//hive)

  1. On the machine running HiveServer2, run the following commands:
$ sudo mkdir $HIVE_HOME/auxlib
$ sudo cp ampool-hive-<version>/lib/*.jar $HIVE_HOME/auxlib
$ sudo cp ampool-hive-<version>/dependencies/*.jar $HIVE_HOME/auxlib
  2. Restart HiveServer2.
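
Once HiveServer2 is back up, you can connect with Beeline as usual; for example, assuming the default HiveServer2 port of 10000:

$ beeline -u "jdbc:hive2://<hiveserver2-host>:10000"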

Running Hive Queries (using Beeline)

To run Hive queries (create/insert/select) using Beeline, you need:

  1. A running Ampool cluster
  2. The required libraries copied to $HIVE_HOME/auxlib and HiveServer2 restarted

Create Table

Creating a Hive table with an external storage handler stores only the table metadata in Hive, whereas the actual table data is stored by the respective storage handler (i.e. Ampool in this case).

The following command can be used from Beeline to create a table:

CREATE TABLE ampool_table_1(id INT, time STRING, count BIGINT)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334");

The command above creates a table called ampool_table_1 with the Ampool storage handler. Eventually, when data is inserted, it is stored in the respective Ampool cluster. By default, the Ampool table will be created as <database>_<table>, using the respective details from Hive.
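
To verify which storage handler is associated with the table, you can inspect its metadata from Beeline:

DESCRIBE FORMATTED ampool_table_1;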

In case you want to name the Ampool table differently, you can provide the additional property monarch.region.name, and that name will be used for the Ampool table. For example:

CREATE TABLE ampool_table_1(id INT, time STRING, count BIGINT)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334", "monarch.region.name"="my_ampool_table");

Insert Data

You can use the following command (or any other equivalent command) to insert data into a table backed by Ampool:

INSERT OVERWRITE TABLE ampool_table_1 SELECT my_id, my_time, my_count FROM <existing-hive-table>;
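
For quick tests, you can also insert literal rows directly, assuming your Hive version (0.14 or later) supports INSERT ... VALUES:

INSERT INTO TABLE ampool_table_1 VALUES (1, '10:45:00', 100), (2, '10:46:00', 250);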

Query Data

Standard Hive queries can be used as-is to retrieve data from Hive tables backed by Ampool. For example:

SELECT * FROM ampool_table_1;
SELECT * FROM ampool_table_2;
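
Filters and aggregations work just as they do for any other Hive table; for instance:

SELECT id, count FROM ampool_table_1 WHERE id > 100;
SELECT time, SUM(count) FROM ampool_table_1 GROUP BY time;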

Load Existing Ampool Table as External Table

To create an external Hive table that references an existing (pre-created) MTable as its underlying storage:

CREATE EXTERNAL TABLE ampool_external_table(id INT, time STRING, count BIGINT)
   STORED BY "io.ampool.monarch.hive.MonarchStorageHandler"
   tblproperties ("monarch.locator.host"="host", "monarch.locator.port"="port", "monarch.region.name"="my_ampool_table");

Note

The existing MTable column types must match the respective column types of the external Hive table.

Drop Table

Hive tables backed by Ampool can be deleted like normal Hive tables. When you drop a Hive table, the respective Ampool table is also deleted, except for external tables. When an external Hive table is dropped, only the Hive table definition is dropped, leaving the Ampool table untouched.

To drop a table, you can use the following command:

DROP TABLE ampool_external_table;
DROP TABLE ampool_table_1;

Working with a Kerberised Environment

Since 1.4.3

Prerequisites:

  • Hadoop configured with Kerberos
  • Kerberos client enabled on all user nodes; users need a valid Kerberos ticket
  • Hive connector set up on the Hadoop cluster nodes, as explained above

The following describes the high-level architecture:

  • Since Ampool does not use Kerberos authentication in this architecture, network access to the Ampool cluster should be restricted so that only the Hadoop nodes are able to make connections to the Ampool cluster.

[Figure: working with a Kerberised environment]

Connecting to a Kerberised Ampool Cluster

The following properties are used when connecting to a Kerberised Ampool cluster.

| Property | Details |
|----------|---------|
| security.enable.kerberos.authc | true or false |
| security.kerberos.service.principal | Service principal of the Ampool cluster |
| security.kerberos.user.principal | Client user principal for the Ampool client |
| security.kerberos.user.keytab | Keytab file path for the client user |

The keytab file provided in the security properties must be present on all worker nodes.

Example create command using security properties:

CREATE TABLE ampool_secured_table(id INT, time STRING, count BIGINT) 
  STORED BY "io.ampool.monarch.hive.MonarchStorageHandler" 
  tblproperties ("monarch.locator.host"="localhost", "monarch.locator.port"="10334", 
  "security.enable.kerberos.authc"="true", "security.kerberos.service.principal"="locator/localhost@TEST.AMPOOL.IO",  
  "security.kerberos.user.principal"="ampoolhive@TEST.AMPOOL.IO", "security.kerberos.user.keytab"="/tmp/ampoolhive1.keytab");

Attention

The current Kerberos implementation has a few limitations at this time:

  • The Hive connector does not support user impersonation. Therefore, a service user should be used for security.kerberos.user.principal and security.kerberos.user.keytab, as shown above.

  • Authorization of operations through Hive is not supported within Ampool ADS, as it does not receive user credentials. Therefore, authorization needs to be handled at the Hive level or elsewhere.

In the future, we plan to pass individual user credentials to Ampool ADS through the Hive connector so that operations can be carried out with those credentials.