Ampool extends Geode's core region concept with additional data storage abstractions (i.e., core data structures). In Geode, a region is at its core a distributed key/value map. Geode defines multiple region types based on the behavioral properties of this key/value region in a distributed system, e.g. whether persistence is enabled, or whether the region is replicated vs. partitioned. See Region Types in Geode. Ampool differs slightly from Geode in how it views the new region types: each region type projects the behavior of a collection data structure. The new regions that Ampool introduces mostly use a row-tuple (tabular) format for the values, hence we primarily refer to them as Ampool Tables.
At its core, an MTable is a Geode key/value region and hence already supports efficient lookup/access by record key, although Ampool augments it with additional key features to make it also suitable for analytics applications.
MTables are typically used as mutable dimension tables in the context of data warehousing and BI analytics.
- Support for Scan operation on the table
- Support for both ORDERED (range partitioned) & UNORDERED (Hash partitioned) tables
- Binary and Typed column support
- Row versioning for ORDERED/UNORDERED MTable
- Row key and column based filters (including capability for filter push down close to servers where data resides)
The native Java API for MTable is very similar to the HBase API (HBase Documentation), though differences and feature-support gaps exist. MTable-related classes are prefaced with an "M" (MTable, MScan, MFilter...). For additional details, please refer to the MTable API Javadocs, which are included in the binary distribution.
The region for MTable is configured to be a partitioned region. This means that data is split using a partition function and distributed to multiple buckets. The creator of the table indicates the table type (ordered or unordered) and the replication factor (number of redundant copies). Table meta-data is replicated to all nodes while the number of splits (buckets) and replication factor are configured by the creator for the table data.
Geode terminology uses the term buckets for the partitioned data, whilst HBase calls these splits.
MTable rows must have an associated unique primary key, and ordering is done for ordered tables based on this key.
MTable data supports versioning such that a user configurable number of row versions are maintained.
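Row versioning can be pictured with a small sketch. This is not Ampool's implementation, just an illustration of the idea: keep versions of a row keyed by timestamp in a sorted map, and drop the oldest version once the configured limit is exceeded.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch (not the Ampool implementation) of how a configurable
// number of row versions can be kept: versions keyed by timestamp, with
// the oldest version dropped once the limit is exceeded.
public class VersionedRow {
    private final NavigableMap<Long, byte[]> versions = new TreeMap<>();
    private final int maxVersions;

    public VersionedRow(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    public void put(long timestamp, byte[] value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();   // evict the oldest version
        }
    }

    public byte[] latest() {
        return versions.lastEntry().getValue();
    }

    public int versionCount() {
        return versions.size();
    }
}
```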
Basic MTable Operations¶
The MTable API allows one to create tables and optionally associate types with table columns. Values are always inserted as bytes; associating types with columns enables certain performance and functional enhancements when doing comparison operations. Other operations are:
- put/update individual row or lists of rows
- get individual rows
- delete individual rows
- delete (remove) a table
- scan a table based on a filter
- add a co-processor to a table (similar in design to HBase co-processors)
Ordered tables are stored in primary key order and use range partitioning similar to what HBase uses. This type of table may be scanned using start/stop ranges and filters. Per-bucket scans are supported. A fast parallel scan interface is available for ordered tables but this does not preserve order.
Users define the start and stop keys for an ordered table and the system splits according to this range. Attempts to add keys outside the range generate an exception.
Ordered tables support versioning of rows.
Range partitioning and comparison use lexicographic ordering on the keys as bytes, so care must be taken when defining ranges and range-based operations.
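The byte-wise ordering is worth illustrating, because Java's `byte` is signed: comparing raw bytes with signed semantics would order keys differently from the unsigned lexicographic order used for ranges. A self-contained sketch (nothing here is Ampool API):

```java
// Unsigned lexicographic comparison of byte[] row keys, the ordering
// typically used for range-partitioned, HBase-style tables.
public class KeyOrder {
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so 0x80 (128) sorts AFTER 0x01,
            // unlike signed byte comparison where 0x80 is -128.
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;   // shorter key sorts first
    }
}
```

With signed comparison, a key starting with byte `0x80` would incorrectly sort before one starting with `0x01`; unsigned comparison puts it after, which is what range definitions must assume.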
Unordered tables do not use key ordering; rows are assigned to buckets (splits) based on a hash function. They support per-bucket and fast parallel scans.
Unordered tables also support versioning of rows.
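Hash-based bucket assignment can be sketched in a few lines. This only illustrates the concept; Ampool's actual partition function may differ.

```java
import java.util.Arrays;

// Illustrative sketch of hash partitioning: a row key is mapped to one of
// N buckets (splits) by hashing the key bytes. Math.floorMod avoids a
// negative bucket id when the hash code is negative.
public class HashPartitioner {
    public static int bucketFor(byte[] key, int numBuckets) {
        return Math.floorMod(Arrays.hashCode(key), numBuckets);
    }
}
```

The same key always lands in the same bucket, which is what makes key-based `get`/`put` a single-bucket operation.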
Memory Management Options¶
The system actively tries to manage memory by evicting least recently used data when heap usage rises to a configurable threshold. This threshold is set at server start-up time. Tables that are set to use eviction (memory management) are trimmed when the threshold is reached. By default, memory management is on for new tables.
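The LRU policy itself is easy to sketch with the JDK's `LinkedHashMap`. Note the simplification: the real system evicts based on a heap-usage threshold, not an entry count; this only illustrates "least recently used goes first".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch: when the cache exceeds its budget, the least
// recently used entry is removed first.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true => LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the LRU entry past the budget
    }
}
```

Reading an entry refreshes it, so recently accessed rows survive eviction longer than cold ones.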
For detailed Apache Geode information on memory management see Apache Geode Memory Management
Users may disable the eviction and overflow functionality on a per-table basis at table creation time, but be warned that tables with eviction disabled have no limit on the memory they can consume, as the system will not include them when overflowing data to free memory.
Data Eviction (Overflow of values) to Disk¶
By default, Ampool evicts MTable row values (record column values) from memory if heap usage crosses the eviction heap percentage threshold configured for the Ampool servers. Values are evicted on an LRU basis. All keys for all rows of an MTable are always kept in memory for efficient lookup, and if any values to be updated have already been evicted to disk, they are first brought back into memory before the update. The disk layout for overflow is optimized for access. Administrators may change the eviction heap percentage threshold for overflow when starting servers. Also, ampool_server_system.properties allows specification of the "-Dgemfire.HeapLRUCapacityController.evictHighEntryCountBucketsFirst" property (true/false) to select high entry count tables for eviction first.
Administrators can also configure a critical heap percentage threshold: if heap usage reaches it, applications ingesting data receive a low-memory exception and must wait and retry until heap usage drops below the critical threshold, either through a GC trigger or through the overflow/eviction process.
In the Mash shell, while starting an Ampool server (start server command), use --eviction-heap-percentage and --critical-heap-percentage to set the respective percentages.
By default, eviction-heap-percentage is 90% and critical-heap-percentage is 98% in the Ampool cache.xml. Typically the critical heap percentage should be set to 90%, keeping 8-10% free for Ampool internal memory consumption. The eviction heap percentage depends on the data ingestion rate, i.e. how fast the memory heap fills up, so that older data can be evicted to make room for new data. Typically 75-80% is a good eviction threshold if the disks are SSDs and support faster eviction.
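Putting the two thresholds together, a server start in the Mash shell might look like the fragment below. The flag names are the ones referenced in this section; the server name and the exact command form are illustrative.

```shell
# Hedged example: start an Ampool server with an SSD-friendly eviction
# threshold (75%) and the recommended critical threshold (90%).
start server --name=server1 \
  --eviction-heap-percentage=75 \
  --critical-heap-percentage=90
```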
For detailed Apache Geode documentation of eviction and critical heap percentage see Apache Geode Heap management and Apache Geode Server Configuration Parameters (cache.xml).
Users may enable disk persistence for MTables. When this feature is enabled, data inserted into the table is saved to disk and used to re-load table data into memory after a crash. Data can be persisted synchronously or asynchronously (the default).
Disk persistence is essentially a write ahead log that is used to recover the Ampool table data in case of server restart or crash.
Expiration of data¶
Expiration of data rows is supported based on a specified timeout: during MTable creation, "--expiration-timeout=" can be set (in seconds) and "--expiration-action=" can be set to either "DESTROY" or "INVALIDATE". Rows are acted upon accordingly after the timeout.
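As a hedged sketch, a creation command using these flags might look as follows. The two expiration flags are as documented above; the command form and the table name are illustrative.

```shell
# Destroy rows one day (86400 s) after insertion.
create table --name=events \
  --expiration-timeout=86400 \
  --expiration-action=DESTROY
```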
The FTable is an append-only table that would typically hold the data to be analyzed, spread across multiple tiers of storage. Such data is "immutable": once ingested it cannot be modified. Each tier is configured to hold a fixed amount of data, and upon reaching the respective threshold the data is automatically moved (evicted) to the subsequent tier, or permanently expired/deleted from the system to make room for new data. The tiers are typically in ascending order of data-access latency: the first, in-memory tier (tier-0) is the fastest to access (i.e. lowest latency), whereas the last tier is typically shared archive storage such as S3 or HDFS and has the highest latency. FTable supports "append" and "scan" operations; data movement across tiers is transparent to users and handled automatically by the system based on the eviction policy. Data stored in non-memory tiers is in the open ORC storage format, and hence can be accessed using the respective public set of APIs outside Ampool.
For the 1.2 release, only two tiers are supported: the in-memory tier and a local-disk tier with ORC as the storage format.
- Append only and Immutable – data once ingested cannot be modified (Note: In future bulk update/delete may be supported).
- The data is partitioned across cluster nodes, where one or more columns of the table can act as the partitioning key. If no partition-key columns are specified, the entire row is used as the partitioning key with simple hash-based partitioning.
- Tables support configurable redundancy, i.e. one or more copies of the data are stored across cluster nodes for availability.
The schema supports all the column types (same as Apache Hive) including basic and complex ones. See Column Types
- For the in-memory tier, rows are stored in a block (of configurable length) against a single key (as opposed to one key-value pair per row). This minimizes the per-entry memory overhead and helps reduce the overall memory footprint.
- The partitioning column value, if specified during table creation, is used to distribute records across partitions/buckets in the distributed system. Range partitioning is not supported for distributing the data across partitions.
- The record insertion time is used as a special column to order the data within a partition/bucket. This insertion-time-based ordering within a bucket helps efficiently retrieve only the required set of records over a given time range instead of scanning the entire bucket or table.
- Support for efficient scan query given a specific partition key and insertion time range for the records. (Note: Scan does NOT provide total ordering of data based on insertion time across multiple buckets of a table).
- Seamless Scan across all storage tiers for the table i.e. records retrieved over a given time range irrespective of their spread across multiple tiers of FTable storage.
- Supports column-based filtering of records, where filters are executed on the Ampool servers to lower the data transfer.
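The insertion-time ordering in the list above is what makes time-range scans cheap. A minimal sketch (not FTable internals): rows keyed by insertion timestamp in a sorted map can be retrieved with a sub-range lookup instead of a full scan.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a time-ordered bucket: rows stored by insertion timestamp,
// so a scan over [from, to) is a sorted-map sub-range, not a full pass.
// (Simplification: assumes distinct timestamps per bucket.)
public class TimeOrderedBucket {
    private final NavigableMap<Long, String> rows = new TreeMap<>();

    public void append(long insertTimeMillis, String row) {
        rows.put(insertTimeMillis, row);
    }

    // All rows with fromMillis <= insertion time < toMillis.
    public NavigableMap<Long, String> scan(long fromMillis, long toMillis) {
        return rows.subMap(fromMillis, true, toMillis, false);
    }
}
```

Note that this ordering holds only within a single bucket; as stated above, a scan does not provide total time ordering across buckets.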
Data Movement across tiers¶
FTable supports movement of data from one tier to the subsequent tier based on either ingestion time (time-based data movement) or total size consumed in the tier (resource-based data movement). In Ampool, tier-0 is memory, and all subsequent tiers (tier-1, tier-2, ...) are disk-based tiers, either local (local-disk) or shared (HDFS/S3).
The data management in tier-0 is inherited from Geode; today, an eviction mechanism moves data from tier-0 to tier-1 when heap usage crosses the eviction threshold. A FIFO (first in, first out) policy is used to evict records to the next tier, i.e. older entries are moved before newer ones. Whole blocks of records are moved to the subsequent tier rather than individual records. For tier-0, the eviction threshold is specified while starting the Ampool servers.
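The FIFO, block-at-a-time behavior can be sketched with a queue. This is illustrative only: block size and the usage metric (a block count here, heap usage in the real system) are simplified.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch of FIFO tier-0 eviction: whole blocks of records (not individual
// rows) are moved to the next tier, oldest block first, until usage falls
// back under the threshold.
public class Tier0 {
    private final Deque<List<String>> blocks = new ArrayDeque<>();
    private final Deque<List<String>> nextTier = new ArrayDeque<>();
    private final int maxBlocks;

    public Tier0(int maxBlocks) {
        this.maxBlocks = maxBlocks;
    }

    public void appendBlock(List<String> block) {
        blocks.addLast(block);                    // newest block at the tail
        while (blocks.size() > maxBlocks) {
            nextTier.addLast(blocks.pollFirst()); // evict oldest block (FIFO)
        }
    }

    public int blockCount()   { return blocks.size(); }
    public int evictedCount() { return nextTier.size(); }
}
```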
The data movement in tiers 1 to N is based on a TTL/timeout: if (current time - timeout) > insertion time of a row, the row is moved to the next tier. If the next tier does not exist, the row is deleted. The data movement policy with its timeout is defined per table during table creation and can be specified separately for each tier. If not specified, the default timeout is infinite, i.e. data will not be moved out of the tier. Whole blocks of records are moved to the subsequent tier rather than individual records.
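The per-tier TTL condition above reduces to a one-line check, shown here exactly as stated:

```java
// A row moves to the next tier once (now - timeout) > insertion time,
// i.e. the row has been in the tier longer than the configured timeout.
public class TierTtl {
    public static boolean shouldMove(long nowMillis, long timeoutMillis,
                                     long insertionTimeMillis) {
        return (nowMillis - timeoutMillis) > insertionTimeMillis;
    }
}
```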
The ampool_server_system.properties file allows specification of the "-Dgemfire.HeapLRUCapacityController.evictHighEntryCountBucketsFirst" property (true/false) to select high-entry-count tables for eviction first.
Asynchronous disk persistence is enabled by default for FTable. (Note: In the 1.2 release, disk persistence cannot be disabled.) When this feature is enabled, data inserted into the table is asynchronously saved to disk and used to re-load table data into memory after a crash. Only data for the in-memory tier-0 is stored in the persistence store, to recreate the in-memory state after a node failure. Data in the subsequent tiers is already persisted on disk and optionally replicated to other nodes through the redundancy factor. Data can be persisted synchronously or asynchronously (the default).
Disk persistence is essentially a write ahead log that is used to recover the Ampool's in-memory tier-0 data in case of server restart or crash.