Cluster Capacity Planning

Planning resources for Ampool Data Store (ADS) in a production environment requires various considerations. One common question is how big an Ampool Data Store should be, given the size of the data to be stored, the daily/hourly ingestion rate, and so on. This document describes the key considerations for determining memory and CPU sizes for an Ampool Data Store.

Ampool Table Types

Different Ampool table types are typically used to store data with different access characteristics, and the resource requirements vary based on the table type.

  • MTable: Mutable in-memory table that supports both range and hash partitioning based on the row key. Hash-partitioned tables are "unordered" tables: they distribute data evenly across table partitions (buckets) and are suitable for low-latency get/put (update) workloads. Range-partitioned tables are "ordered" tables: they keep rows sorted by row key, so in addition to get/put (update) workloads they also support efficient range/scan queries on the row key. In a data warehousing context, these tables are typically used as mutable dimension tables. (See the partitioning sketch after this list.)

  • FTable: Immutable in-memory table. Data ingested into this table cannot be updated; it is hash partitioned on a specified table column and internally ordered by ingestion time. The ingestion time is exposed as an additional column so that users can filter and efficiently retrieve data over a given time range. This table supports only "append" and "scan" operations. Because records are immutable, FTable internally stores multiple records in a single block, which significantly lowers the per-row memory overhead (incurred for Ampool's internal housekeeping metadata) compared to MTable. FTable does not require the user to specify a row key, since it does not support get/put operations.
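The difference between the two MTable partitioning schemes can be illustrated with a small sketch. The bucket-assignment logic below is a conceptual illustration only, not Ampool's actual implementation:

```python
import bisect

# Conceptual illustration of MTable partitioning; NOT Ampool's actual code.
# Hash partitioning ("unordered"): balanced distribution, fast point lookups.
# Note: Python's hash() of bytes is salted per process, which is fine for
# illustration purposes.
def hash_bucket(row_key: bytes, num_buckets: int) -> int:
    return hash(row_key) % num_buckets

# Range partitioning ("ordered"): rows with nearby keys land in the same
# bucket, so a scan over a key range touches only a few buckets.
def range_bucket(row_key: bytes, split_points: list) -> int:
    return bisect.bisect_right(sorted(split_points), row_key)

print(hash_bucket(b"user42", 8))              # any one of the 8 buckets
print(range_bucket(b"user42", [b"g", b"p"]))  # bucket 2: keys after "p"
```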

FTable also has a concept of data tiers: when the amount of data stored on any Ampool server crosses its heap eviction threshold, data is evicted to the next storage tier to free up memory. Although Ampool's design supports multiple storage tiers, release 1.2 supports a local disk tier, where spilled data is stored in ORC format. Evicted data on disk is a seamless extension of the data in memory: if a query needs data residing on the local disk tier, Ampool serves it transparently.
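A minimal sketch of the tiering idea follows; the class, names, and threshold here are assumptions for illustration, not Ampool API:

```python
from collections import deque

# Conceptual sketch of FTable tiering; names and threshold are assumptions,
# not Ampool API. Oldest in-memory blocks are spilled to the local disk tier
# (stored as ORC files in release 1.2) once heap usage crosses the threshold.
class TieredStore:
    def __init__(self, heap_limit_blocks, eviction_threshold=0.75):
        self.memory = deque()   # newest blocks at the right
        self.disk = []          # spilled blocks (ORC files on local disk)
        self.limit = int(heap_limit_blocks * eviction_threshold)

    def append(self, block):
        self.memory.append(block)
        while len(self.memory) > self.limit:
            self.disk.append(self.memory.popleft())  # evict oldest first

    def scan(self):
        # Queries see one seamless stream: disk tier, then memory tier.
        yield from self.disk
        yield from self.memory
```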

Memory Sizing

Ampool can provide a memory sizing worksheet to compute how much memory (RAM) Ampool requires to store a given amount of data. The overhead Ampool incurs per GB of stored data depends on multiple factors. As mentioned above, Ampool keeps housekeeping metadata for each table row, and this metadata varies with the table type as well as the table configuration, e.g. the maximum number of versions to keep for each row, the number of columns in a row, whether disk persistence is enabled for the table, and the table's redundancy factor.

Note

Typically, the overhead per GB is lower when there are fewer records per GB of data volume, and also when there are fewer columns per record.

The memory sizing worksheet takes as input the data volume in GB per day, the record key/value sizes, the number of columns in the value part, the eviction threshold, etc., and computes the number of records expected per day, the overhead per record, and finally how much memory is required to store those records (both keys and values) in memory. The eviction threshold is important because it dictates what percentage of total Ampool memory is used to actually store data in memory, while the remainder is reserved for Ampool's internal system data.
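The Python sketch below illustrates the kind of arithmetic the worksheet performs. The per-row overhead figure is an assumed placeholder, not Ampool's actual coefficient; use Ampool's official worksheet for real planning.

```python
# Illustrative memory-sizing arithmetic. The per-row overhead below is an
# assumed placeholder, NOT Ampool's actual constant.
def required_memory_gb(ingest_gb_per_day, key_bytes, value_bytes,
                       days_in_memory, eviction_threshold=0.75,
                       overhead_per_row_bytes=100):
    """Estimate RAM needed to hold `days_in_memory` days of ingested data."""
    record_bytes = key_bytes + value_bytes
    records_per_day = ingest_gb_per_day * 1024**3 / record_bytes
    # Raw data plus assumed per-row housekeeping metadata.
    data_gb = (days_in_memory * records_per_day
               * (record_bytes + overhead_per_row_bytes) / 1024**3)
    # Only `eviction_threshold` of total memory is usable for table data;
    # the rest is reserved for Ampool's internal system data.
    return data_gb / eviction_threshold

# Example: 50 GB/day, 32-byte keys, 1 KB values, keep 7 days in memory.
print(f"{required_memory_gb(50, 32, 1024, 7):.1f} GB")  # ~511 GB
```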

It is important to note that a client can ingest far more data into an Ampool table than the memory calculated by the sizing sheet. The calculated memory assumes that the input data (both keys and values) is expected to reside in memory. When more data is ingested into an Ampool table, the excess is evicted to disk and incurs some performance overhead when accessed. MTable and FTable have different eviction strategies.

MTable always keeps all the row keys for a table in memory; only the values (table columns) are evicted to local disk when the in-memory eviction threshold is reached. Which values to evict is decided by an LRU algorithm. So in theory, given the row key size (column N, "Ampool RowKey Size" in the sizing sheet), the maximum number of records an MTable can hold for a given amount of memory can be estimated, as the sketch below shows.
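As a rough illustration of that estimate (ignoring per-key metadata overhead, which in practice raises the effective key footprint and lowers the real limit):

```python
# Rough upper bound on MTable record count: every row key must stay in
# memory. Per-key metadata overhead is ignored, so the real limit is lower.
def mtable_max_records(memory_gb, key_bytes, eviction_threshold=0.75):
    usable_bytes = memory_gb * 1024**3 * eviction_threshold
    return int(usable_bytes / key_bytes)

# Example: 64 GB of server memory, 32-byte row keys.
print(f"{mtable_max_records(64, 32):,} records")  # ~1.6 billion
```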

FTable evicts both keys and values to the next storage tier, so in theory there is no limit on how much data you can ingest into an FTable. The sizing sheet tells you how much memory you need in order to keep the most recent "N" GB of data in memory.
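A minimal sketch of that relationship, under the same assumed eviction threshold as above:

```python
# Memory needed to keep the most recent N GB of FTable data in memory,
# assuming (as above) that only `eviction_threshold` of server memory is
# usable for table data. Per-block overhead is ignored here; FTable's
# block storage keeps it small relative to MTable.
def ftable_memory_gb(recent_gb, eviction_threshold=0.75):
    return recent_gb / eviction_threshold

print(f"{ftable_memory_gb(100):.0f} GB")  # keep the latest 100 GB hot
```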

CPU Sizing

Typically, Ampool recommends 1 CPU core (2 vCPUs) per 16 GB of RAM managed by an Ampool server. For example, if an Ampool server manages 64 GB of RAM on a machine, it is recommended to reserve 4 CPU cores for that server. These cores are primarily for the background housekeeping services Ampool runs, such as garbage collection and evicting data from memory to the next tier. Depending on the compute requirements of clients colocated with the Ampool servers, more CPU cores may be needed per Ampool server node.
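That rule of thumb is simple enough to encode directly; the helper below is just the arithmetic from the preceding paragraph:

```python
import math

# CPU sizing rule from the text: reserve 1 core (2 vCPUs) per 16 GB of RAM
# managed by an Ampool server. Client and co-processor compute is extra.
def ampool_server_cores(managed_ram_gb, gb_per_core=16):
    return math.ceil(managed_ram_gb / gb_per_core)

print(ampool_server_cores(64))   # 4 cores, matching the example above
print(ampool_server_cores(100))  # 7 cores (rounded up)
```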

Ampool also supports co-processor execution, i.e. client code executed on the Ampool server. This should be factored in as part of the client's compute requirements, not as part of Ampool's basic housekeeping and data management work.

Network Recommendations

Because Ampool is an in-memory data store, its data access latency is orders of magnitude lower than that of disks. So if clients access data over the network, good network latency and throughput are expected. In practice, we expect clients to be colocated with Ampool servers to take full advantage of the low-latency data access; for remote clients, using co-processors as much as possible is strongly recommended.