Architecture

Architecture Overview

Ampool Active Data Store (ADS) extends Apache Geode, a proven distributed in-memory data store, with many new features to enable efficient data storage, access and processing for a wide variety of data processing applications. This makes Ampool a single unified data store across different transactional and analytical applications and use-cases.

Apache Geode is a data management platform that provides real-time, consistent access to latency-sensitive applications in a variety of distributed cloud architectures. Geode pools memory, CPU, network resources, and optionally local disk across multiple processes to store and access application data and behavior. It uses dynamic replication and data partitioning techniques to implement high availability, improved performance, scalability, and fault tolerance. In addition to being a distributed data container, Geode is an in-memory data management system that provides reliable asynchronous event notifications and guaranteed message delivery. For more details, see Apache Geode.

The primary data storage abstraction in Geode is the Region, a distributed key/value store (in effect a distributed hash map) in which an application stores its data records as key/value pairs. Indexed lookup by key is the primary data access pattern supported by Geode's in-memory Region: applications can efficiently insert, update, or delete one or more records by specifying the record keys. Geode's existing Region abstraction is therefore well suited to distributed, large-scale, low-latency OLTP applications.

To address analytical workloads, such as scanning all entries that satisfy given criteria, Ampool ADS has additional capabilities, some of which are described here.

Key Features

The following are some of the key system features of Ampool ADS:

Note

Some of the features are still in flight; for more details, please contact us via the Ampool website.

Unified data store w/ new data storage abstractions (aka Ampool Table Types)

Ampool ADS extends Geode, a proven distributed in-memory data store, with new data storage abstractions that support a variety of other data processing applications in which indexed lookup is not the only data access pattern. Think of these data storage abstractions as the in-memory data structures that applications commonly use for efficient storage and retrieval of data. Ampool refers to these new storage abstractions as "Ampool Tables", because the stored data is a distributed collection of row tuples (like tables in an RDBMS). Ampool ADS not only extends the Apache Geode platform with more data storage abstractions but also adds the necessary enterprise-grade support for Quality of Service (QoS), multi-tenancy, security, etc., making it a single, unified data store for enterprise applications.

Memory as a cache vs Mutable in-memory store

Ampool ADS as an in-memory data store provides mutable data storage abstractions that allow real-time updates to data. This makes it attractive for various near-app data processing applications as opposed to traditional use of memory as a cache. Typically, enterprises use a variety of data processing frameworks, where Ampool ADS provides a very low latency, fast data exchange platform across the application jobs written using these compute frameworks. For example, a data set stored in Ampool ADS that is output by a Spark job can be quickly accessed by the next job in the pipeline. Also, the associated metadata stored in mutable dimension tables can be updated in real time, thus significantly reducing the overall application processing latency.

High throughput data movement for analytics & BI applications

New data types added in Ampool ADS support scan-oriented workloads, e.g. range queries in which a client needs to access a set of records in a single operation. Ampool ADS optimizes the scan operation and minimizes the data transferred between Ampool servers and clients through compaction, predicate push-down, and function execution on the server side.
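To make the idea of a server-side range scan with predicate push-down concrete, here is a minimal sketch in plain Java. It is a toy model, not the Ampool API: the `ScanSketch` class, its `scan` method, and the sample rows are all hypothetical names chosen for illustration. The point is only that the key range, the filter, and the column selection are applied where the data lives, so only matching values travel back to the client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Toy model of a server-side range scan (hypothetical; not the Ampool API). */
public class ScanSketch {
    // Rows are kept sorted by key, so a range scan touches only the requested key range.
    private final TreeMap<String, Map<String, Object>> rows = new TreeMap<>();

    public void put(String key, Map<String, Object> row) {
        rows.put(key, row);
    }

    /**
     * Scans keys in [startKey, stopKey), applies the filter on the "server" side,
     * and returns only the requested column of the matching rows.
     */
    public List<Object> scan(String startKey, String stopKey,
                             Predicate<Map<String, Object>> filter, String column) {
        List<Object> result = new ArrayList<>();
        for (Map<String, Object> row : rows.subMap(startKey, true, stopKey, false).values()) {
            if (filter.test(row)) {
                result.add(row.get(column));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ScanSketch table = new ScanSketch();
        table.put("order-001", Map.of("region", "EU", "amount", 120));
        table.put("order-002", Map.of("region", "US", "amount", 80));
        table.put("order-003", Map.of("region", "EU", "amount", 45));
        table.put("order-104", Map.of("region", "EU", "amount", 300));

        // The range [order-000, order-100) excludes order-104; the filter drops the US row.
        List<Object> amounts = table.scan("order-000", "order-100",
                row -> "EU".equals(row.get("region")), "amount");
        System.out.println(amounts); // prints [120, 45]
    }
}
```

In this sketch the client receives two integers rather than four full rows, which is the kind of reduction in transferred data that the paragraph above describes.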

Hierarchical tiered storage Support w/ seamless access/query support

Ampool ADS supports not only in-memory storage but also policy-based eviction of data to the next tier of storage, which typically offers higher capacity at higher latency than the previous tier. Both local storage, such as NVDIMMs, Storage Class Memory and local SSDs/disks, and remote shared storage, such as Apache HDFS, HBase and Amazon S3, can be configured for Ampool ADS. Ampool ADS provides both memory-based and time-based eviction policies. Data is evicted to the next tier as it gets older or when a storage tier reaches its eviction threshold. Ampool also allows applications to query data across tiers seamlessly, i.e. Ampool internally fetches data from whichever tier it resides on, without the application having to specify the tier explicitly.
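The following sketch illustrates threshold-based eviction and transparent reads across tiers. It is a simplified model under stated assumptions, not Ampool's implementation: `TieredStoreSketch` is a hypothetical class, the "next tier" is just a second in-memory map standing in for SSD, HDFS, or S3, and only a size threshold is modeled (time-based eviction is left out).

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy two-tier store: a bounded memory tier backed by a larger, slower tier. */
public class TieredStoreSketch {
    private final int memoryCapacity; // eviction threshold, in entries
    // LinkedHashMap iterates in insertion order, so the first entry is the oldest.
    private final LinkedHashMap<String, String> memoryTier = new LinkedHashMap<>();
    // Stand-in for the next storage tier (assumption: any key/value store would do).
    private final Map<String, String> nextTier = new HashMap<>();

    public TieredStoreSketch(int memoryCapacity) {
        this.memoryCapacity = memoryCapacity;
    }

    public void put(String key, String value) {
        nextTier.remove(key);   // the newest version lives in memory only
        memoryTier.remove(key); // re-insert so the entry counts as newest
        memoryTier.put(key, value);
        // Move the oldest entries down once the memory tier exceeds its threshold.
        Iterator<Map.Entry<String, String>> oldest = memoryTier.entrySet().iterator();
        while (memoryTier.size() > memoryCapacity && oldest.hasNext()) {
            Map.Entry<String, String> entry = oldest.next();
            nextTier.put(entry.getKey(), entry.getValue());
            oldest.remove();
        }
    }

    /** Reads check memory first and fall through to the next tier; callers never name a tier. */
    public String get(String key) {
        String value = memoryTier.get(key);
        return value != null ? value : nextTier.get(key);
    }

    public boolean inMemory(String key) {
        return memoryTier.containsKey(key);
    }

    public static void main(String[] args) {
        TieredStoreSketch store = new TieredStoreSketch(2);
        store.put("k1", "v1");
        store.put("k2", "v2");
        store.put("k3", "v3"); // exceeds the threshold, so k1 moves to the next tier
        System.out.println(store.inMemory("k1")); // prints false
        System.out.println(store.get("k1"));      // prints v1
    }
}
```

The caller uses the same `get` call whether the record is hot or has been evicted, which mirrors the seamless cross-tier access described above.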

Co-processor support for executing business logic closer to data

Co-processors are user code modules that execute before or after the supported trigger events. Ampool provides a Coprocessor framework to run the user’s custom code on an Ampool server. For more details see
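As a rough illustration of code that runs before or after a trigger event, the sketch below wires pre-put and post-put hooks into a toy table. All names here (`CoprocessorSketch`, `PutObserver`, `prePut`, `postPut`) are hypothetical and chosen for this example; the actual Ampool coprocessor interfaces may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy table with pre/post put hooks (hypothetical; not the Ampool coprocessor API). */
public class CoprocessorSketch {
    /** Hypothetical observer interface: user code implements only the hooks it needs. */
    public interface PutObserver {
        default void prePut(String key, String value) {}
        default void postPut(String key, String value) {}
    }

    private final Map<String, String> table = new HashMap<>();
    private final List<PutObserver> observers = new ArrayList<>();

    public void register(PutObserver observer) {
        observers.add(observer);
    }

    public void put(String key, String value) {
        // Pre hooks run before the write; throwing here vetoes the put.
        for (PutObserver o : observers) o.prePut(key, value);
        table.put(key, value);
        // Post hooks run only after the write has been applied.
        for (PutObserver o : observers) o.postPut(key, value);
    }

    public String get(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        CoprocessorSketch table = new CoprocessorSketch();
        List<String> auditLog = new ArrayList<>();
        table.register(new PutObserver() {
            @Override public void prePut(String key, String value) {
                if (value.isEmpty()) throw new IllegalArgumentException("empty value for " + key);
            }
            @Override public void postPut(String key, String value) {
                auditLog.add("stored " + key);
            }
        });

        table.put("user-1", "alice");
        try {
            table.put("user-2", ""); // rejected by the pre hook, so nothing is stored
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println(auditLog); // prints [stored user-1]
    }
}
```

Because the hooks run next to the table rather than in the client, validation and auditing in this sketch need no extra network round trip.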

Partitioning, Replication & Persistence of data for scalability, availability & recoverability

Ampool ADS supports both hash and range partitioning for its new region types, depending on the ordering requirements for data access. Users can specify an appropriate partitioning key to balance the data distribution across the cluster nodes. Partitions (aka buckets) are created lazily, when the first records that belong to them are created. Users can also define a redundancy factor for each region to maintain replicas of its partitions on different nodes, so that if one of the nodes goes down, the data is still available from other nodes.

Ampool also supports both synchronous and asynchronous persistence of in-memory data, so that the data can be recovered after a node restart or crash.
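The difference between the two partitioning schemes can be sketched in a few lines of Java. This is an illustrative model only: `PartitioningSketch`, `hashBucket`, and `rangeBucket` are hypothetical helpers, and the exact hash function and range boundaries Ampool uses are not specified here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy bucket assignment for hash and range partitioning (hypothetical helpers). */
public class PartitioningSketch {
    /** Hash partitioning: spreads keys across buckets but does not preserve key order. */
    public static int hashBucket(String key, int totalBuckets) {
        // floorMod keeps the result in [0, totalBuckets) even for negative hash codes.
        return Math.floorMod(key.hashCode(), totalBuckets);
    }

    /**
     * Range partitioning: each bucket owns a contiguous key range, identified by the
     * smallest key it accepts, so an ordered scan visits buckets in sequence.
     */
    public static int rangeBucket(String key, TreeMap<String, Integer> rangeStarts) {
        // floorEntry returns the entry with the greatest range start <= key.
        Map.Entry<String, Integer> owner = rangeStarts.floorEntry(key);
        if (owner == null) {
            throw new IllegalArgumentException("key is below the first range: " + key);
        }
        return owner.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> rangeStarts = new TreeMap<>();
        rangeStarts.put("a", 0); // bucket 0 owns keys in [a, h)
        rangeStarts.put("h", 1); // bucket 1 owns keys in [h, q)
        rangeStarts.put("q", 2); // bucket 2 owns keys from q upward

        System.out.println(rangeBucket("apple", rangeStarts)); // prints 0
        System.out.println(rangeBucket("kiwi", rangeStarts));  // prints 1
        System.out.println(rangeBucket("zebra", rangeStarts)); // prints 2
        System.out.println(hashBucket("a", 10));               // prints 7
    }
}
```

Range assignment keeps neighboring keys in the same bucket, which suits ordered scans, while hash assignment trades that ordering for a more even spread of keys.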

Security and Multi-user access control

Ampool extends Geode's Security Framework to cover the new data storage abstractions it supports.

Native Java APIs for extensibility

Ampool ADS publishes the Java APIs for new region types so that system developers can write connectors for various compute frameworks, and application developers can use Ampool ADS directly as an in-memory data store.

Ampool also provides out-of-the-box support for some of the popular and commonly used compute frameworks to use Ampool ADS as an underlying data store, e.g. Apache Spark, Apache Hive, Apache Apex, Python, R etc. Thus, a user can make use of their existing data processing applications written for these compute frameworks, as-is without any major changes.

Typically, Ampool ADS is co-located with the compute cluster nodes and stores the hot data close to these applications, while cold data is archived to long-term (shared) storage. Because Ampool provides seamless query access across hot and cold data tiers, it eliminates the need for a Lambda architecture that separates real-time and batch data processing systems and requires complicated merging of their output in the final warehousing stage.

Recommended Ampool Architecture

Ampool Cluster Topology

Ampool ADS topology is based on a locator-server configuration of Apache Geode. Ampool tables are configured as partitioned regions, which means that region data is split and placed into partitions (buckets) by a primary key.

Ampool Cluster Architecture Picture

The data for an in-memory table is split into a number of buckets based on a partitioning rule (the method depends on the table type). These buckets may also be replicated by a user-provided factor to keep the data available even in the case of crashes and machine failures. The figure below shows a simple example of bucket partitioning.
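One way to picture bucket replication is a placement function that puts each bucket's primary and its redundant copies on distinct servers. The sketch below uses simple round-robin placement, which is an assumption made for illustration; the `BucketPlacementSketch` class and `place` method are hypothetical, and the placement policy actually used by Geode or Ampool is not described in this section.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy round-robin placement of bucket copies on servers (assumed policy, for illustration). */
public class BucketPlacementSketch {
    /**
     * Returns, for each bucket, the servers hosting it: index 0 is the primary copy,
     * the remaining entries are redundant copies on other servers.
     */
    public static List<List<String>> place(List<String> servers, int totalBuckets,
                                           int redundantCopies) {
        if (redundantCopies >= servers.size()) {
            throw new IllegalArgumentException("need more servers than redundant copies");
        }
        List<List<String>> placement = new ArrayList<>();
        for (int bucket = 0; bucket < totalBuckets; bucket++) {
            List<String> hosts = new ArrayList<>();
            // Consecutive offsets guarantee that copies of one bucket land on distinct servers.
            for (int copy = 0; copy <= redundantCopies; copy++) {
                hosts.add(servers.get((bucket + copy) % servers.size()));
            }
            placement.add(hosts);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<List<String>> placement = place(List.of("server1", "server2", "server3"), 4, 1);
        for (int bucket = 0; bucket < placement.size(); bucket++) {
            System.out.println("bucket " + bucket + " -> " + placement.get(bucket));
        }
        // prints:
        // bucket 0 -> [server1, server2]
        // bucket 1 -> [server2, server3]
        // bucket 2 -> [server3, server1]
        // bucket 3 -> [server1, server2]
    }
}
```

With one redundant copy, every bucket in this sketch survives the loss of any single server, because its two copies never share a host.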

Note

It is a good idea to have multiple locators in a cluster to ensure continued availability in case a locator machine fails.

Ampool Bucket Distribution Example