
Storage Fundamentals

Before we explore the various AWS storage services, let’s review a few storage fundamentals. As a developer, you are likely already familiar with block storage and the differences between hot and cold storage. Cloud storage introduces some new concepts such as object storage, and we will compare these new concepts with the traditional storage concepts with which you are already familiar. If you have been working on the cloud already, these fundamentals are likely a refresher for you.

The goal of this chapter is to build a mental model that will allow you, as a developer, to make the right decisions when choosing and implementing the best storage options for your applications. With the right mental model in place, these decisions become much easier to make.

The AWS storage portfolio mental model starts with the core data building blocks, which include block, file, and object storage. For block storage, AWS has Amazon Elastic Block Store (Amazon EBS). For file storage, AWS has Amazon Elastic File System (Amazon EFS). For object storage, AWS has Amazon Simple Storage Service (Amazon S3) and Amazon S3 Glacier.


Data Dimensions

When investigating which storage options to use for your applications, consider the different dimensions of your data first. In other words, find the right tool for your data instead of squeezing your data into a tool that might not be the best fit.

So, before you start considering storage options, take time to evaluate your data and decide under which of these dimensions your data falls. This will help you make the correct decisions about what type of storage is best for your data.

Think in terms of a data storage mechanism that is most suitable for a particular workload—not a single data store for the entire system. Choose the right tool for the job.

Velocity, Variety, and Volume

The first dimension to consider comprises the three Vs of big data: velocity, variety, and volume. These concepts are applicable to more than big data. It is important to identify these traits for any data that you are using in your applications.

Velocity : Velocity is the speed at which data is being read or written, measured in reads per second (RPS) or writes per second (WPS). The velocity can be based on batch processing, periodic, near-real-time, or real-time speeds.

Variety : Variety determines how structured the data is and how many different structures exist in the data. This can range from highly structured to loosely structured, unstructured, or binary large object (BLOB) data. Highly structured data has a predefined schema, such as data stored in relational databases, which we will discuss in Chapter 4, “Hello, Databases.” In highly structured data, each entity of the same type has the same number and type of attributes, and the domain of allowed values for an attribute can be further constrained. The advantage of highly structured data is its self-described nature. Loosely structured data has entities, which have attributes/fields. Aside from the field uniquely identifying an entity, however, the attributes are not required to be the same in every entity. This data is more difficult to analyze and process in an automated fashion, putting more of the burden of reasoning about the data on the consumer or application.

Unstructured data does not have any predefined structure. It has no entities or attributes. It can contain useful information, but it must be extracted by the consumer of the data. BLOB data is useful as a whole, but there is often little benefit in trying to extract value from a piece or attribute of a BLOB. Therefore, the systems that store this data typically treat it as a black box and only need to be able to store and retrieve a BLOB as a whole.
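To make the variety dimension concrete, the following sketch contrasts the three shapes of data in plain Python. The record fields and values are hypothetical and exist only for illustration.

```python
# Highly structured: every entity has the same, predefined attributes,
# as it would in a relational table.
orders = [
    {"order_id": 1001, "customer_id": 42, "total": 29.99},
    {"order_id": 1002, "customer_id": 57, "total": 104.50},
]

# Loosely structured: each entity has a unique identifier, but the
# remaining attributes vary from item to item.
events = [
    {"event_id": "a1", "type": "click", "page": "/home"},
    {"event_id": "a2", "type": "purchase", "total": 29.99, "coupon": "SPRING"},
]

# Unstructured / BLOB: useful only as a whole; the storage layer treats it
# as an opaque sequence of bytes (here, the first bytes of a JPEG header).
blob = bytes([0xFF, 0xD8, 0xFF, 0xE0])
```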

Volume : Volume is the total size of the dataset. There are two main uses for data: developing valuable insights and storing it for later use. When getting valuable insights from data, having more data is often preferable to using better models. When keeping data for later use, be it for digital assets or backups, the more data that you can store, the less you need to guess what data to keep and what to throw away. These two uses prompt you to collect as much data as you can store, process, and afford to keep.

Typical metrics that measure the ability of a data store to support volume are maximum storage capacity and cost (such as $/GB).
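As a back-of-the-envelope illustration of these volume metrics, the sketch below estimates the monthly cost of a dataset at an assumed per-GB rate. The rate is illustrative only, not a quoted AWS price.

```python
# Rough monthly cost estimate for a dataset; the per-GB rate is an
# assumption for illustration, not a current AWS price.
dataset_gb = 5_000            # 5 TB of data
rate_per_gb_month = 0.023     # assumed $/GB-month

monthly_cost = dataset_gb * rate_per_gb_month
print(f"Estimated storage cost: ${monthly_cost:,.2f}/month")
```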

Storage Temperature

Data temperature is another useful way of looking at data to determine the right storage for your application. It helps us understand how “lively” the data is: how much is being written or read and how soon it needs to be available.

Hot : Hot data is being worked on actively; that is, new ingests, updates, and transformations are actively contributing to it. Both reads and writes tend to be single-item. Items tend to be small (up to hundreds of kilobytes). Speed of access is essential. Hot data tends to be high-velocity and low-volume.

Warm : Warm data is still being actively accessed, but less frequently than hot data. Often, items can be as small as in hot workloads but are updated and read in sets. Speed of access, while important, is not as crucial as with hot data. Warm data is more balanced across the velocity and volume dimensions.

Cold : Cold data still needs to be accessed occasionally, but updates to this data are rare, so reads can tolerate higher latency. Items tend to be large (tens or hundreds of megabytes, or even gigabytes). Items are often written and read individually. High durability and low cost are essential. Cold data tends to be high-volume and low-velocity.

Frozen : Frozen data needs to be preserved for business continuity or for archival or regulatory reasons, but it is not being worked on actively. While new data is regularly added to this data store, existing data is never updated. Reads are extremely infrequent (known as “write once, read never”) and can tolerate very high latency. Frozen data tends to be extremely high-volume and extremely low-velocity.

The same data can start as hot and gradually cool down. As it does, the tolerance of read latency increases, as does the total size of the dataset. Later in this chapter, we explore individual AWS services and discuss which services are optimized for the dimensions that we have discussed so far.
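One common way to act on this cooling pattern is an Amazon S3 lifecycle configuration that transitions objects into colder, cheaper storage classes as they age. The following boto3 sketch shows the idea; the bucket name, prefix, and day thresholds are assumptions, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Transition aging objects from hot (Standard) to warm and cold storage
# classes, then expire them. Bucket, prefix, and thresholds are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cool-down-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold/frozen
                ],
                "Expiration": {"Days": 365},  # delete after a year
            }
        ]
    },
)
```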

Data Value

Although we would like to extract useful information from all of the data we collect, not all data is equally important to us. Some data has to be preserved at all costs, and other data can be easily regenerated as needed or even lost without significant impact on the business. Depending on the value of data, we are more or less willing to invest in additional durability.

To optimize cost and/or performance further, segment data within each workload by value and temperature, and consider different data storage options for different segments.

Transient data : Transient data is often short-lived. The loss of some subset of transient data does not have significant impact on the system as a whole. Examples include clickstream or Twitter data. We often do not need high durability of this data, because we expect it to be quickly consumed and transformed further, yielding higher-value data. If we lose a tweet or a few clicks, this is unlikely to affect our sentiment analysis or user behavior analysis.

Not all streaming data is transient, however. For example, in an intrusion detection system (IDS), every record representing network communication can be valuable, and every log record can be valuable for a monitoring/alarming system.

Reproducible data : Reproducible data contains a copy of useful information that is often created to improve performance or simplify consumption, such as adding more structure or altering a structure to match consumption patterns. Although the loss of some or all of this data may affect a system’s performance or availability, this will not result in data loss, because the data can be reproduced from other data sources.

Examples include data warehouse data, read replicas of OLTP (online transaction processing) systems, and many types of caches. For this data, we may invest a bit in durability to reduce the impact on the system’s performance and availability, but only to a point.

Authoritative data : Authoritative data is the source of truth. Losing this data will have significant business impact because it will be difficult, or even impossible, to restore or replace it. For this data, we are willing to invest in additional durability. The greater the value of this data, the more durability we will want.

Critical/Regulated data : Critical or regulated data is data that a business must retain at almost any cost. This data tends to be stored for long periods of time and needs to be protected from accidental and malicious changes—not just data loss or corruption. Therefore, in addition to durability, cost and security are equally important factors.

One Tool Does Not Fit All

Despite the many applications of a hammer, it cannot replace a screwdriver or a pair of pliers. Likewise, there is no one-size-fits-all solution for data storage. Analyze your data and understand the dimensions that we have discussed. Once you have done that, you can move on to reviewing the different storage options available on AWS to find the right tool to store and access your files.

For the exam, know the availability, level of durability, and cost factors for each storage option and how they compare.

Block, Object, and File Storage

There are three types of cloud storage: object, file, and block. Each offers its own unique advantages.

Block Storage

Some enterprise applications, like databases or enterprise resource planning systems (ERP systems), can require dedicated, low-latency storage for each host. This is analogous to direct-attached storage (DAS) or a storage area network (SAN). Block-based cloud storage solutions like Amazon EBS are provisioned with each Amazon Elastic Compute Cloud (Amazon EC2) instance and offer the ultra-low latency required for high-performance workloads.
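To show how block storage is provisioned per instance, the boto3 sketch below creates an EBS volume and attaches it to an EC2 instance. The Availability Zone, instance ID, and device name are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Provision a block device and attach it to a specific instance.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # must match the instance's AZ
    Size=100,                        # size in GiB
    VolumeType="gp3",
)

# Wait until the volume is ready before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)
```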

Object Storage

Applications developed on the cloud often take advantage of object storage’s vast scalability and metadata characteristics. Object storage solutions like Amazon S3 are ideal for building modern applications from scratch that require scale and flexibility and can also be used to import existing data stores for analytics, backup, or archive. Cloud object storage makes it possible to store virtually limitless amounts of data in its native format.
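A minimal boto3 sketch of the object model follows: an object is written and read as a whole, with user-defined metadata attached. The bucket name, key, and metadata are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Store an object along with user-defined metadata.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2023/summary.json",
    Body=b'{"status": "ok"}',
    Metadata={"department": "analytics"},
)

# Retrieve the object as a whole.
response = s3.get_object(Bucket="my-example-bucket", Key="reports/2023/summary.json")
data = response["Body"].read()
```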

File Storage

Many applications need to access shared files and require a file system. This type of storage is often supported with a network-attached storage (NAS) server. File storage solutions like Amazon EFS are ideal for use cases such as large content repositories, development environments, media stores, or user home directories.
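As a rough sketch of how a shared file system is set up with boto3, the example below creates an EFS file system and a mount target in one subnet. The creation token, subnet ID, and security group are placeholders.

```python
import boto3

efs = boto3.client("efs")

# Create a shared file system.
fs = efs.create_file_system(
    CreationToken="example-shared-fs",
    PerformanceMode="generalPurpose",
)

# Expose it in one subnet so instances there can mount it over NFS.
# (In practice, wait until the file system reaches the 'available' state first.)
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
```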

AWS Shared Responsibility Model and Storage

The AWS shared responsibility model is important to understand as it relates to cloud storage. AWS is responsible for securing the storage services. As a developer and customer, you are responsible for securing access to and using encryption on the artifacts you create or objects you store. AWS makes this model simpler for you by allowing you to inherit certain compliance factors and controls, but you must still ensure that you are securing your data and files on the cloud. It is a best practice always to use the principle of least privilege as part of your responsibility for using AWS Cloud storage. For example, ensure that only those who need access to the file have access and ensure that read and write access are separated and controlled.
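One way to express this separation of read and write access is with distinct least-privilege policy documents, shown as a sketch below. The bucket name and role names are hypothetical.

```python
import json

# Two separate policy documents: one identity gets read-only access,
# another gets write-only access to the same (placeholder) bucket.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-example-bucket/*",
    }],
}

write_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::my-example-bucket/*",
    }],
}

# Each document would then be attached to a different role, for example:
# iam = boto3.client("iam")
# iam.put_role_policy(RoleName="app-reader", PolicyName="s3-read",
#                     PolicyDocument=json.dumps(read_only_policy))
```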

Confidentiality, Integrity, Availability Model

The confidentiality, integrity, availability model (CIA model) forms the fundamentals of information security, and you can apply the principles of the CIA model to AWS storage. Confidentiality can be equated to the privacy level of your data. It refers to levels of encryption or access policies for your storage or individual files. With this principle, you will limit access to prevent accidental information disclosure by restricting permissions and enabling encryption.
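As an example of applying the confidentiality principle, the boto3 sketch below turns on default server-side encryption for a bucket so that new objects are encrypted at rest. The bucket name is a placeholder, and SSE-S3 (AES-256) is used for simplicity.

```python
import boto3

s3 = boto3.client("s3")

# Encrypt new objects at rest by default.
s3.put_bucket_encryption(
    Bucket="my-example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)
```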

Integrity refers to whether your data is trustworthy and accurate. For example, can you trust that the file you generated has not been changed when it is audited later?

Restrict permissions on who can modify data, and enable backup and versioning.

Availability refers to whether an authorized party can gain reliable access to the storage service and its data on AWS when needed.

Restrict permissions on who can delete data, enable multi-factor authentication (MFA) for Amazon S3 delete operations, and enable backup and versioning.
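The versioning control just mentioned can be enabled with a single boto3 call, sketched below. The bucket name is a placeholder, and the MFA Delete variant is shown only as a commented example because it requires the bucket owner's root credentials and an MFA device.

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning so accidental overwrites and deletes can be recovered.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# MFA Delete can also be enabled, but only with the root account's MFA
# device serial and current code supplied, for example:
# s3.put_bucket_versioning(
#     Bucket="my-example-bucket",
#     VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
#     MFA="arn:aws:iam::111122223333:mfa/root-device 123456",
# )
```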

Figure: The CIA model

AWS storage services provide many features for maintaining the desired level of confidentiality, integrity, and availability. Each of these features is discussed under its corresponding storage-option section in this chapter.