
Data Warehouse

If you are performing analytics, you may want to use a data warehouse. A data warehouse is a central repository of information that you can analyze to make better-informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision-makers access the data through BI tools, SQL clients, and other analytics applications.

Data Warehouse Architecture

A data warehouse architecture consists of three tiers. The bottom tier of the architecture is the database server, where data is loaded and stored. The middle tier consists of the analytics engine that is used to access and analyze the data. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools. A data warehouse works by organizing data into a schema that describes the layout and type of data, such as integer, date, or string. When data is ingested, it is stored in various tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.

Data Warehouse Benefits

Benefits of using a data warehouse include the following:

  • Better decision-making
  • Consolidation of data from many sources
  • Data quality, consistency, and accuracy
  • Historical intelligence
  • Analytics processing that is separate from transactional databases, improving the performance of both systems

The data warehousing landscape has changed dramatically in recent years with the emergence of cloud-based services that offer high performance, simple deployment, near-infinite scaling, and easy administration at a fraction of the cost of on-premises solutions.

Comparison of Data Warehouses and Databases

A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction. The following table compares the characteristics of data warehouses and databases.

Comparison of Data Warehouse and Database Characteristics

Characteristic | Data Warehouse | Transactional Database
Suitable Workloads | Analytics, reporting, big data | Transaction processing
Data Source | Data collected and normalized from many sources | Data captured as-is from a single source, such as a transactional system
Data Capture | Bulk write operations, typically on a predetermined batch schedule | Optimized for continuous write operations as new data is available, to maximize transaction throughput
Data Normalization | Denormalized schemas, such as the star schema or snowflake schema | Highly normalized, static schemas
Data Storage | Optimized for simplicity of access and high-speed query performance by using columnar storage | Optimized for high-throughput write operations to a single row-oriented physical block
Data Access | Optimized to minimize I/O and maximize data throughput | High volumes of small read operations

Comparison of Data Warehouses and Data Lakes

Unlike a data warehouse, a data lake, as described in Chapter 3, “Hello, Storage,” is a centralized repository for all data, including structured and unstructured data. A data warehouse uses a predefined schema that is optimized for analytics. In a data lake, the schema is not defined, enabling additional types of analytics, such as big data analytics, full-text search, real-time analytics, and machine learning. The following table compares the characteristics of a data warehouse and a data lake.

Comparison of Data Warehouse and Data Lake Characteristics

Characteristic | Data Warehouse | Data Lake
Data | Relational data from transactional systems, operational databases, and line-of-business applications | Nonrelational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications
Schema | Designed before the data warehouse implementation (schema-on-write) | Written at the time of analysis (schema-on-read)
Price/Performance | Fastest query results by using higher-cost storage | Query results getting faster by using low-cost storage
Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (in other words, raw data)
Users | Business analysts, data scientists, and data developers | Data scientists, data developers, and business analysts (using curated data)
Analytics | Batch reporting, BI, and visualizations | Machine learning, predictive analytics, data discovery, and profiling

Comparison of Data Warehouses and Data Marts

A data mart is a data warehouse that serves the needs of a specific team or business unit, such as finance, marketing, or sales. It is smaller and more focused, and it may contain summaries of data that best serve its community of users. The following table compares the characteristics of a data warehouse and a data mart.

Comparison of Data Warehouse and Data Mart Characteristics

Characteristic | Data Warehouse | Data Mart
Scope | Centralized, multiple subject areas integrated together | Decentralized, specific subject area
Users | Organization-wide | A single community or department
Data Source | Many sources | A single or a few sources, or a portion of data already collected in a data warehouse
Size | Large; can be hundreds of gigabytes to petabytes | Small; generally up to tens of gigabytes
Design | Top-down | Bottom-up
Data Detail | Complete, detailed data | May hold summarized data

Amazon Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data by using standard SQL and your existing BI tools. With Amazon Redshift, you can run complex analytic queries against petabytes of structured data using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. Most results come back in seconds. Amazon Redshift is up to 10 times faster than traditional on-premises data warehouses at 1/10 the cost.

Architecture

An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. After you provision your cluster, you can upload your dataset and then perform data analysis queries. Each cluster has a leader node and one or more compute nodes, and you have a choice of a hardware platform for your cluster.

Client Applications

Amazon Redshift integrates with various data loading and extract, transform, and load (ETL) tools and BI reporting, data mining, and analytics tools. It is based on open standard PostgreSQL, so most existing SQL client applications will integrate with Amazon Redshift with only minimal changes. For important differences between Amazon Redshift SQL and PostgreSQL, see the Amazon Redshift documentation.

Leader Node

The leader node acts as the SQL endpoint and receives queries from client applications, parses the queries, and develops query execution plans. The leader node then coordinates a parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results to the client applications. The leader node also stores metadata about the cluster. Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers.

Compute Nodes

Compute nodes execute the query execution plan and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.

Node Slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and allocates the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The node size of the cluster determines the number of slices per node. The next figure shows the Amazon Redshift data warehouse architecture, including the client applications, JDBC and ODBC connections, leader node, compute nodes, and node slices.

Figure: Amazon Redshift architecture

Databases

A cluster contains one or more databases. User data is stored on the compute nodes.

Hardware Platform Options

When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. There are two categories for node types. The dense storage (DS) node types are storage-optimized using large magnetic disks and can provide up to 2 PB of storage capacity. The dense compute (DC) node types are compute-optimized. Because they use solid state drive (SSD) storage, they deliver much faster I/O compared to DS node types but provide less storage space at a maximum of 326 TB.

Table Design

Each database within an Amazon Redshift cluster can support many tables. As with most SQL-based databases, you create a table by using the CREATE TABLE command, which specifies the name of the table, the columns, and their data types. In Amazon Redshift, this command also supports specifying compression encodings, a distribution strategy, and sort keys.
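
As a minimal sketch, the following statement creates a hypothetical sales fact table; all table and column names here are illustrative:

  -- A hypothetical fact table; encodings, distribution, and sort keys
  -- are added in the examples in the sections that follow
  CREATE TABLE sales (
      sale_id     BIGINT        NOT NULL,
      customer_id INTEGER       NOT NULL,
      sale_date   DATE          NOT NULL,
      amount      DECIMAL(12,2),
      notes       VARCHAR(256)
  );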

Data Types

Each value that Amazon Redshift stores or retrieves has a data type with a fixed set of associated properties. Data types are declared when tables are created. Additional columns can be added to a table by using the ALTER TABLE command, but you cannot change the data type on an existing column. Many familiar data types are available, including the following:

Numeric data types

  • BIGINT
  • DECIMAL
  • DOUBLE PRECISION
  • INTEGER
  • REAL
  • SMALLINT

Text data types

  • CHAR
  • VARCHAR

Date data types

  • DATE
  • TIMESTAMP
  • TIMESTAMPTZ

Logical data type

  • BOOLEAN
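
As a brief sketch reusing the hypothetical sales table from earlier, columns that use these types can be added after the fact with ALTER TABLE, although the type of an existing column cannot be changed:

  -- Add new columns (the column names are illustrative)
  ALTER TABLE sales ADD COLUMN sale_ts TIMESTAMP;
  ALTER TABLE sales ADD COLUMN is_returned BOOLEAN DEFAULT false;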

Compression Encoding

Amazon Redshift uses data compression as one of the key performance optimizations. When you load data for the first time into an empty table, Amazon Redshift samples your data automatically and selects the best compression scheme for each column. Alternatively, you can specify your preferred compression encoding on a per-column basis as part of the CREATE TABLE command.
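
As a hedged sketch, per-column encodings are declared with the ENCODE keyword; the encodings below are standard Amazon Redshift encodings, but the best choice depends on your actual data:

  CREATE TABLE sales_encoded (
      sale_id   BIGINT       ENCODE raw,    -- store uncompressed
      sale_date DATE         ENCODE delta,  -- delta encoding suits ordered dates
      notes     VARCHAR(256) ENCODE lzo     -- LZO suits free-form text
  );

  -- For a table that already holds data, ANALYZE COMPRESSION reports
  -- the encodings Amazon Redshift recommends for each column
  ANALYZE COMPRESSION sales_encoded;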

Distribution Strategy

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table’s distribution style. When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.

This is one of the primary decisions when you’re creating a table in Amazon Redshift. You can configure the distribution style of a table to give Amazon Redshift hints as to how the data should be partitioned to meet your query patterns. The style you select for your database affects query performance, storage requirements, data loading, and maintenance. By choosing the best distribution strategy for each table, you can balance your data distribution and significantly improve overall system performance. When creating a table, you can choose one of three distribution styles: EVEN, KEY, or ALL.

EVEN distribution Rows are distributed across the slices in a round-robin fashion, regardless of the values in any particular column. It is an appropriate choice when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN is the default distribution type.

KEY distribution Rows are distributed according to the values in one column. The leader node attempts to place matching values on the same node slice. Use this style when you will be querying heavily against values of a specific column.

ALL distribution A copy of the entire table is distributed to every node. This ensures that every row is collocated for every join in which the table participates. This multiplies the storage required by the number of nodes in the cluster, and it takes much longer to load, update, or insert data into multiple tables. Use this style only for relatively slow-moving tables that are not updated frequently or extensively.
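
A minimal sketch of the three styles on hypothetical tables (DISTSTYLE EVEN is the default, so declaring it is optional):

  -- KEY: collocate rows that share a customer_id on the same slice
  CREATE TABLE orders (order_id BIGINT, customer_id INTEGER, region_id INTEGER)
  DISTSTYLE KEY DISTKEY (customer_id);

  -- ALL: copy a small, slow-moving dimension table to every node
  CREATE TABLE region (region_id INTEGER, region_name VARCHAR(64))
  DISTSTYLE ALL;

  -- EVEN: round-robin distribution across slices (the default)
  CREATE TABLE audit_log (event_id BIGINT, message VARCHAR(512))
  DISTSTYLE EVEN;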

Sort Keys

Another important decision to make during table creation is choosing the appropriate sort key. Amazon Redshift stores your data on disk in sorted order according to the sort key, and the query optimizer uses sort order when it determines the optimal query plans. Specify an appropriate sort key for the way that your data will be queried, filtered, or joined.

The following are some general guidelines for choosing the best sort key:

  • If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
  • If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
  • If you frequently join a table, specify the join column as both the sort key and the distribution key.
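
As a sketch that applies the first and third guidelines to a hypothetical fact table, the timestamp column leads the sort key and the frequently joined column doubles as the distribution key:

  CREATE TABLE sales_sorted (
      sale_ts     TIMESTAMP NOT NULL,
      customer_id INTEGER   NOT NULL,
      amount      DECIMAL(12,2)
  )
  DISTKEY (customer_id)            -- frequent join column
  SORTKEY (sale_ts, customer_id);  -- recent data is queried most often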

Loading Data

Loading large datasets can take a long time and consume many computing resources. How your data is loaded can also affect query performance. You can reduce these impacts by using COPY commands, bulk inserts, and staging tables when loading data into Amazon Redshift.

The COPY command loads data in parallel from Amazon S3 or other data sources in a more efficient manner than INSERT commands.
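
A minimal sketch of a parallel load from Amazon S3; the bucket, prefix, and IAM role ARN are placeholders:

  -- COPY splits the files under the prefix across the slices and loads in parallel
  COPY sales
  FROM 's3://my-bucket/sales/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
  FORMAT AS CSV
  GZIP;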

Querying Data

You can query Amazon Redshift tables by using standard SQL commands, such as SELECT statements, to query and join tables. For complex queries, you can analyze the query plan to choose better optimizations for your specific access patterns. For large clusters supporting many users, you can configure workload management (WLM) to queue and prioritize queries.
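
As an illustrative sketch against the hypothetical tables above, an ordinary join runs unchanged, and prefixing a query with EXPLAIN returns the plan that the leader node builds:

  -- Standard SQL; Amazon Redshift parallelizes the join across slices
  SELECT r.region_name, COUNT(*) AS order_count
  FROM orders o
  JOIN region r ON o.region_id = r.region_id
  GROUP BY r.region_name
  ORDER BY order_count DESC;

  -- Inspect the query plan before optimizing
  EXPLAIN
  SELECT r.region_name, COUNT(*)
  FROM orders o
  JOIN region r ON o.region_id = r.region_id
  GROUP BY r.region_name;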

Snapshots

Amazon Redshift supports snapshots, similar to Amazon RDS. You can create automated and manual snapshots, which are stored in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify. The new cluster is made available before all of the data is loaded, so you can begin querying it immediately; Amazon Redshift streams data on demand from the snapshot in response to active queries and loads the remaining data in the background. By contrast, achieving proper durability for a self-managed database requires more effort and attention. Even when using Amazon Elastic Block Store (Amazon EBS) volumes, take snapshots on a frozen file system to ensure consistency, and restoring the database might require additional operations beyond restoring a volume from a snapshot and attaching it to an Amazon EC2 instance.

Security

Securing your Amazon Redshift cluster is similar to securing other databases running in the AWS Cloud. To meet your needs, you will use a combination of IAM policies, security groups, and encryption to secure the cluster.

Encryption

Protecting the data stored in Amazon Redshift is an important aspect of your security design. Amazon Redshift supports encryption of data in transit using SSL-encrypted connections. You can also enable database encryption for your clusters to help protect data at rest. AWS recommends enabling encryption for clusters that contain sensitive data, and you might be required to use encryption depending on the compliance guidelines or regulations that govern your data. Encryption is an optional setting, and it must be configured during the cluster launch; to change encryption on a cluster, you need to create a new cluster and migrate the data. Amazon Redshift automatically integrates with AWS Key Management Service (AWS KMS).

Implement security at every level of your Amazon Redshift architecture, including the infrastructure resources, database schema, data, and network access.

Access to Amazon Redshift resources is controlled at three levels: cluster management, cluster connectivity, and database access. For details on the controls available to help you manage each of these areas, see the Amazon Redshift Cluster Management Guide at https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html.

The following are some best practices for securing your Amazon Redshift deployments:

  • Enable and use SSL when connecting to the Amazon Redshift database port.
  • Ensure that your data is available only via SSL by setting the require_ssl parameter to true in the parameter group that is associated with the cluster.
  • Use long, random database passwords generated by Amazon Redshift and store them by using a secret management system.
  • Enable cluster encryption.
  • Secure the S3 bucket by enabling Amazon S3 encryption and configuring access control for Amazon S3.
  • Secure the ETL system by enacting access control, auditing/logging, patch management, disk encryption/secure deletion, and SSL connectivity to Amazon S3.
  • Secure the BI system by enacting access control, auditing, patching, SSL connectivity to Amazon Redshift, and SSL UI (if applicable).
  • Use cluster or VPC security groups to limit Amazon Redshift access only to the necessary IP addresses (for both inbound and outbound flows).

Amazon Redshift Spectrum

Amazon Redshift also includes Redshift Spectrum, which lets you run SQL queries directly against exabytes of unstructured data in Amazon S3. No loading or transformation is required.

You can use many open data formats, including Apache Avro, CSV, Grok, Ion, JSON, Optimized Row Columnar (ORC), Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV. Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run fast, regardless of dataset size. To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that’s connected to your cluster so that you can execute SQL commands. The cluster and the data files in Amazon S3 must be in the same AWS Region.
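
As a hedged sketch (the schema name, external database, IAM role ARN, bucket, and columns are all placeholders), a typical Redshift Spectrum workflow defines an external schema and table and then queries the Amazon S3 data with ordinary SQL:

  -- Register an external schema backed by an external data catalog
  CREATE EXTERNAL SCHEMA spectrum
  FROM DATA CATALOG
  DATABASE 'spectrumdb'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
  CREATE EXTERNAL DATABASE IF NOT EXISTS;

  -- Describe Parquet files that remain in Amazon S3; nothing is loaded
  CREATE EXTERNAL TABLE spectrum.events (
      event_id BIGINT,
      event_ts TIMESTAMP,
      payload  VARCHAR(1024)
  )
  STORED AS PARQUET
  LOCATION 's3://my-bucket/events/';

  -- Query the external table (it can also be joined with local tables)
  SELECT COUNT(*) FROM spectrum.events WHERE event_ts >= '2019-01-01';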