
Amazon EMR

Amazon EMR is Amazon’s offering for running large-scale distributed workloads in the cloud using open-source projects like Apache Hadoop, Apache Spark, Apache Hive, Presto, Apache Pig, and a few others. We’ll look into the details of some of these projects later in the chapter.

Apache Hadoop Overview

Apache Hadoop is an open-source project at www.hadoop.apache.org that allows distributed processing of large datasets across clusters of computers using simple programming models. It was designed with linear scalability, high availability, and fault tolerance in mind. Apache Hadoop is one of the most popular open-source projects and is a key part of Amazon EMR.

The cofounders of Apache Hadoop, Doug Cutting and Mike Cafarella, credit the creation of Hadoop to two papers published by Google: “The Google File System,” published in 2003, and “MapReduce: Simplified Data Processing on Large Clusters,” published in 2004. Hadoop 0.1 was released in April 2006 and has since revolutionized the area of large-scale data processing.

Apache Hadoop includes the following major modules:

Hadoop Common These are the common utilities in the Apache Hadoop project that support other Hadoop modules.

Hadoop Distributed File System (HDFS) HDFS is a distributed file system that provides high-throughput access to the data, with inherent data replication (3x by default).

Hadoop MapReduce MapReduce was the original default processing framework for Hadoop; it was later split into YARN (resource management) and MapReduce (processing).

YARN YARN stands for Yet Another Resource Negotiator and is a framework to schedule jobs and manage resources across a Hadoop cluster.

Apache Hadoop became a popular data processing framework, with lots of promise to replace traditional large-scale data processing systems, which were not only inefficient but also expensive. While the technology promised a lot in terms of scale and performance, its usability left much to be desired. Enterprises looking to replace their in-house data processing systems were running many different workloads, and retrofitting them onto Hadoop via MapReduce seemed quite challenging if not impossible. For example, among the most common users of traditional data processing systems were analysts who were more familiar with data flow languages and SQL, and writing MapReduce jobs in Java was not their cup of tea. This led to the creation of projects like Apache Hive and Apache Pig.

In addition to that, data scientists wanted to use this scalable platform for machine learning, but it was challenging to develop, train, and deploy models on it, which led to projects like Apache Mahout and Spark ML. With the abundance of so many open-source projects, the landscape became crowded, and it became challenging to deploy a Hadoop platform with the necessary projects because each project team worked independently with its own release cycle. This created an opportunity for companies like Hortonworks, Cloudera, and MapR, as well as traditional data warehouse vendors like Teradata, IBM, and others, to come up with Hadoop distributions that would package the most common projects and add additional layers to deploy, secure, and manage the ecosystem of projects. The original intent of Hadoop was to crunch large datasets in a scalable manner. Deploying the platform on premises, however, where scalability is a challenge and adding and removing hardware can be problematic, works against that intent; the public cloud is the best place to deploy a Hadoop platform. In a public cloud like AWS, you can scale to any number of resources, spin clusters up and down in minutes, and pay based on the actual utilization of resources.

While you can run any of these managed Hadoop distributions on AWS yourself, Amazon offers Amazon EMR (Elastic MapReduce), a managed Hadoop distribution in the cloud, with the intent to make it easier to deploy, manage, and scale Hadoop on the Amazon platform. The overarching goal of EMR was to combine the integration and testing rigor of a commercial Hadoop distribution with the scale, simplicity, and cost effectiveness of the cloud.

Amazon EMR Overview

Amazon EMR is a managed Hadoop and Spark environment that allows you to launch Hadoop and Spark clusters in minutes without the need to do node provisioning, cluster setup, Hadoop configuration, or cluster tuning. At the time of this writing, EMR includes 21 open-source projects that are tested and integrated into a single service designed to work seamlessly. One of the core requirements from customers is to work with the latest releases of Hadoop while ensuring that they are tested for enterprise-grade deployments; each project in EMR is updated within 30 days of a version release, so you get the latest releases from the community, tested for enterprise-grade installations. Amazon EMR enables you to provision one, hundreds, or thousands of compute instances in minutes. You can use autoscaling to have EMR automatically scale your Hadoop and Spark clusters to process data of any size and scale them back down when your job is complete, so you avoid paying for unused capacity.

With S3 being the core storage engine for your data lake, Amazon EMR processes data stored in an S3 data lake, which is designed to deliver 99.999999999 percent durability, with data automatically distributed across three physically separated facilities (Availability Zones) within an AWS Region. Amazon EMR monitors your cluster, retries failed tasks, and automatically replaces poorly performing instances. You can monitor your cluster using Amazon CloudWatch to collect and track metrics, monitor log files, set alarms, and automatically react to changes in the cluster.

Amazon EMR provides multiple levels of security for your Hadoop and Spark clusters, including network isolation using Amazon VPC and encryption of data at rest in S3 using keys controlled via AWS KMS (Key Management Service) or customer-managed keys. Amazon EMR also integrates with Amazon Macie, which uses ML models to discover, classify, and protect sensitive data, and gives you the ability to control access and permissions using IAM and to authenticate with Kerberos.

Because Amazon EMR removes the need to worry about provisioning, cluster setup, Hadoop configuration, or cluster tuning, and because clusters can scale back down when a job is complete, you avoid paying for unused capacity and thus save cost. You can also save on long-running Hadoop clusters by committing to reserved instances and saving up to 75 percent, or, for intermittent workloads, you can utilize spare compute capacity and save up to 90 percent with spot instances. With Amazon EMR, the cost savings add up due to a variety of factors:

  • Less admin overhead to manage and support a Hadoop cluster
  • No up-front costs for hardware acquisition or software installation
  • Reduced operating expenses from savings on data center space, power, and cooling
  • Faster time to business value, better competitiveness from access to the latest software, and improved security in the cloud, built to meet the needs of the most security-conscious organizations

Apache Hadoop on Amazon EMR

I discussed Apache Hadoop briefly earlier in the chapter. Let’s look into the architecture of Hadoop, the key use cases, the projects in the ecosystem like Hive, Tez, Pig, and Mahout, and the need for Apache Spark. Hadoop works in a master-slave architecture for distributed storage and computation, with the master node called a name node and the slave node(s) called data node(s). The architecture originated from the massively parallel processing (MPP) world of large-scale data processing, and some databases like Teradata were already running on similar architectural frameworks. The objective of an MPP architecture is to split the storage and compute across multiple machines (data nodes) and have them managed through a single master node (name node).

The name node (master node) is responsible for managing the filesystem namespace, that is, the entire filesystem metadata, including the block locations for each file and the data nodes on which those blocks reside. The block-location information is reconstructed from the data nodes at system startup time. Because the name node holds the location map of the files and blocks, it is important to make it resilient to failures. This is typically achieved either by backing up the filesystem metadata or by running a secondary name node process (typically on a separate machine). A secondary name node, however, does not provide high availability in case of a failure. You will need to recover a failed name node by starting a new name node and using the metadata replicas to rebuild it, a process that can take between 30 minutes and one hour.

Data nodes are the actual workers in the cluster; they store and retrieve the actual data and communicate with the name node to keep its view of the namespace up to date.

The major components that we discussed earlier are as follows:

Hadoop Distributed File System (HDFS) This is an open-source filesystem that manages storage across a number of machines, each containing a portion of the dataset. The objective of HDFS is to manage large datasets across sets of commodity machines, where failure is the norm rather than the exception. HDFS is optimized for high throughput on commodity hard disks, which comes at a cost: query latency. HDFS is not designed for queries that require low-latency access to the data. The figure shows the HDFS architecture and the anatomy of a client read and a client write.

[Figure: HDFS architecture, showing the anatomy of a client read and a client write]

During the exam, if you come across a scenario demanding low-latency access or a large number of small files, HDFS will not be the right option. You are better off working with tools like HBase when such a requirement comes up.

Also, the filesystem metadata is kept in memory by the name node, which means that small files can be a huge problem for Hadoop. I’ve seen cases in practice where the name node runs out of memory while storing the metadata of lots of small files, resulting in a crash of the entire cluster.

When you talk about small files, a common question is the recommended file size on HDFS. It is important to first understand how the filesystem is built before the answer can be given. Since IO is one of the most important factors contributing to latency, typical filesystems deal with blocks of data to optimize disk reads and writes. A block is the minimum amount of data that can be read or written, and while disk block sizes are 512 bytes, most filesystems deal with block sizes of a few kilobytes. HDFS also uses a block size, but its default block size is 128 MB. The reason HDFS has large blocks is that it was intended to operate on large volumes of data, so you want to minimize the cost of seeks and fetch more data with each individual IO.

HDFS was built to run on commodity hard disks, where failure is the norm rather than the exception, and hence it provides a default 3x replication (configurable), which means each block written to HDFS is replicated to a physically separate set of machines for fault tolerance and high availability. When a client application reads data, HDFS decides which copy of the data to read depending on the availability of the blocks.
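As an aside, you can check a file’s block size and replication factor directly from the Hadoop command line. This is a minimal sketch; the path shown is a hypothetical example, not something from this chapter:

 # Print the block size (%o) and replication factor (%r) of a file
 hdfs dfs -stat "blocksize=%o replication=%r" /data/events/part-00000

 # List the blocks of the file and the data nodes holding each replica
 hdfs fsck /data/events/part-00000 -files -blocks -locations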

MapReduce MapReduce was the de facto programming model for data processing on Hadoop before Hadoop 2.0, when resource management was spun out into Apache YARN (discussed in a bit more detail shortly). A MapReduce job is basically a solution to a problem expressed in the MapReduce programming model, for example, a data transformation or some basic analytics that an application (or a client) wants to perform on the cluster, typically on data stored in HDFS or other shared storage. A MapReduce job is divided into two major tasks: Map and Reduce. These tasks are scheduled across the nodes of the cluster by YARN, which not only schedules the tasks but also works with the Application Master (we’ll look into this in a bit more detail in a later section) to reschedule failed tasks across the cluster.

Mapper tasks write their output to the local disk and not to HDFS, because the output is intermediate; it is then picked up by the reduce phase to produce the final output. The mapper output is thrown away once the job is complete. Generally speaking, the reduce tasks receive data from multiple mapper tasks, and their output is stored to HDFS for persistence. The number of mapper tasks depends on the number of input splits (an input split is defined by Hadoop), whereas the number of reduce tasks is defined independently and is not directly dependent on the size of the input. Choosing the number of reducers is considered an art rather than a science: having too few reducers can limit the overall parallelism and slow the job, while having too many reducers may create lots of small files and lots of mapper-to-reducer traffic (the shuffle and sort).
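For illustration, here is a hedged sketch of setting the reducer count explicitly on a Hadoop Streaming job. The input and output paths are hypothetical, and the streaming JAR location is the one commonly found on EMR nodes, so verify it on your cluster:

 hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -D mapreduce.job.reduces=10 \
   -input /data/raw/logs \
   -output /data/out/wordcount \
   -mapper /bin/cat \
   -reducer /usr/bin/wc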

The next figure depicts the anatomy of a MapReduce job and the major steps involved: converting the input data into splits, running the map phase, optionally applying a combiner, and shuffling and sorting before the reduce phase produces the output data to be stored on HDFS.

[Figure: Anatomy of a MapReduce job]

YARN (Yet Another Resource Negotiator) YARN, as the name indicates, is a project that was created to manage the negotiation of cluster resources between applications running on the Hadoop platform and the cluster. The objective was to split resource management out of the MapReduce v1.0 computing framework so that scheduling can be done independently of compute, paving the way for additional compute engines on the platform like Apache Tez and Apache Spark. The next figure shows how frameworks like MapReduce, Spark, and Tez can benefit from YARN’s resource negotiation capabilities. Applications like Apache Hive, Apache Pig, and Apache Crunch can build on these lower-level frameworks rather than integrating natively with YARN.

[Figure: YARN applications]

YARN has two types of long-running daemons:

  • A resource manager daemon – This daemon runs once per cluster and, as the name indicates, manages the resources of the cluster.
  • A node manager daemon – This daemon runs on every node in the cluster and is used to launch new containers and monitor existing containers. A container is basically an allocation of a set of resources from the cluster in terms of memory, CPU, and so on. For example, you might have a container that is allocated two CPU cores and 128 GB of RAM.

A YARN application follows these steps:

  1. The client contacts the resource manager daemon and asks it to run an Application Master.
  2. The resource manager has a full view of resource availability across the cluster and finds a node manager that can launch the Application Master in a container.
  3. The Application Master’s next steps depend on its intended purpose:
    • Run the application inside the container it is itself running in, compute the results, and send them back to the client.
    • Request additional containers from the resource manager (as a MapReduce YARN application does) and run the application in a distributed fashion. YARN applications can be short-lived, running for a few seconds to minutes, or can live for the duration of the cluster.
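To see this in practice, you can inspect running YARN applications and node managers from the master node. This is a minimal sketch; the application ID is a placeholder:

 yarn application -list        # running applications and their current state
 yarn node -list               # node managers registered with the resource manager
 yarn logs -applicationId application_1234567890123_0001   # aggregated logs, if log aggregation is enabled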

Hadoop on EMR

Now that you have seen the different components of Apache Hadoop, let’s look at what is offered by AWS within EMR. Since Amazon offers EC2 (Elastic Compute Cloud) instances, an EMR cluster is in fact a collection of EC2 instances, each having a different role, like master, core, and task nodes. The roles are based on the software components that get installed on the EC2 instances. While Apache Hadoop has only a name node and data nodes, EMR has three different node types:

  • Master nodes: The master node is basically the name node of the cluster, and similar to a name node, it manages the cluster by running various software components to coordinate the distribution of data and tasks among other nodes for processing. The master node is a mandatory node in the cluster, and it is possible to create a single-node cluster with only one master. The master node runs the YARN Resource Manager service to manage resources for applications as well as the HDFS NameNode service.

  • Core nodes: The core node performs the role of a data node in a Hadoop cluster; it acts as a workhorse to run tasks and also stores data in HDFS. If you have a multi-node cluster, it is mandatory to have at least one core node. Core nodes run the DataNode daemon to coordinate storage as part of HDFS. They also run the YARN NodeManager daemon (the Task Tracker in pre-YARN Hadoop) and perform other parallel computation tasks.

  • Task nodes: Task nodes also perform the role of a workhorse but are optional, as they do not store any data in HDFS. They are added to the cluster to add processing power and perform parallel computation on the data, such as running Hadoop MapReduce tasks and Spark executors. Task nodes do not run the DataNode daemon. Task nodes are often used with spot instances, and Amazon EMR provides default functionality so that when a task node running on a spot instance is terminated, running jobs don’t fail. It is for this specific reason that the Application Master process is run only on core nodes. You can add task nodes to a running cluster, as shown in the sketch that follows.
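As referenced above, here is a hedged sketch of adding a task instance group to a running cluster from the CLI; the cluster ID is a placeholder:

 aws emr add-instance-groups \
   --cluster-id j-XXXXXXXXXXXXX \
   --instance-groups InstanceGroupType=TASK,InstanceCount=4,InstanceType=m5.xlarge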

Types of EMR Clusters

Amazon EMR decouples storage and compute, which allows you to run different workloads on differently configured clusters and make the best use of the resources available while identifying the right price point for different workloads based on SLAs.

  • Persistent clusters: Traditionally, on-premises Hadoop clusters have been persistent, long-running clusters. If you have a workload that requires 24x7 cluster availability, long-running persistent clusters are the way to go.

  • Transient clusters: Transient clusters are launched for a specific job and terminated when the job completes, which makes them suitable for batch workloads. An example is a nightly job that extracts data from source systems and integrates it into a data mart that is then made available for querying. The storage and compute separation of EMR makes this possible and reduces the overall cost of running Hadoop on AWS; a minimal sketch of launching such a cluster follows this list.

  • Workload-specific clusters: Since EMR supports many applications and versions of Hadoop, you can optimize a cluster for a specific workload. Running a workload-specific Amazon EMR cluster can give you the best performance at optimal cost, and you can tweak the settings of the cluster based on individual workload needs.
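The following is a minimal sketch of a transient cluster, as referenced in the transient clusters bullet: a Spark step is submitted at creation time, and the cluster terminates when the step completes. The bucket and script path are hypothetical:

 aws emr create-cluster \
   --name "nightly-batch" \
   --release-label emr-5.28.0 \
   --applications Name=Spark \
   --use-default-roles \
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
                     InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
   --steps Type=Spark,Name=NightlyETL,ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://my-bucket/jobs/etl.py] \
   --auto-terminate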

What Instance Types to Choose

Amazon EMR offers a variety of instance types, and it can become quite complicated to pick the right one for your workloads.

Typically, there are a few major types of workloads that are run on EMR, as described in the next table. The table is important as you might be asked to identify the best instance family for a workload type.

Amazon EMR instance types

Workload Type                                   Instance Recommendation
General purpose – batch processing              M5, M4 families
Compute intensive – machine learning            C5, C4, z1d families
Memory intensive – interactive analysis         X1, R4, R5a, R5d families
Storage intensive – large HDFS requirements     D2, I3 families
Deep learning – GPU instances                   G3, P2 families

Creating an EMR Cluster

You can create an EMR cluster in three different ways: with the AWS Management Console, AWS CLI, and the Java SDK.

AWS Management Console

Please watch the video at bit.ly/2RXBguQ for a demonstration of the creation of an EMR cluster. You should practice this by doing Exercise 4.1 for a quick setup, or follow the advanced options to have more control over how the cluster is created.

AWS CLI

You can create the same cluster from a CLI with the following command:

aws emr create-cluster \
 --name "demo-cluster" \
 --release-label emr-5.28.0 \
 --applications Name=Hive Name=Spark \
 --use-default-roles \
 --instance-groups \
   InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
   InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge
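Once the command returns a cluster ID, you can track the cluster from the CLI as well; the cluster ID below is a placeholder:

 aws emr list-clusters --active
 aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX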

AWS SDK

You can create the same cluster using the Java SDK. Please follow the example in AWS Documentation to get an idea of how to create such a cluster (amzn.to/2qWJdpb).

The exam will not test you on the exact semantics of creating an EMR cluster from the CLI or programmatically. This knowledge is, however, good from an architecture perspective, and it’s good to know that you can programmatically create an EMR cluster in a data pipeline.

During the cluster creation, especially in the console, you might have seen various options like EMR MultiMaster and Glue Data Catalog. Let’s take a quick look at what those mean:

  • EMR MultiMaster – As discussed earlier during the Hadoop overview, the name node (also known as the master node) can be a single point of failure in a Hadoop cluster. EMR MultiMaster allows you to create a cluster with three master nodes, which therefore supports high availability by automatically failing over to the standby master node if the primary master node fails.

  • Glue Data Catalog settings – Glue Data Catalog can act as a central catalog for your platform. EMR allows you to use Glue Data Catalog as an external metastore, which once configured will be supported in Hive, Spark, and Presto. This is a very useful option when you need to share metadata between different clusters, services, applications, and AWS accounts.

  • Editing software settings – Another option that you might have seen during the cluster creation is Edit Software Settings. You can edit the default configurations for Hadoop applications by supplying a configuration object, which helps in configuring software on your EMR cluster. Configuration objects consist of a classification, properties, and optional nested configurations. The benefit of this option is a simpler way to edit the most common Hadoop configuration files, including hive-site, hadoop-env, hadoop-log4j, and core-site. A sketch of supplying such a configuration object appears after this list.

  • Adding steps – A step is a unit of work you can submit to the cluster, and it can be one of the following:

    • Streaming program
    • Hive program
    • Pig program
    • Spark application
    • Custom JAR

You also have the option to auto-terminate the cluster upon completion. Long-running clusters have the option of enabling termination protection, which protects the cluster from accidental termination.

  • Instance group configuration: You have the option to provision your cluster from a list of instance types, which can be acquired in an on-demand fashion or from the spot market. You can choose one of the following configurations:

    • Uniform instance groups – You can specify an instance type and purchasing option for each node type. This is the default selection.
    • Instance fleets – You have the option to specify a target capacity and Amazon EMR will fulfill it for each node type. You can specify a fleet of up to five EC2 instance types for each Amazon EMR node type (master, core, task). As a best practice you should be flexible on the kind of instances your application can work with, as that gives you the best chance to acquire and maintain spot capacity for your EMR cluster.
  • Logging – Logging is one of the most important aspects, considering that anything that goes wrong in your cluster can be identified from the error logs. By default, logs are written to the master node in /mnt/var/log for the following components:

    • Step logs
    • Hadoop and YARN component logs
    • Bootstrap action logs
    • Instance state logs

If you checked the logging box while configuring your cluster, the logs will also be written to S3 at 5-minute intervals.

  • Debugging – When debugging is enabled, Amazon EMR will archive the log files to Amazon S3 and the files will be indexed. You can then use the console to browse the step, job, task, and task-attempt logs on the cluster. The debugging logs are also pushed to S3 at a 5-minute interval.
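As referenced in the Edit Software Settings bullet, here is a hedged sketch of supplying a configuration object at cluster creation. It uses the hive-site classification to point Hive at the Glue Data Catalog; the factory class shown is the one AWS documents for this integration, but verify it against the current documentation:

 aws emr create-cluster \
   --name "glue-catalog-cluster" \
   --release-label emr-5.28.0 \
   --applications Name=Hive Name=Spark \
   --use-default-roles \
   --instance-type m5.xlarge --instance-count 3 \
   --configurations '[{
     "Classification": "hive-site",
     "Properties": {
       "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
     }
   }]'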

EMRFS

EMRFS is an implementation of HDFS used by Amazon EMR clusters to read and write data to Amazon S3. EMRFS provides the ability to store data on S3 while offering features like consistent view and data encryption, and it supports IAM role authentication and pushdown optimization with S3 Select. As discussed earlier, data in S3 is eventually consistent. EMRFS consistent view provides consistency checks for list operations and read-after-write consistency for objects in Amazon S3. Furthermore, EMRFS provides data encryption, which means you can encrypt and work with encrypted objects in EMRFS. EMRFS consistent view uses DynamoDB as a file registry and provides configuration options for the following:

  • Number of times EMRFS calls S3 after finding an inconsistency
  • Amount of time until the first retry. Subsequent retries will use an exponential back-off.
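A hedged sketch of setting these options through the emrfs-site classification at cluster creation; the property names are the documented EMRFS consistent view settings, and the values are illustrative:

 aws emr create-cluster \
   --name "consistent-view-cluster" \
   --release-label emr-5.28.0 \
   --applications Name=Spark \
   --use-default-roles \
   --instance-type m5.xlarge --instance-count 3 \
   --configurations '[{
     "Classification": "emrfs-site",
     "Properties": {
       "fs.s3.consistent": "true",
       "fs.s3.consistent.retryCount": "5",
       "fs.s3.consistent.retryPeriodSeconds": "10"
     }
   }]'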

Bootstrap Actions and Custom AMI

Bootstrap actions are scripts that can be executed on a Hadoop cluster before Hadoop daemons start up on each node. They are typically used for installing additional software, and Amazon EMR allows you to run up to 16 bootstrap actions. You can run the bootstrap actions based on certain conditions, as in the following examples:

  • RunIf – Amazon EMR provides the RunIf predefined bootstrap action to run a command when an instance-specific value is found in any of the following files:
    • instance.json
    • job-flow.json

An example of a condition statement is isMaster=true, which means that the action is run if the node running the action has a master role in the cluster.

  • Custom – Run a custom script, such as one that copies data from S3 to each node.
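A hedged sketch of attaching both kinds of bootstrap actions at cluster creation; the RunIf script path is the one AWS publishes in the elasticmapreduce bucket, and the custom script location is hypothetical:

 aws emr create-cluster \
   --name "bootstrap-demo" \
   --release-label emr-5.28.0 \
   --applications Name=Hadoop \
   --use-default-roles \
   --instance-type m5.xlarge --instance-count 3 \
   --bootstrap-actions \
     Path=s3://elasticmapreduce/bootstrap-actions/run-if,Args=["instance.isMaster=true","echo running on master node"] \
     Path=s3://my-bucket/bootstrap/install-libs.sh,Name=InstallCustomLibraries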

You can also use a Custom AMI, which can reduce the cluster start time by pre-installing applications and performing other customizations instead of using bootstrap actions. This also prevents unexpected bootstrap action failures. There are certain limitations for a custom AMI:

  • Must be an Amazon Linux AMI
  • Must be an EBS-backed AMI that uses HVM virtualization
  • Must be a 64-bit AMI
  • Must not have users with the same names as used by Hadoop applications (for example, hadoop, hdfs, yarn, or spark)

Security on EMR

An EC2 key pair is needed to enable SSH access to the master node. You also have a check box that indicates the visibility of the cluster to parties other than the creator. If the check box is unchecked (it is checked by default), the cluster is visible only to its creator in the console and CLI.
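For reference, a minimal sketch of connecting over SSH once the cluster is up; the key pair name, cluster ID, and DNS name are placeholders:

 # Using the CLI helper (requires the key pair used at cluster creation)
 aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/mykeypair.pem

 # Or plain SSH to the master node's public DNS name (the default user is hadoop)
 ssh -i ~/mykeypair.pem hadoop@<master-public-dns>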

Permissions for EMR

Amazon EMR requires three different roles for operating in the platform:

  • EMR role – Allows EMR to access resources such as EC2.
  • EC2 instance profile – Allows EC2 instances in a cluster to access resources such as S3.
  • Auto scaling role – Allows autoscaling to add and terminate instances.

Amazon EMR can create the default roles for you while creating a cluster. If the roles are already available, they will be used; if they are unavailable, they will be created. You can also specify custom roles.
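You can also create the default roles ahead of time from the CLI; to the best of my knowledge, this command creates EMR_DefaultRole and EMR_EC2_DefaultRole (with its instance profile) if they do not already exist:

 aws emr create-default-roles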

Security Configurations

You can also define security configurations to configure data encryption, Kerberos authentication, and Amazon S3 authorization for EMRFS. A security configuration can be reused across multiple EMR clusters. Security configurations can be created via the AWS Management Console, the AWS CLI, or the AWS SDKs, and can also be created with an AWS CloudFormation template. For data encryption, you can specify the following:

  • Encryption of data at rest
  • Encryption of data in motion/transit

Amazon EMR has different places for data at rest, including local disks and S3 via EMRFS. For in-transit encryption, you choose from open-source encryption features that apply to in-transit data for specific applications. The available in-transit encryption options may vary by EMR release.
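A hedged sketch of creating a reusable security configuration from the CLI that enables SSE-S3 at-rest encryption for EMRFS data and TLS in-transit encryption; the JSON keys follow the structure AWS documents for security configurations, and the certificate bundle location is hypothetical:

 aws emr create-security-configuration \
   --name "demo-security-config" \
   --security-configuration '{
     "EncryptionConfiguration": {
       "EnableAtRestEncryption": true,
       "AtRestEncryptionConfiguration": {
         "S3EncryptionConfiguration": { "EncryptionMode": "SSE-S3" }
       },
       "EnableInTransitEncryption": true,
       "InTransitEncryptionConfiguration": {
         "TLSCertificateConfiguration": {
           "CertificateProviderType": "PEM",
           "S3Object": "s3://my-bucket/certs/my-certs.zip"
         }
       }
     }
   }'

You can then attach it to a cluster with the --security-configuration option of aws emr create-cluster.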

For authentication, you can enable Kerberos authentication for interactions between certain application components on your cluster using Kerberos principals. For the majority of development and test workloads, and some production workloads, you can choose between having EMR install a KDC server on the master node of the cluster and configuring an external KDC. For production workloads, customers often prefer an external KDC, which requires you to share the KDC details with the EMR cluster. You can also configure Lake Formation integration and use corporate credentials together with Lake Formation permissions to control access to the Data Catalog and the underlying data store. You must create a data lake and configure the associated permissions in Lake Formation.

EMR Notebooks

Jupyter notebooks have become one of the most common ways for data engineers and data scientists to work with the data. While you can configure a Jupyter notebook inside an EMR cluster, AWS announced the feature of an EMR notebook, which meant you can use EMR notebooks based on Jupyter to analyze data interactively with live code, narrative text, visualizations, and more. You can create and attach notebooks to Amazon EMR clusters running Hadoop, Spark, and Livy. Notebooks run free of charge and are saved in Amazon S3 independently of clusters. When you create a notebook in Amazon EMR, you can choose to connect to an existing cluster or create a new cluster. Creating a new cluster from the Notebook interface is typically done in dev/test scenarios. You can also link the notebook to a Git repository.