Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It is serverless, which means you don’t have to manage or maintain any infrastructure and can start analyzing data immediately. Amazon Athena supports schema-on-read, which is essential in this age of growing semi-structured and multi-structured data. It supports ANSI SQL operators and functions, which allows you to port queries from other analytical tools. Amazon Athena is massively scalable, so you don’t have to manually scale the infrastructure as your data grows. It also allows you to run cross-region queries and manage your costs.

Amazon Athena has the following key benefits:

  • Query data directly on S3 and other sources – Amazon Athena enables data engineers, data analysts, and data scientists to execute SQL queries against data stored in relational, nonrelational, and custom data sources. Before November 2019, Athena was primarily used to query data from S3 (an object store); however, with the announcement of Athena Federated Query (amzn.to/2YtsR3u), Amazon Athena can now be used to query multiple data sources using built-in and custom connectors.

  • Serverless – Amazon Athena is serverless, and hence you don’t have to manage any infrastructure. You don’t need to manage configuration, software updates, failures, or infrastructure scalability. Users can focus on the core business problems rather than the infrastructure required.

  • Highly performant – Amazon Athena provides superior performance by accessing the data in a parallel fashion.

  • Pay-as-you-go model – Amazon Athena has a simple pay-as-you-go model where you are charged per query based on the amount of data scanned, and you can reduce query costs by following best practices such as using optimal file formats, compressing your data, and partitioning your data.

Amazon Athena uses Apache Presto for SQL queries and Apache Hive for DDL functionality behind the scenes.
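To make this split concrete, here is a minimal, hedged sketch (the table name, columns, and S3 location are hypothetical): the CREATE statement is Hive-style DDL, while the SELECT is handled by the Presto engine and can use Presto functions such as approx_distinct.

-- Hive-style DDL: defines a table over data already in S3 (hypothetical bucket/path)
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id bigint,
  amount double,
  order_date date
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';

-- Presto-style query: aggregates the data using a Presto function
SELECT order_date, approx_distinct(order_id) AS distinct_orders
FROM sales
GROUP BY order_date;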

Apache Presto

Apache Presto is an open-source distributed SQL query engine optimized for low-latency, ad hoc analysis of data. It originated at Facebook to meet the growing needs of its data analytics team and was later open sourced. Apache Presto can query multiple terabytes to petabytes of data and provides an extensible architecture. It is optimized for low-latency interactive querying and provides cross-platform querying capability rather than just SQL on a single non-relational store. Apache Presto has a master-slave architecture (similar to Hadoop and some MPP databases), where the coordinator acts as the master of the cluster and the workers act as slaves. The figure shows the Apache Presto architecture.

Apache Presto architecture


The key components in the architecture are as follows:

  • Coordinator – The brains of a Presto cluster. Clients connect to the coordinator to submit their queries. The coordinator has three major subcomponents:

    • Parser – Responsible for parsing queries coming from the client and identifying any syntax issues
    • Planner – Responsible for optimizing the plan for running the work across the workers. It uses rule-based and cost-based optimization techniques.
    • Scheduler – The component that tracks the worker nodes’ activities and the actual execution of the query
  • Worker – The component responsible for executing the tasks that the coordinator schedules. Worker nodes are the actual workhorses of a Presto installation; they connect to the data sources using connectors and process the data. The coordinator fetches results from the worker nodes and returns them to the client.

  • Connectors – Being a distributed query engine, Presto provides various connectors that make it easy to connect to different data sources. It has a number of built-in connectors for common data sources like Hive, Kafka, MySQL, MongoDB, Redshift, Elasticsearch, SQL Server, Redis, and Cassandra, among many others. However, you also have the ability to develop custom connectors.

Presto provides superior performance due to the following reasons:

  • Data is piped across many workers in a massively parallel processing (MPP) fashion and streamed between the multiple stages of execution.
  • Multi-threaded execution makes the processing across multiple workers faster.
  • Presto uses flat-memory data structures, which minimize garbage collection and optimize processing.
  • Presto provides optimized readers for commonly used columnar formats like Parquet and ORC.
  • Presto uses fast query optimization techniques.

While using Athena, you are abstracted from the underlying details of Apache Presto. However, it is generally a good idea to know the underlying architecture. Furthermore, Apache Presto is also a project available with Amazon EMR, and hence understanding the background and when to use plain-vanilla Apache Presto versus when to use Amazon Athena, a serverless alternative, is important from an architectural standpoint.

Apache Hive

Amazon Athena uses Apache Hive for data definition language (DDL) functionality behind the scenes. Apache Hive was a project built by Jeff Hammerbacher’s team at Facebook, with the intent of providing a framework for data warehousing on top of Apache Hadoop. Facebook was working on large-scale problems around social networks and association graphs, and the bulk of the data was stored on the Hadoop Distributed File System (HDFS). Since the primary processing framework on Hadoop was MapReduce, which required extensive Java skills and thinking of problems in terms of key/value pairs, it became difficult for analysts who were more comfortable with the SQL skillset to analyze the data. The Hive framework was created so that analysts could continue writing SQL while the framework transformed the SQL into the MapReduce paradigm. SQL as a language is ideal for set processing and is well known in the industry, which helped Hadoop as a project become more popular and more widely used. Hive supports Hive-QL (a SQL-like language) and allows you to run the SQL in either a MapReduce or Apache Tez paradigm. Hive supports primitive types like integers, floating-point numbers, strings, dates, times, and binary types, in addition to complex types like structs, maps (key/value tuples), and arrays (indexable lists). Apache Hive supports multiple data formats like Apache Parquet and Apache ORC, as well as partitioning of data for optimal processing.
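As a brief, hedged illustration of Hive-QL’s type system (the table name and columns are hypothetical), a single table definition can mix primitive and complex types and declare partitions:

CREATE TABLE user_profiles (
  user_id BIGINT,
  name STRING,
  signup_date DATE,
  address STRUCT<city: STRING, zip: STRING>,
  preferences MAP<STRING, STRING>,
  recent_logins ARRAY<TIMESTAMP>
)
PARTITIONED BY (country STRING)
STORED AS PARQUET;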

You can use Apache Hive for DDL functionality with Amazon Athena, but the complexity is abstracted from you as an end user.

Amazon Athena Use Cases and Workloads

Amazon Athena is most commonly used to analyze logs, be it from web servers or applications; to discover and explore data to understand and mine for patterns; and then to perform ad hoc analysis using Amazon QuickSight and other BI tools. Amazon Athena has a distinct place within the AWS ecosystem, where you have multiple other tools for data processing and analysis. If you have set up a data lake based on S3, you can use Amazon Athena to understand the key relationships and patterns within the dataset and then use this knowledge to model data marts within Amazon Redshift for analyzing historical data and running pre-canned reports. One of the common questions is the choice between Amazon EMR and Amazon Athena for data processing and analytics. While we will look at Amazon EMR in a lot more detail later, at a high level, Amazon EMR offers a lot more than just SQL processing. It offers a number of different framework choices like Hive, Pig, Presto, Spark, HBase, and Hadoop. Amazon EMR is more suitable if your workload requires custom Java/Python/Scala code and you would like more control over the optimization and configuration parameters used to run your workload. Amazon EMR offers you a lot more control, which comes with additional responsibility, but it is also usable for a wider spectrum of use cases and offers multiple price points depending on your use case. For example, Amazon EMR allows you to benefit from Spot Instances, which can make your workload processing quite cheap compared to On-Demand Instances. We’ll look into Amazon EMR in a lot more detail later in this chapter.

The key use cases for Athena are as follows:

  • Query data in your Amazon S3–based data lake.
  • Analyze infrastructure, operational, and application logs.
  • Perform interactive analytics using popular BI tools.
  • Enable self-service data exploration for data scientists.
  • Embed analytics capabilities into your applications.

Until November 2019, Amazon Athena was limited to using a managed data catalog to store information about the databases and schemas for the data available in Amazon S3. In the regions where AWS Glue was available, you could use the AWS Glue Data Catalog, which is typically populated using AWS Glue crawlers. In non–AWS Glue regions, Athena used an internal data catalog, which was a Hive metastore–compliant store. Amazon Athena’s internal data catalog stores all metadata for tables and columns, is highly available, and doesn’t need user management. The catalog is compliant with the Hive metastore, and hence you can use Hive queries for DDL. In November 2019, just before the AWS re:Invent conference, Amazon introduced support for custom metadata stores, which was welcomed by customers because it meant you could use a data source connector to connect Amazon Athena to any metastore of choice. Amazon Athena can now run queries that scan data across a Hive metastore, the AWS Glue Data Catalog, or any other federated data source.

Amazon Athena DDL, DML, and DCL

Amazon Athena uses Apache Hive for data definition language (DDL) statements. In order to create a table in Amazon Athena, you can choose to do the following:

  • Run a DDL statement in Amazon Athena Console.
  • Run a DDL statement from a JDBC or an ODBC driver.
  • Use the create table wizard in the Amazon Athena console.

Amazon Athena uses the schema-on-read approach to project a schema onto your data when you execute a query, thus eliminating the need for data loading or transformation. Athena does not modify your data in Amazon S3. When you create your schema objects, such as databases and tables, you simply point to the location of the data in Amazon S3, from which it is read at query time.

As an example, the following code snippet shows you how to create a table in Amazon Athena.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
 [db_name.]table_name [(col_name data_type [COMMENT col_comment] [, ...] )]
 [COMMENT table_comment]
 [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
 [ROW FORMAT row_format]
 [STORED AS file_format]
 [WITH SERDEPROPERTIES (...)]
 [LOCATION 's3_loc']
 [TBLPROPERTIES ( ['has_encrypted_data'='true | false',]
  ['classification'='aws_glue_classification',]
  property_name=property_value [, ...] ) ]

Details of the key parameters can be found in the AWS documentation at amzn.to/2LFRm8n. While the exam generally does not expect you to know each of these parameters, they are good to know when you are developing applications on Amazon Athena. Athena supports JSON, TXT, CSV, Parquet, and ORC formats via SerDes (a Serializer/Deserializer is the interface used by Apache Hive for input and output and for interpreting different data formats). Amazon Athena uses the same SerDes to interpret the data coming from Amazon S3.
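As a concrete, hedged example (the database, table, columns, and S3 path are hypothetical), the following statement creates an external table over JSON data using the OpenX JSON SerDe:

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs.requests (
  request_time string,
  client_ip string,
  uri string,
  status int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-example-bucket/weblogs/'
TBLPROPERTIES ('has_encrypted_data'='false');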

Using columnar formats like Apache Parquet and ORC can dramatically improve the performance of your queries. Typical analytical queries access a subset of the columns rather than the entire row, so using columnar formats can dramatically reduce the amount of data scanned per query, lowering I/O and CPU requirements and resulting in better overall performance and reduced cost.
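One convenient way to convert existing text data into a columnar format is a CREATE TABLE AS SELECT (CTAS) statement. The following is a hedged sketch (the source table, target location, and names are hypothetical) that writes a Snappy-compressed Parquet copy of the table defined earlier:

CREATE TABLE weblogs.requests_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/weblogs-parquet/'
) AS
SELECT *
FROM weblogs.requests;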

The table indicates the performance improvement and cost reduction on a 1 TB (terabyte) dataset stored on Amazon S3 (amzn.to/2LKT0Wt).

Performance improvements with columnar formats

Data Format     | Size on Amazon S3     | Query Run Time | Data Scanned          | Cost
Text files      | 1 TB                  | 236 seconds    | 1.15 TB               | $5.75
Apache Parquet  | 130 GB                | 6.78 seconds   | 2.51 GB               | $0.013
Savings/Speedup | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% saving

Partitioning Your Data

Partitioning allows you to restrict the amount of data scanned by your Athena queries, thus improving performance and reducing overall cost. You can partition by any column in the table, and it is quite common to define multilevel partitions based on time. The next table compares runtimes of queries between partitioned and nonpartitioned tables; both queries operate on a 74 GB uncompressed dataset in text format.
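As a hedged sketch of time-based partitioning (the table, columns, and locations are hypothetical), you declare partition columns when creating the table, load the partition metadata, and then restrict queries to specific partitions so Athena scans only the matching S3 prefixes:

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs.requests_partitioned (
  request_time string,
  client_ip string,
  uri string,
  status int
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/weblogs-partitioned/';

-- Discover partitions laid out as .../year=2019/month=11/day=15/ under the table location
MSCK REPAIR TABLE weblogs.requests_partitioned;

-- Only the single day's partition is scanned, not the whole table
SELECT count(*) AS request_count
FROM weblogs.requests_partitioned
WHERE year = '2019' AND month = '11' AND day = '15';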

Amazon Athena Workgroups

Amazon Athena is priced by the amount of data scanned on S3. You can use Amazon Athena workgroups to isolate queries between different teams, workloads, or applications and to set limits on the amount of data each query or an entire workgroup can process. Amazon Athena workgroups can also be used to isolate a workload, track query metrics, and control the costs of your queries. You can specify a unique query output location per workgroup, encrypt the results with a workgroup-specific AWS KMS key, collect and publish aggregated metrics per workgroup to Amazon CloudWatch, and use workgroup settings to eliminate the need to configure individual users. Athena workgroups can also provide workgroup metrics reports like these:

  • Total bytes of data scanned per workgroup
  • Total successful/failed queries per workgroup
  • Total query execution time per workgroup

You can also define per-query data scan thresholds; a query that exceeds its threshold is canceled. In addition, you can trigger alarms to notify users of increasing usage and cost, and disable a workgroup when its queries exceed a defined threshold.

Amazon Athena Federated Query

Amazon Athena supports federated query, which means you can run queries with Athena across relational, non-relational, object, and custom data sources. Queries can run against on-premises or cloud data sources and can be used for ad hoc investigations as well as for building complex pipelines and applications.


Athena federated query is relatively simple to use: you deploy a data source connector, register the connector, specify the catalog name, and write the SQL query to access the data. Athena uses Lambda-based data source connectors, which can be deployed in two ways:

  • One-click deployment using the AWS Serverless Application Repository
  • Deploying the connector code to AWS Lambda

You can also use registration-less federated query, which is really useful for quick prototyping. All you need to do is use "lambda:<function_name>" as the catalog name, and you can access the data source. For example: SELECT * FROM "lambda:cmdb".ec2.ec2_instances. At the time of writing this book, the following connectors are available:

  • HBase
  • DocumentDB
  • DynamoDB
  • JDBC
  • Redis/ElastiCache
  • Amazon CloudWatch Logs
  • Amazon CloudWatch Metrics
  • TPC-DS data generator
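As a hedged illustration of how a federated query reads (the "lambda:dynamo" catalog, connector, and tables are hypothetical), a single statement can join S3-resident data with data served through a Lambda-based connector:

SELECT o.order_id, o.amount, c.customer_name
FROM s3_datalake.orders o
JOIN "lambda:dynamo".default.customers c
  ON o.customer_id = c.customer_id;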

You can also write your own data source connector with Amazon Athena Query Federation SDK. An example connector is available at bit.ly/2YEm0og. The Amazon Athena Query Federation SDK includes the following key features:

  • S3 spill
  • Partition pruning
  • Parallel scans
  • Portable columnar memory format (Apache Arrow)
  • Authorization
  • Congestion control/avoidance

Amazon Athena Custom UDFs

Amazon Athena now also lets you build custom user-defined functions (UDFs), which allow you to pre- or post-process data, enforce access controls, and add custom logic during data access. You can invoke UDFs in the SELECT or filter phase of an Amazon Athena query. Amazon Athena custom UDFs provide the following features:

  • UDFs are powered by AWS Lambda.
  • UDFs support network calls.
  • UDF invocation can be used in the SELECT and/or filter phase of the query.
  • Athena optimizes performance, thus allowing you to focus on the business logic.
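As a hedged sketch that mirrors the UDF syntax used in the ML example at the end of this section (the function, Lambda function name, and table are hypothetical), a Lambda-backed UDF can be declared and then invoked in the SELECT phase:

USING FUNCTION mask_email(email VARCHAR)
 returns VARCHAR TYPE LAMBDA_INVOKE WITH (lambda_name='myUdfLambda')

SELECT customer_id,
 mask_email(email) AS masked_email
FROM customer_data.customers;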

Using Machine Learning with Amazon Athena

Amazon Athena allows you to invoke ML models for inference directly from SQL queries. You can use ML with your SQL analysis for anomaly detection, customer cohort analysis, and sales prediction from your standard analytical queries. Models deployed on SageMaker can be used for inference, and no additional setup is needed to invoke them. Combining federated query with ML makes Athena a very powerful asset in your analytical ecosystem. A typical workflow might have a data scientist using Amazon Athena federated query to select data from multiple data sources to understand patterns, performing data transformation using other tools or Athena UDFs, and then using this curated dataset to train a model on SageMaker and deploy it. Once the model is available on SageMaker, any user with access can use it for ML inference. Please follow the blog post at amzn.to/2YBsc02 to get some hands-on experience using Amazon Athena in your ML pipeline. The following code snippet shows a sample query to invoke inference from Amazon Athena:

USING FUNCTION predict
 (platform int,
  genre int,
  critic_score int,
  user_score int,
  rating int)
 returns double TYPE SAGEMAKER_INVOKE_ENDPOINT WITH
 (sagemaker_endpoint='myxgboostendpoint')

USING FUNCTION normalize_genre(value VARCHAR)
 returns int TYPE LAMBDA_INVOKE WITH (lambda_name='myLambdaFunction')

SELECT predict(platform, genre, critic_score, user_score, rating), name
FROM
 (SELECT name,
   platform,
   normalize_genre(genre) AS genre,
   critic_score,
   user_score,
   rating
  FROM video_game_data.video_games);

Amazon Athena is a very powerful tool for ad hoc analysis, and with the new federated query and ML integration features, it can be a valuable asset for any data scientist.