Welcome to Certsleader, your source for top-quality Data-Engineer-Associate dumps tailored for the Amazon Data-Engineer-Associate exam. Our comprehensive resources are designed to help you excel in your exam preparation and achieve your certification goals. Whether you are a beginner looking to start a career in AWS or an experienced professional seeking to advance your skills, Certsleader has the right tools to support your journey.
Why Certsleader is Your Best Choice:
Expertly Curated Content: Our study materials are meticulously crafted and verified by a panel of IT experts, ensuring they are accurate, relevant, and up-to-date with the latest industry standards.
Real Exam Questions: Our resources include authentic Data-Engineer-Associate exam questions and detailed answers, allowing you to familiarize yourself with the exam format and question types, and practice effectively.
Comprehensive Study Guides: Each certification guide is designed to provide in-depth knowledge and understanding of the subject matter, helping you to grasp even the most complex concepts.
Convenient Access: Our study materials are available in easy-to-download PDF files, making it convenient for you to study anytime, anywhere, and on any device.
Guaranteed Success
At Certsleader, we are committed to your success. Our practice questions and answers are designed to improve your knowledge and help you pass your exams on the first attempt with high scores. In the rare event that you do not succeed, we offer a full refund, taking responsibility for your satisfaction.
Start Your Journey with Certsleader
Join thousands of satisfied learners who have successfully passed their certification exams with Certsleader. Explore our study materials, download your PDF files, and take the first step towards a rewarding IT career today.
Amazon Data-Engineer-Associate Sample Questions
Question # 1
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column. Which solution will MOST speed up the Athena query performance?
A. Change the data format from .csv to JSON format. Apply Snappy compression.
B. Compress the .csv files by using Snappy compression.
C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
D. Compress the .csv files by using gzip compression.
Answer: C
Explanation: Amazon Athena is a serverless interactive query service that allows you to analyze data in Amazon S3 using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and Parquet. However, not all data formats are equally efficient for querying. Some data formats, such as CSV and JSON, are row-oriented, meaning that they store data as a sequence of records, each with the same fields. Row-oriented formats are suitable for loading and exporting data, but they are not optimal for analytical queries that access only a subset of columns, because Athena must read every full row to extract a single column. Row-oriented text files can be compressed as a whole, but they cannot use the per-column encoding techniques that shrink columnar data.
On the other hand, some data formats, such as ORC and Parquet, are column-oriented, meaning that they store data as a collection of columns, each with a specific data type. Column-oriented formats are ideal for analytical queries that filter, aggregate, or join data by columns. Column-oriented formats also support compression and encoding techniques that reduce the data size and improve the query performance. For example, Parquet supports dictionary encoding, which replaces repeated values with numeric codes, and run-length encoding, which replaces consecutive identical values with a single value and a count. Parquet also supports various compression algorithms, such as Snappy, GZIP, and ZSTD, that further reduce the data size.
Therefore, changing the data format from CSV to Parquet and applying Snappy compression will most speed up the Athena query performance. Parquet is a column-oriented format that allows Athena to scan only the relevant columns and skip the rest, reducing the amount of data read from S3. Snappy is a compression algorithm that favors speed: it reduces the data size while adding very little decompression time. Because Parquet compresses each column chunk separately, Snappy-compressed Parquet files remain splittable, so Athena can read them in parallel. This solution will also reduce the cost of Athena queries, as Athena charges based on the amount of data scanned from S3.
The other options are not as effective as changing the data format to Parquet and applying Snappy compression. Changing the data format from CSV to JSON and applying Snappy compression will not improve the query performance significantly, as JSON is also a row-oriented format that does not support columnar access or encoding techniques. Compressing the CSV files by using Snappy compression will reduce the data size, but it will not improve the query performance significantly, as CSV is still a row-oriented format and Athena must still read every column of every row. Compressing the CSV files by using gzip compression will also reduce the data size, but the same limitation applies, and a gzip-compressed file is not splittable, so a single reader must process each file, which limits parallelism.
References:
Amazon Athena
Choosing the Right Data Format
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
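The dictionary and run-length encoding mentioned in the explanation for Question 1 can be illustrated with a small pure-Python sketch. This is not Parquet's actual on-disk implementation; the `country` column and the helper names are hypothetical and only show why a column of repeated values shrinks so well.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> integer code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse runs of consecutive identical values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical column whose values are grouped together.
country = ["DE", "DE", "DE", "FR", "FR", "US", "US", "US", "US"]

dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)

print(dictionary)  # ['DE', 'FR', 'US']
print(codes)       # [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(runs)        # [(0, 3), (1, 2), (2, 4)]
```

Nine strings collapse to a three-entry dictionary and three (code, count) pairs. A row-oriented file interleaves this column with every other field, so no such runs exist to exploit.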
Question # 2
A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require. Which solution will meet these requirements with the LEAST effort?
A. Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements.
B. Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.
C. Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.
D. Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.
Answer: A
Explanation:
Amazon Athena is a serverless, interactive query service that enables you to analyze data
in Amazon S3 using standard SQL. AWS Lake Formation is a service that helps you build,
secure, and manage data lakes on AWS. You can use AWS Lake Formation to create data
filters that define the level of access for different IAM roles based on the columns, rows, or
tags of the data. By using Amazon Athena to query the data and AWS Lake Formation to
create data filters, the company can meet the requirements of ensuring that user groups
can access only the PII that they require with the least effort. The solution is to use Amazon
Athena to query the data in the data lake that is in Amazon S3. Then, set up AWS Lake
Formation and create data filters to establish levels of access for the company’s IAM roles.
For example, a data filter can allow a user group to access only the columns that contain
the PII that they need, such as name and email address, and deny access to the columns
that contain the PII that they do not need, such as phone number and social security
number. Finally, assign each user to the IAM role that matches the user’s PII access
requirements. This way, the user groups can access the data in the data lake securely and
efficiently. The other options are either not feasible or not optimal. Using Amazon QuickSight to access the data (option B) would require the company to pay for the QuickSight service and to configure the column-level security features for each user. Building a custom query builder UI that will run Athena queries in the background to access the data (option C) would require the company to develop and maintain the UI and to integrate it with Amazon Cognito. Creating IAM roles that have different levels of granular access (option D) is not feasible, because IAM identity-based policies control access to S3 buckets and objects, not to individual columns inside those objects.
References:
Amazon Athena
AWS Lake Formation
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 4: Data Analysis and Visualization, Section 4.3: Amazon Athena
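As a rough illustration of the data filter described for Question 2, the sketch below builds the `TableData` argument that the boto3 Lake Formation client accepts in `create_data_cells_filter`. The account ID, database, table, filter, and column names are hypothetical, and the code only constructs the request; it does not call AWS.

```python
def build_column_filter(catalog_id, database, table, filter_name, allowed_columns):
    """Build the TableData argument for a data filter that exposes only some columns."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": filter_name,
        # No row restriction: every row is visible through this filter.
        "RowFilter": {"AllRowsWildcard": {}},
        # Only these columns are visible; all other columns are hidden.
        "ColumnNames": list(allowed_columns),
    }


# Hypothetical example: one user group may see name and email, but not SSN.
marketing_filter = build_column_filter(
    catalog_id="111122223333",
    database="datalake",
    table="customers",
    filter_name="marketing_pii",
    allowed_columns=["customer_id", "name", "email"],
)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("lakeformation").create_data_cells_filter(TableData=marketing_filter)
print(marketing_filter["ColumnNames"])  # ['customer_id', 'name', 'email']
```

The filter is then granted to the IAM role for that user group, so access is defined once in Lake Formation rather than in each query tool.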
Question # 3
A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access. Which solution will meet these requirements with the LEAST effort?
A. Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster.
B. Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects.
C. Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.
D. Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.
Answer: C
Explanation: Option C is the best solution to meet the requirements with the least effort
because server-side encryption with AWS KMS keys (SSE-KMS) is a feature that allows
you to encrypt data at rest in Amazon S3 using keys managed by AWS Key Management
Service (AWS KMS). AWS KMS is a fully managed service that enables you to create and
manage encryption keys for your AWS services and applications. AWS KMS also allows
you to define granular access policies for your keys, such as who can use them to encrypt
and decrypt data, and under what conditions. By using SSE-KMS, you can protect your S3
objects by using encryption keys that only specific employees can access, without having to manage the encryption and decryption process yourself.
Option A is not a good solution because it involves using AWS CloudHSM, which is a
service that provides hardware security modules (HSMs) in the AWS Cloud. AWS
CloudHSM allows you to generate and use your own encryption keys on dedicated
hardware that is compliant with various standards and regulations. However, AWS
CloudHSM is not a fully managed service and requires more effort to set up and maintain
than AWS KMS. Moreover, AWS CloudHSM does not integrate with Amazon S3, so you
have to configure the process that writes to S3 to make calls to CloudHSM to encrypt and
decrypt the objects, which adds complexity and latency to the data protection process.
Option B is not a good solution because it involves using server-side encryption with
customer-provided keys (SSE-C), which is a feature that allows you to encrypt data at rest
in Amazon S3 using keys that you provide and manage yourself. SSE-C requires you to
send your encryption key along with each request to upload or retrieve an object. However,
SSE-C does not provide any mechanism to restrict access to the keys that encrypt the
objects, so you have to implement your own key management and access control system,
which adds more effort and risk to the data protection process.
Option D is not a good solution because it involves using server-side encryption with
Amazon S3 managed keys (SSE-S3), which is a feature that allows you to encrypt data at
rest in Amazon S3 using keys that are managed by Amazon S3. SSE-S3 automatically
encrypts and decrypts your objects as they are uploaded and downloaded from S3.
However, SSE-S3 does not allow you to control who can use the encryption keys or
under what conditions. Amazon S3 owns and manages the keys, so you cannot attach
a key policy to them: anyone who is allowed to read an object receives it decrypted.
This means that you cannot restrict access to the keys that encrypt the objects to
specific employees, which does not meet the requirements.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Protecting Data Using Server-Side Encryption with AWS KMS–Managed
Encryption Keys (SSE-KMS) - Amazon Simple Storage Service
What is AWS Key Management Service? - AWS Key Management Service
What is AWS CloudHSM? - AWS CloudHSM
Protecting Data Using Server-Side Encryption with Customer-Provided Encryption
Keys (SSE-C) - Amazon Simple Storage Service
Protecting Data Using Server-Side Encryption with Amazon S3-Managed
Encryption Keys (SSE-S3) - Amazon Simple Storage Service
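The SSE-KMS approach from Question 3 can be sketched as two pieces: the arguments for an S3 upload that requests KMS encryption, and an IAM policy that lets a principal use the key. The bucket, object key, and KMS key ARN below are hypothetical placeholders, and the code only builds the request and policy documents.

```python
import json


def sse_kms_put_args(bucket, key, body, kms_key_arn):
    """Arguments for an S3 PutObject request that encrypts the object with a KMS key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }


def kms_usage_policy(kms_key_arn):
    """IAM policy allowing a principal to encrypt and decrypt with one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # GenerateDataKey is needed to write objects, Decrypt to read them.
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            }
        ],
    }


# Hypothetical key ARN and object names.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
put_args = sse_kms_put_args("call-logs-bucket", "2024/01/log.json", b"{}", KEY_ARN)
policy = kms_usage_policy(KEY_ARN)

# With boto3 installed and credentials configured:
#   boto3.client("s3").put_object(**put_args)
print(json.dumps(policy, indent=2))
```

Attaching the policy only to the roles of the approved employees means that other users who can reach the bucket still cannot decrypt the objects.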
Question # 4
A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository. Which solution will meet these requirements with the LEAST development effort?
A. Use Amazon EMR and Apache Ranger.
B. Use a Hive metastore on an EMR cluster.
C. Use the AWS Glue Data Catalog.
D. Use a metastore on an Amazon RDS for MySQL DB instance.
Answer: C
Explanation: The AWS Glue Data Catalog is an Apache Hive metastore-compatible
catalog that provides a central metadata repository for various data sources and formats.
You can use the AWS Glue Data Catalog as an external Hive metastore for Amazon EMR
and Amazon Athena queries, and import metadata from existing Hive metastores into the Data Catalog. This solution requires the least development effort, as you can use AWS
Glue crawlers to automatically discover and catalog the metadata from Hive, and use the
AWS Glue console, AWS CLI, or Amazon EMR API to configure the Data Catalog as the
Hive metastore. The other options are either more complex or require additional steps,
such as setting up Apache Ranger for security, managing a Hive metastore on an EMR
cluster or an RDS instance, or migrating the metadata manually. References:
Using the AWS Glue Data Catalog as the metastore for Hive (Section: Specifying
AWS Glue Data Catalog as the metastore)
Metadata Management: Hive Metastore vs AWS Glue (Section: AWS Glue Data
Catalog)
AWS Glue Data Catalog support for Spark SQL jobs (Section: Importing metadata
from an existing Hive metastore)
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
(Chapter 5, page 131)
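For Question 4, pointing an EMR cluster at the AWS Glue Data Catalog is a configuration setting rather than custom development. A minimal sketch of the EMR configuration classifications is shown below, written as the Python structure that would be passed as `Configurations` when creating a cluster; how the cluster itself is created is left out.

```python
# Client factory class that makes Hive and Spark SQL on EMR use the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations():
    """EMR configuration classifications that select Glue as the Hive metastore."""
    return [
        {
            "Classification": classification,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        # hive-site covers Hive; spark-hive-site covers Spark SQL.
        for classification in ("hive-site", "spark-hive-site")
    ]


configurations = glue_metastore_configurations()
for item in configurations:
    print(item["Classification"])
```

Athena uses the Glue Data Catalog natively, so once EMR is configured this way both services read the same table definitions.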
Question # 5
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
A. Use Hadoop Distributed File System (HDFS) as a persistent data store.
B. Use Amazon S3 as a persistent data store.
C. Use x86-based instances for core nodes and task nodes.
D. Use Graviton instances for core nodes and task nodes.
E. Use Spot Instances for all primary nodes.
Answer: B,D
Explanation: The best combination of resources to meet the requirements of high
reliability, cost-optimization, and performance for running Apache Spark jobs on Amazon
EMR is to use Amazon S3 as a persistent data store and Graviton instances for core nodes
and task nodes.
Amazon S3 is a highly durable, scalable, and secure object storage service that can store any amount of data for a variety of use cases, including big data analytics. Amazon S3 is a better choice than HDFS as a persistent data store for Amazon EMR, as it decouples the storage from the compute layer: data survives cluster termination, and the cluster can be resized without moving data. Amazon S3 also supports data encryption, versioning, lifecycle management, and cross-region replication. Amazon EMR integrates seamlessly with Amazon S3, using EMR File System (EMRFS) to access data stored in Amazon S3 buckets. Amazon S3 provides strong read-after-write consistency, so objects written by one job step are immediately visible to the next.
Graviton instances are powered by Arm-based AWS Graviton2 processors that deliver up to 40% better price performance over comparable current generation x86-based instances. Graviton instances are ideal for running workloads that are CPU-bound, memory-bound, or network-bound, such as big data analytics, web servers, and open-source databases. Graviton instances are compatible with Amazon EMR, and can be used for both core nodes and task nodes. Core nodes are responsible for running the data processing frameworks, such as Apache Spark, and storing data in HDFS or the local file system. Task nodes are optional nodes that can be added to a cluster to increase the processing power and throughput. By using Graviton instances for both core nodes and task nodes, you can achieve higher performance and lower cost than using x86-based instances.
Using Spot Instances for all primary nodes is not a good option, as it can compromise the
reliability and availability of the cluster. Spot Instances are spare EC2 instances that are
available at up to 90% discount compared to On-Demand prices, but they can be
interrupted by EC2 with a two-minute notice when EC2 needs the capacity back. Primary
nodes are the nodes that run the cluster software, such as Hadoop, Spark, Hive, and Hue,
and are essential for the cluster operation. If a primary node is interrupted by EC2, the
cluster will fail or become unstable. Therefore, it is recommended to use On-Demand
Instances or Reserved Instances for primary nodes, and use Spot Instances only for task
nodes that can tolerate interruptions.
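The node recommendations in the explanation for Question 5 can be sketched as an EMR instance group layout. The structure follows the `InstanceGroups` shape accepted by the boto3 EMR `run_job_flow` call; the instance type `m6g.xlarge` (a Graviton2 type) and the node counts are assumptions chosen for illustration.

```python
def emr_instance_groups(core_count, task_count, instance_type="m6g.xlarge"):
    """Instance groups: On-Demand primary and core nodes, Spot task nodes."""
    return [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",   # losing the primary node ends the cluster
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",   # core nodes may hold HDFS blocks
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
        {
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",        # task nodes tolerate interruption
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        },
    ]


groups = emr_instance_groups(core_count=3, task_count=6)
for group in groups:
    print(group["Name"], group["Market"], group["InstanceCount"])
```

Keeping persistent data in Amazon S3 is what makes the Spot task nodes safe to lose: an interrupted node takes compute capacity with it, not data.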
Question # 6
A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
B. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
C. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
D. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.
Answer: C
Explanation: This solution meets the requirements of implementing real-time analytics
capabilities with the least operational overhead. By creating an external schema in Amazon
Redshift, you can access the data from Kinesis Data Streams using SQL queries without
having to load the data into the cluster. By creating a materialized view on top of the
stream, you can store the results of the query in the cluster and make them available for
analysis. By setting the materialized view to auto refresh, you can ensure that the view is
updated with the latest data from the stream at regular intervals. This way, you can derive
near real-time insights by using existing BI and analytics tools. References:
Amazon Redshift streaming ingestion
Creating an external schema for Amazon Kinesis Data Streams
Creating a materialized view for Amazon Kinesis Data Streams
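The two Redshift statements behind the streaming ingestion answer can be sketched with a small helper that renders them as SQL text. The schema, stream, view, and role names are hypothetical, the helper assumes trusted identifiers (it does no escaping), and the SQL follows the general shape of Redshift streaming ingestion rather than a tuned production definition.

```python
def streaming_ingestion_sql(schema, stream_name, view_name, iam_role_arn):
    """Return the two statements that set up Redshift streaming ingestion."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM KINESIS\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view_name} AUTO REFRESH YES AS\n"
        "SELECT approximate_arrival_timestamp,\n"
        "       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


# Hypothetical names for illustration only.
statements = streaming_ingestion_sql(
    schema="kds",
    stream_name="clickstream",
    view_name="clickstream_mv",
    iam_role_arn="arn:aws:iam::111122223333:role/redshift-streaming",
)
for statement in statements:
    print(statement)
    print()
```

No staging bucket, COPY job, or delivery stream is involved, which is why this option carries the least operational overhead; BI tools simply query the materialized view.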
Question # 7
A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region. Which solution will meet this requirement with the LEAST operational effort?
A. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the event to Amazon Kinesis Data Firehose. Configure Kinesis Data Firehose to write the event to the logs S3 bucket.
B. Create a trail of management events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
C. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the events to the logs S3 bucket.
D. Create a trail of data events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
Answer: D
Explanation: This solution meets the requirement of logging all writes to the S3 bucket
into another S3 bucket with the least operational effort. AWS CloudTrail is a service that
records the API calls made to AWS services, including Amazon S3. By creating a trail of
data events, you can capture the details of the requests that are made to the transactions
S3 bucket, such as the requester, the time, the IP address, and the response elements. By
specifying an empty prefix and write-only events, you can filter the data events to only
include the ones that write to the bucket. By specifying the logs S3 bucket as the
destination bucket, you can store the CloudTrail logs in another S3 bucket that is in the
same AWS Region. This solution does not require any additional coding or configuration,
and it is more scalable and reliable than using S3 Event Notifications and Lambda
functions. References:
Logging Amazon S3 API calls using AWS CloudTrail
Creating a trail for data events
Enabling Amazon S3 server access logging
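The trail configuration described for Question 7 comes down to one event selector. The sketch below builds the `EventSelectors` value that the boto3 CloudTrail `put_event_selectors` call accepts; the bucket and trail names are hypothetical, and the trail itself (with the logs bucket as its destination) is assumed to exist already.

```python
def write_only_data_event_selectors(bucket_name):
    """Event selectors that log only write data events for every object in a bucket."""
    return [
        {
            "ReadWriteType": "WriteOnly",
            # Data events only; management events are not needed for this requirement.
            "IncludeManagementEvents": False,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # Bucket ARN with a trailing slash and no prefix: all objects.
                    "Values": [f"arn:aws:s3:::{bucket_name}/"],
                }
            ],
        }
    ]


selectors = write_only_data_event_selectors("transactions-bucket")

# With boto3 installed and credentials configured:
#   boto3.client("cloudtrail").put_event_selectors(
#       TrailName="transactions-writes", EventSelectors=selectors)
print(selectors[0]["DataResources"][0]["Values"])  # ['arn:aws:s3:::transactions-bucket/']
```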
Question # 8
A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data. Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an AWS Lambda function to load data from the S3 bucket into a pandas dataframe. Write a SQL SELECT statement on the dataframe to query the required column.
B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
C. Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
D. Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.
Answer: B
Explanation: Option B is the best solution to meet the requirements with the least
operational overhead because S3 Select is a feature that allows you to retrieve only a
subset of data from an S3 object by using simple SQL expressions. S3 Select works on
objects stored in CSV, JSON, or Parquet format. By using S3 Select, you can avoid the
need to download and process the entire S3 object, which reduces the amount of data
transferred and the computation time. S3 Select is also easy to use and does not require
any additional services or resources.
Option A is not a good solution because it involves writing custom code and configuring an
AWS Lambda function to load data from the S3 bucket into a pandas dataframe and query
the required column. This option adds complexity and latency to the data retrieval process
and requires additional resources and configuration. Moreover, AWS Lambda has
limitations on the execution time, memory, and concurrency, which may affect the
performance and reliability of the data retrieval process.
Option C is not a good solution because it involves creating and running an AWS Glue
DataBrew project to consume the S3 objects and query the required column. AWS Glue
DataBrew is a visual data preparation tool that allows you to clean, normalize, and
transform data without writing code. However, in this scenario, the data is already in
Parquet format, which is a columnar storage format that is optimized for analytics.
Therefore, there is no need to use AWS Glue DataBrew to prepare the data. Moreover,
AWS Glue DataBrew adds extra time and cost to the data retrieval process and requires
additional resources and configuration.
Option D is not a good solution because it involves running an AWS Glue crawler on the S3
objects and using a SQL SELECT statement in Amazon Athena to query the required
column. An AWS Glue crawler is a service that can scan data sources and create metadata
tables in the AWS Glue Data Catalog. The Data Catalog is a central repository that stores
information about the data sources, such as schema, format, and location. Amazon Athena
is a serverless interactive query service that allows you to analyze data in S3 using
standard SQL. However, in this scenario, the schema and format of the data are already
known and fixed, so there is no need to run a crawler to discover them. Moreover, running
a crawler and using Amazon Athena adds extra time and cost to the data retrieval process
and requires additional services and configuration.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
S3 Select and Glacier Select - Amazon Simple Storage Service
AWS Lambda - FAQs
What Is AWS Glue DataBrew? - AWS Glue DataBrew
Populating the AWS Glue Data Catalog - AWS Glue
What is Amazon Athena? - Amazon Athena
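The S3 Select request from Question 8 can be sketched as the arguments for the boto3 S3 `select_object_content` call. The bucket, object key, and column name are hypothetical, and the code only builds the request rather than sending it.

```python
def s3_select_args(bucket, key, column):
    """Arguments that retrieve a single column from a Parquet object with S3 Select."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        # S3Object is the fixed table name that S3 Select uses for the object.
        "Expression": f'SELECT s."{column}" FROM S3Object s',
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"CSV": {}},
    }


args = s3_select_args("analytics-bucket", "exports/orders.parquet", "order_total")

# With boto3 installed and credentials configured, the result arrives as an
# event stream; the "Records" events carry the selected bytes:
#   response = boto3.client("s3").select_object_content(**args)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode("utf-8"), end="")
print(args["Expression"])  # SELECT s."order_total" FROM S3Object s
```

For a one-time task this is a single API call against the object, with no crawler, catalog table, or project to create and clean up afterward.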
Question # 9
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country as the analysts. Which solution will meet these requirements with the LEAST operational effort?
A. Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.
B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.
C. Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
D. Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.
Answer: B
Explanation: AWS Lake Formation is a service that allows you to easily set up, secure,
and manage data lakes. One of the features of Lake Formation is row-level security, which
enables you to control access to specific rows or columns of data based on the identity or
role of the user. This feature is useful for scenarios where you need to restrict access to
sensitive or regulated data, such as customer data from different countries. By registering
the S3 bucket as a data lake location in Lake Formation, you can use the Lake Formation
console or APIs to define and apply row-level security policies to the data in the bucket.
You can also use Lake Formation blueprints to automate the ingestion and transformation
of data from various sources into the data lake. This solution requires the least operational
effort compared to the other options, as it does not involve creating or moving data, or
managing multiple tables, views, or roles. References:
AWS Lake Formation
Row-Level Security
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 4: Data Lakes and Data Warehouses, Section 4.2: AWS Lake Formation
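The row-level security described for Question 9 can be sketched as one row filter per country plus a grant on that filter. The structures follow the shapes used by the boto3 Lake Formation calls `create_data_cells_filter` and `grant_permissions`; the `country` column, table names, account ID, and role ARN are hypothetical.

```python
def country_row_filter(catalog_id, database, table, country_code):
    """TableData for a filter exposing all columns but only one country's rows."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"customers_{country_code.lower()}",
        # Assumes the table has a column named "country".
        "RowFilter": {"FilterExpression": f"country = '{country_code}'"},
        "ColumnWildcard": {},  # all columns are visible
    }


def grant_select_on_filter(role_arn, table_data):
    """Arguments granting SELECT through the filter to one analyst role."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


# One filter and one grant per country, all against the same single table.
filters = {
    code: country_row_filter("111122223333", "hub", "customers", code)
    for code in ("DE", "FR", "US")
}
grant_de = grant_select_on_filter(
    "arn:aws:iam::111122223333:role/analyst-de", filters["DE"]
)
print(filters["DE"]["RowFilter"]["FilterExpression"])  # country = 'DE'
```

Adding a country means adding one filter and one grant; the data stays in a single table and a single bucket.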
Question # 10
A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance. The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet. Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)
A. Turn on the public access setting for the DB instance.
B. Update the security group of the DB instance to allow only Lambda function invocations on the database port.
C. Configure the Lambda function to run in the same subnet that the DB instance uses.
D. Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
E. Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.
Answer: C,D
Explanation: To enable the Lambda function to connect to the RDS DB instance privately
without using the public internet, the best combination of steps is to configure the Lambda
function to run in the same subnet that the DB instance uses, and attach the same security
group to the Lambda function and the DB instance. This way, the Lambda function and the
DB instance can communicate within the same private network, and the security group can
allow traffic between them on the database port. This solution has the least operational
overhead, as it does not require any changes to the public access setting, the network
ACL, or the security group of the DB instance.
The other options are not optimal for the following reasons:
A. Turn on the public access setting for the DB instance. This option is not
recommended, as it would expose the DB instance to the public internet, which
can compromise the security and privacy of the data. Moreover, this option would
not enable the Lambda function to connect to the DB instance privately, as it would
still require the Lambda function to use the public internet to access the DB
instance.
B. Update the security group of the DB instance to allow only Lambda function
invocations on the database port. This option is not sufficient, as it would only
modify the inbound rules of the security group of the DB instance, but not the
outbound rules of the security group of the Lambda function. Moreover, this option would not enable the Lambda function to connect to the DB instance privately, as it
would still require the Lambda function to use the public internet to access the DB
instance.
E. Update the network ACL of the private subnet to include a self-referencing rule
that allows access through the database port. This option is not necessary, as the
default network ACL already allows all inbound and outbound traffic. Moreover,
network ACL rules match CIDR ranges and cannot reference themselves or a
security group, and changing the network ACL alone would not place the Lambda
function inside the VPC, so the function still could not reach the DB instance privately.
References:
1: Connecting to an Amazon RDS DB instance
2: Configuring a Lambda function to access resources in a VPC
3: Working with security groups
4: Network ACLs
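The two chosen steps for Question 10 can be sketched as request arguments: a VPC configuration for the Lambda function and a self-referencing ingress rule on the shared security group. The shapes follow the boto3 calls `update_function_configuration` (Lambda) and `authorize_security_group_ingress` (EC2); the subnet ID, security group ID, and port 3306 are hypothetical values.

```python
def lambda_vpc_config(subnet_ids, security_group_id):
    """VpcConfig that places the function in the DB subnets with the shared group."""
    return {
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": [security_group_id],
    }


def self_referencing_ingress(security_group_id, db_port):
    """Ingress rule allowing members of the group to reach each other on db_port."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": db_port,
                "ToPort": db_port,
                # The source is the group itself, which makes the rule self-referencing.
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


SHARED_SG = "sg-0123456789abcdef0"  # hypothetical security group ID
vpc_config = lambda_vpc_config(["subnet-0aaa1111bbbb2222c"], SHARED_SG)
ingress = self_referencing_ingress(SHARED_SG, 3306)

# With boto3 installed and credentials configured:
#   boto3.client("lambda").update_function_configuration(
#       FunctionName="orders-writer", VpcConfig=vpc_config)
#   boto3.client("ec2").authorize_security_group_ingress(**ingress)
print(ingress["IpPermissions"][0]["UserIdGroupPairs"])
```

Because the rule's source is the group rather than an IP range, it keeps working when the function's network interfaces or the DB instance's address change.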
Question # 11
A company has five offices in different AWS Regions. Each office has its own human resources (HR) department that uses a unique IAM role. The company stores employee records in a data lake that is based on Amazon S3 storage. A data engineering team needs to limit access to the records. Each HR department should be able to access records for only employees who are within the HR department's Region. Which combination of steps should the data engineering team take to meet this requirement with the LEAST operational overhead? (Choose two.)
A. Use data filters for each Region to register the S3 paths as data locations. B. Register the S3 path as an AWS Lake Formation location. C. Modify the IAM roles of the HR departments to add a data filter for each department's Region. D. Enable fine-grained access control in AWS Lake Formation. Add a data filter for each Region. E. Create a separate S3 bucket for each Region. Configure an IAM policy to allow S3 access. Restrict access based on Region.
Answer: B,D
Explanation: AWS Lake Formation is a service that helps you build, secure, and manage
data lakes on Amazon S3. You can register the S3 path as a data lake location and enable
fine-grained access control to limit access to the records based on the HR department’s
Region. A data filter specifies which rows (through a row filter expression) and which
columns of a table a principal can read. You create one data filter per Region and grant
permissions on it to the IAM role of the matching HR department. This solution meets the
requirement with the least operational overhead, because it centralizes data lake security
and reuses the existing IAM roles of the HR departments [1, 2].
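As a rough illustration of a per-Region data filter, the sketch below builds the request body that the Lake Formation `CreateDataCellsFilter` operation expects (the `TableData` argument of `create_data_cells_filter` in boto3). The account ID, database name, table name, and the `region` column are hypothetical; the question does not say how the records are modeled.

```python
def region_filter_request(catalog_id, database, table, region, region_column="region"):
    """Build the TableData argument for one per-Region data filter."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"hr_{region.replace('-', '_')}",
        # Row filter: keep only rows whose Region column matches.
        "RowFilter": {"FilterExpression": f"{region_column} = '{region}'"},
        # Column filter: expose every column (nothing excluded).
        "ColumnWildcard": {"ExcludedColumnNames": []},
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    request = region_filter_request("111122223333", "hr_lake", "employee_records", "eu-west-1")
    print(request["RowFilter"]["FilterExpression"])
    # With credentials configured, the filter could then be created with:
    #   boto3.client("lakeformation").create_data_cells_filter(TableData=request)
```

Each HR role would then be granted SELECT on its own filter, so no IAM role or bucket layout has to change.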
The other options are not optimal for the following reasons:
A. Use data filters for each Region to register the S3 paths as data locations. This
option is not possible, as data filters are not used to register S3 paths as data
locations, but to restrict which rows and columns of a table a principal can read.
Moreover, this option does not specify how to limit access to
the records based on the HR department’s Region.
C. Modify the IAM roles of the HR departments to add a data filter for each
department’s Region. This option is not possible, as data filters are not added to
IAM roles, but to permissions granted by AWS Lake Formation. Moreover, this
option does not specify how to register the S3 path as a data lake location, or how
to enable fine-grained access control in AWS Lake Formation.
E. Create a separate S3 bucket for each Region. Configure an IAM policy to allow
S3 access. Restrict access based on Region. This option is not recommended, as
it would require more operational overhead to create and manage multiple S3
buckets, and to configure and maintain IAM policies for each HR department.
Moreover, this option does not leverage the benefits of AWS Lake Formation, such
as data cataloging, data transformation, and data governance.
References:
1: AWS Lake Formation
2: AWS Lake Formation Permissions
: AWS Identity and Access Management
: Amazon S3
Question # 12
A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records. A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data. Which solution will meet these requirements with the LEAST operational overhead?
A. Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift. B. Use the streaming ingestion feature of Amazon Redshift. C. Load the data into Amazon S3. Use the COPY command to load the data into Amazon Redshift. D. Use the Amazon Aurora zero-ETL integration with Amazon Redshift.
Answer: B
Explanation: The streaming ingestion feature of Amazon Redshift enables you to ingest
data from streaming sources, such as Amazon Kinesis Data Streams, directly into Amazon
Redshift in near real time. You map the stream to an external schema and define a
materialized view over it; each refresh of the materialized view pulls the new stream
records into Amazon Redshift. Because the refresh is incremental, the view accumulates
data over time, so it can serve near real-time analytics of the streaming data together
with the previous day’s data in the Amazon Redshift Serverless warehouse. This solution
meets the requirements with the least operational overhead, as it does not require any
additional services or components to ingest and process the streaming data. The other options are
either not feasible or not optimal. Loading data into Amazon Kinesis Data Firehose and
then into Amazon Redshift (option A) would introduce additional latency and cost, as well
as require additional configuration and management. Loading data into Amazon S3 and
then using the COPY command to load the data into Amazon Redshift (option C) would
also introduce additional latency and cost, as well as require additional storage space and
ETL logic. Using the Amazon Aurora zero-ETL integration with Amazon Redshift (option D)
would not work, as it requires the data to be stored in Amazon Aurora first, which is not the
case for the streaming data from the healthcare company. References:
Using streaming ingestion with Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 3: Data Ingestion and Transformation, Section 3.5: Amazon Redshift
Streaming Ingestion
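To make the streaming ingestion setup from Question # 12 concrete, the sketch below assembles the two SQL statements that are typically involved: an external schema that points at Kinesis, and an auto-refreshing materialized view over the stream. The schema name, view name, stream name, and role ARN are placeholders chosen for illustration, not values from the question.

```python
def streaming_ingestion_sql(stream_name, iam_role_arn, schema="kds", view="health_events_mv"):
    """Return the two SQL statements that set up Redshift streaming ingestion."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema} FROM KINESIS "
        f"IAM_ROLE '{iam_role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS "
        "SELECT approximate_arrival_timestamp, partition_key, "
        "JSON_PARSE(kinesis_data) AS payload "
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


if __name__ == "__main__":
    # Placeholder stream name and role ARN.
    for statement in streaming_ingestion_sql(
        "health-stream", "arn:aws:iam::111122223333:role/redshift-streaming"
    ):
        print(statement)
```

No Firehose delivery stream, S3 staging bucket, or COPY job appears in this setup, which is why option B carries the least operational overhead.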
Question # 13
A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?
A. Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame drop_duplicates() function by importing the Pandas library to perform data deduplication. B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication. C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication. D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
Answer: B
Explanation: AWS Glue is a fully managed serverless ETL service that can handle data
deduplication with minimal operational overhead. AWS Glue provides a built-in ML
transform called FindMatches, which identifies records that refer to the same entity even
when the records do not match exactly. FindMatches labels each group of matching records
with a shared match ID, which the ETL job can then use to remove duplicates. FindMatches
does not require prior ML experience or custom matching code, as it learns from a sample
of labeled data provided by the user. Because the transform runs inside an AWS Glue job,
it also scales to large datasets without servers to manage.
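The toy sketch below shows why exact-match deduplication can miss duplicates in legacy data and what grouping by similarity adds. It is not FindMatches and does not use any AWS Glue API; it swaps in the standard-library `difflib` string similarity with an arbitrary 0.8 threshold, and the sample records are invented.

```python
from difflib import SequenceMatcher


def group_similar(records, threshold=0.8):
    """Greedily group records whose similarity to a group's first record meets the threshold."""
    groups = []
    for record in records:
        for group in groups:
            if SequenceMatcher(None, group[0].lower(), record.lower()).ratio() >= threshold:
                group.append(record)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([record])
    return groups


records = [
    "Jon Smith, 12 Main St",
    "John Smith, 12 Main Street",
    "Mary Jones, 9 Oak Ave",
]
exact = set(records)            # exact matching treats all three rows as distinct
fuzzy = group_similar(records)  # similarity matching merges the two Smith rows
print(len(exact), len(fuzzy))
```

A managed transform replaces the hand-tuned threshold and pairwise loop with a model trained on labeled examples, which is the overhead that options A, C, and D leave with the engineer.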
References:
AWS Glue
FindMatches ML Transform
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 14
A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3. B. Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig. C. Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries. D. Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.
Answer: D
Explanation: Option D is the best solution to meet the requirements with the least
operational overhead because AWS Lake Formation is a fully managed service that
simplifies the process of building, securing, and managing data lakes. AWS Lake Formation allows you to define granular data access policies at the row and column level
for different users and groups. AWS Lake Formation also integrates with Amazon Athena,
Amazon Redshift Spectrum, and Apache Hive on Amazon EMR, enabling these services to
access the data in the data lake through AWS Lake Formation.
Option A is not a good solution because S3 access policies cannot restrict data access by
rows and columns. S3 access policies are based on the identity and permissions of the
requester, the bucket and object ownership, and the object prefix and tags. S3 access
policies cannot enforce fine-grained data access control at the row and column level.
Option B is not a good solution because it relies on Apache Ranger and Apache Pig, which
add configuration and maintenance work. Apache Ranger is a framework that provides
centralized security administration for Hadoop-based platforms, such as Amazon EMR, and
it can enforce row-level and column-level access policies for Apache Hive tables. However,
Apache Ranger is not a managed AWS service: you must run and maintain a Ranger
server and configure the integration on the Amazon EMR clusters. Ranger policies also do
not govern queries from Amazon Athena or Amazon Redshift Spectrum. Apache Pig is a
platform for analyzing large data sets with a high-level scripting language called Pig Latin.
It can process data stored in Amazon S3 from an Amazon EMR cluster, but the teams need
to access the data through Amazon Athena, Amazon Redshift Spectrum, and Apache Hive,
not through Apache Pig.
Option C is not a good solution because Amazon Redshift is not a suitable service for data
lake storage. Amazon Redshift is a fully managed data warehouse service that allows you
to run complex analytical queries using standard SQL. Amazon Redshift can enforce
row-level and column-level access policies for different users and groups. However,
Amazon Redshift is not designed to store and process large volumes of unstructured or
semi-structured data, which are typical characteristics of data lakes. Amazon Redshift is
also more expensive and less scalable than Amazon S3 for data lake storage.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
What Is AWS Lake Formation? - AWS Lake Formation
Using AWS Lake Formation with Amazon Athena - AWS Lake Formation
Using AWS Lake Formation with Amazon Redshift Spectrum - AWS Lake
Formation
Using AWS Lake Formation with Apache Hive on Amazon EMR - AWS Lake
Formation
Using Bucket Policies and User Policies - Amazon Simple Storage Service
Apache Ranger
Apache Pig
What Is Amazon Redshift? - Amazon Redshift
Question # 15
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution. A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations. The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes. Which solution will meet these requirements?
A. Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement. B. Change the distribution key to the table column that has the largest dimension. C. Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge. D. Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
Answer: B
Explanation: Changing the distribution key to the table column that has the largest
dimension will help to balance the load more evenly across all five compute nodes. The
distribution key determines how the rows of a table are distributed among the slices of the
cluster. If the distribution key is not chosen wisely, it can cause data skew, meaning some
slices will have more data than others, resulting in uneven CPU load and query
performance. By choosing the table column that has the largest dimension, meaning the
column that has the most distinct values, as the distribution key, the data engineer can
ensure that the rows are distributed more uniformly across the slices, reducing data skew
and improving query performance.
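The effect of key cardinality on skew can be seen in a small simulation. The sketch below is a simplification under stated assumptions: it hashes each key value with MD5 and assigns it to one of five nodes, whereas Amazon Redshift uses its own hash function and distributes rows to slices within nodes. The key values are invented.

```python
import hashlib
from collections import Counter


def max_node_share(key_values, nodes=5):
    """Hash each key value to a node and return the busiest node's share of rows."""
    counts = Counter(
        int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big") % nodes
        for value in key_values
    )
    return max(counts.values()) / len(key_values)


rows = 100_000
# Low-cardinality key: 90% of rows share one value, so they all land on one node.
skewed_key = ["US"] * (rows * 9 // 10) + ["CA", "MX"] * (rows // 20)
# High-cardinality key: every row has its own value.
unique_key = range(rows)

print(f"skewed key: busiest node holds {max_node_share(skewed_key):.0%} of rows")
print(f"unique key: busiest node holds {max_node_share(unique_key):.0%} of rows")
```

With the skewed key, one node receives at least 90% of the rows no matter how the hash falls, which mirrors the single hot node in the question; with the unique key, each node receives close to one fifth.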
The other options will not meet the requirements. Option A, changing the
sort key to be the data column that is most often used in a WHERE clause of the SQL
SELECT statement, will not affect the data distribution or the CPU load. The sort key
determines the order in which the rows of a table are stored on disk, which can improve the
performance of range-restricted queries, but not the load balancing. Option C, upgrading
the reserved node from ra3.4xlarge to ra3.16xlarge, keeps five nodes but does not change
how rows are distributed, so the skewed node would still carry most of the work while the
cost of the cluster increases. Option D, changing the primary key to be the data column
that is most often used in a WHERE clause of the SQL SELECT statement, will not affect
the data distribution or the CPU load either. In Amazon Redshift, primary key constraints
are informational only: they are not enforced, and they do not determine how rows are
placed on slices. References:
Choosing a data distribution style
Choosing a data sort key
Working with primary keys
Question # 16
A company is developing an application that runs on Amazon EC2 instances. Currently, the data that the application generates is temporary. However, the company needs to persist the data, even if the EC2 instances are terminated. A data engineer must launch new EC2 instances from an Amazon Machine Image (AMI) and configure the instances to preserve the data. Which solution will meet this requirement?
A. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume that contains the application data. Apply the default settings to the EC2 instances. B. Launch new EC2 instances by using an AMI that is backed by a root Amazon Elastic Block Store (Amazon EBS) volume that contains the application data. Apply the default settings to the EC2 instances. C. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume. Attach an Amazon Elastic Block Store (Amazon EBS) volume to contain the application data. Apply the default settings to the EC2 instances. D. Launch new EC2 instances by using an AMI that is backed by an Amazon Elastic Block Store (Amazon EBS) volume. Attach an additional EC2 instance store volume to contain the application data. Apply the default settings to the EC2 instances.
Answer: C
Explanation: Amazon EC2 instances can use two types of storage volumes: instance
store volumes and Amazon EBS volumes. Instance store volumes are ephemeral, meaning
they exist only for the life of the instance. If the instance is stopped, terminated, or
fails, the data on the instance store volume is lost. Amazon EBS volumes are persistent,
meaning they can be detached from the instance and attached to another instance, and
the data on the volume is preserved. To persist the data even if the EC2 instances are
terminated, the data engineer must store the application data on an Amazon EBS volume
that is not deleted at termination. With the default settings, only the root EBS volume of
an instance is deleted when the instance terminates; an additional EBS volume that is
attached to the instance is preserved. The solution is therefore to launch new EC2
instances from the AMI, attach an Amazon EBS volume to each instance, and configure
the application to write the data to that volume. This way, the data remains on the EBS
volume after termination and can be accessed by another instance if needed. The data
engineer can apply the default settings to the EC2 instances, because the default
behavior already preserves the attached volume. The other options do not work. With an
AMI that is backed by an EC2 instance store volume that contains the application data
(option A), the data that new instances generate is lost when the instances are
terminated. With an AMI that is backed by a root Amazon EBS volume that contains the
application data (option B), the default settings delete the root volume when the instance
terminates. Attaching an additional EC2 instance store volume to contain the application
data (option D) would not work, as the data on the instance store volume would be lost if
the instance is terminated. References:
Amazon EC2 Instance Store
Amazon EBS Volumes
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide,
Chapter 2: Data Store Management, Section 2.1: Amazon EC2
Question # 17
A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file. Which solution will meet these requirements MOST cost-effectively?
A. Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format. B. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format. C. Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format. D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.
Answer: D
Explanation: Amazon Athena is a serverless interactive query service that allows you to
analyze data in Amazon S3 using standard SQL. Athena supports various data formats,
such as CSV, JSON, ORC, Avro, and Parquet. However, not all data formats are equally
efficient for querying. Some data formats, such as CSV and JSON, are row-oriented,
meaning that they store data as a sequence of records, each with the same fields.
Row-oriented formats are suitable for loading and exporting data, but they are not optimal
for analytical queries that often access only a subset of columns, because Athena must
read every field of every record. CSV and JSON files can be compressed as whole files,
but they lack the per-column encoding techniques that reduce the data size and improve
the query performance of columnar formats.
On the other hand, some data formats, such as ORC and Parquet, are column-oriented,
meaning that they store data as a collection of columns, each with a specific data type.
Column-oriented formats are ideal for analytical queries that often filter, aggregate, or join
data by columns. Column-oriented formats also support compression and encoding
techniques that can reduce the data size and improve the query performance. For
example, Parquet supports dictionary encoding, which replaces repeated values with
numeric codes, and run-length encoding, which replaces consecutive identical values with
a single value and a count. Parquet also supports various compression algorithms, such as
Snappy, GZIP, and ZSTD, that can further reduce the data size and improve the query
performance.
Therefore, creating an AWS Glue extract, transform, and load (ETL) job to read from the
.csv structured data source and writing the data into the data lake in Apache Parquet
format will meet the requirements most cost-effectively. AWS Glue is a fully managed service that provides a serverless data integration platform for data preparation, data
cataloging, and data loading. AWS Glue ETL jobs allow you to transform and load data
from various sources into various targets, using either a graphical interface (AWS Glue
Studio) or a code-based interface (AWS Glue console or AWS Glue API). By using AWS
Glue ETL jobs, you can easily convert the data from CSV to Parquet format, without having
to write or manage any code. Parquet is a column-oriented format that allows Athena to
scan only the relevant columns and skip the rest, reducing the amount of data read from
S3. This solution will also reduce the cost of Athena queries, as Athena charges based on
the amount of data scanned from S3.
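A back-of-the-envelope calculation shows the size of the saving. The sketch below assumes 15 equally wide columns, no compression, a 1 TB dataset, and a scan price of 5 USD per TB; all four figures are illustrative assumptions, so check the current Athena pricing for your Region.

```python
def athena_scan_cost(dataset_tb, columns_read, total_columns, columnar, usd_per_tb=5.0):
    """Estimate the scan charge for one query under the equal-width column assumption."""
    if columnar:
        # Columnar layout: only the requested columns are read.
        scanned_tb = dataset_tb * columns_read / total_columns
    else:
        # Row-oriented layout: every field of every row is read.
        scanned_tb = dataset_tb
    return scanned_tb * usd_per_tb


csv_cost = athena_scan_cost(1.0, 2, 15, columnar=False)
parquet_cost = athena_scan_cost(1.0, 2, 15, columnar=True)
print(f"CSV: ${csv_cost:.2f}  Parquet: ${parquet_cost:.2f}")
```

Reading 2 of 15 columns scans 2/15 of the data, so the estimate drops from 5.00 USD to about 0.67 USD per query before any compression benefit is counted.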
The other options are not as cost-effective as creating an AWS Glue ETL job to write the
data into the data lake in Parquet format. Using an AWS Glue PySpark job to ingest the
source data into the data lake in .csv format will not improve the query performance or
reduce the query cost, as .csv is a row-oriented format that does not support columnar
access. Creating an AWS Glue ETL job to ingest the data into the data lake in JSON
format will not improve the query performance or reduce the query cost, as JSON is also
a row-oriented format that does not support columnar access. Using an AWS Glue
PySpark job to ingest the source data into the data lake in Apache Avro format will not
reduce the query cost either: Avro supports compression, but it is a row-oriented format,
so Athena must still read entire records to return one or two columns.
References:
Amazon Athena
Choosing the Right Data Format
AWS Glue
[AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide],
Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
Question # 18
A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket. The data engineer needs a solution to run the monthly analytics processes that does not require the data engineer to manage the infrastructure manually. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month. B. Use Amazon Redshift Serverless to automatically process the analytics workload. C. Use the AWS CLI to automatically process the analytics workload. D. Use AWS CloudFormation templates to automatically process the analytics workload.
Answer: B
Explanation: Amazon Redshift Serverless is a deployment option of Amazon Redshift that
enables you to run analytics workloads without provisioning or managing any clusters.
You can use Amazon Redshift Serverless to automatically process the
analytics workload, as it scales the compute resources up and down based on the query
demand, and charges you only for the resources consumed. This solution will meet the
requirements with the least operational overhead, as it does not require the data engineer
to create, delete, pause, or resume any Redshift clusters, or to manage any infrastructure
manually. You can use the Amazon Redshift Data API to run queries from the AWS CLI,
AWS SDK, or AWS Lambda functions [1, 2].
The other options are not optimal for the following reasons:
A. Use Amazon Step Functions to pause the Redshift cluster when the analytics
processes are complete and to resume the cluster to run new processes every
month. This option is not recommended, as it would still require the data engineer
to create and delete a new Redshift provisioned cluster every month, which can
incur additional costs and time. Moreover, this option would require the data
engineer to use Amazon Step Functions to orchestrate the workflow of pausing
and resuming the cluster, which can add complexity and overhead.
C. Use the AWS CLI to automatically process the analytics workload. This option
is vague and does not specify how the AWS CLI is used to process the analytics
workload. The AWS CLI can be used to run queries on data in Amazon S3 using
Amazon Redshift Serverless, Amazon Athena, or Amazon EMR, but each of these
services has different features and benefits. Moreover, this option does not
address the requirement of not managing the infrastructure manually, as the data
engineer may still need to provision and configure some resources, such as
Amazon EMR clusters or Amazon Athena workgroups.
D. Use AWS CloudFormation templates to automatically process the analytics
workload. This option is also vague and does not specify how AWS
CloudFormation templates are used to process the analytics workload. AWS
CloudFormation is a service that lets you model and provision AWS resources
using templates. You can use AWS CloudFormation templates to create and
delete a Redshift provisioned cluster every month, or to create and configure other
AWS resources, such as Amazon EMR, Amazon Athena, or Amazon Redshift
Serverless. However, this option does not address the requirement of not
managing the infrastructure manually, as the data engineer may still need to write
and maintain the AWS CloudFormation templates, and to monitor the status and
performance of the resources.
References:
1: Amazon Redshift Serverless
2: Amazon Redshift Data API
: Amazon Step Functions
: AWS CLI
: AWS CloudFormation
Question # 19
A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies. A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs. Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day. B. Use the query result reuse feature of Amazon Athena for the SQL queries. C. Add an Amazon ElastiCache cluster between the BI application and Athena. D. Change the format of the files that are in the dataset to Apache Parquet.
Answer: B
Explanation: The best solution to cost optimize the company’s use of Amazon Athena
without adding any additional infrastructure costs is to use the query result reuse feature of
Amazon Athena for the SQL queries. With this feature, Athena returns the stored result of
an earlier identical query instead of scanning the data again, as long as that result is
within the maximum age that you specify and is still in the query result location in
Amazon S3 [1]. This feature is useful for scenarios where you have a petabyte-scale
dataset that is updated infrequently, such as once a day, and you have a BI application
that runs the same queries repeatedly, such as every hour. A reused result scans no data,
so the feature reduces the amount of data scanned by your queries and the cost of running
Athena. You can enable or disable this feature at the workgroup level or at the individual
query level [1].
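At the individual query level, result reuse is requested through the `ResultReuseConfiguration` parameter of the Athena `StartQueryExecution` operation. The sketch below only builds the keyword arguments; the query text and workgroup name are hypothetical, and the 60-minute maximum age is chosen here to match the 1-hour refresh policy in the question.

```python
def reuse_query_kwargs(sql, workgroup, max_age_minutes=60):
    """Build keyword arguments for StartQueryExecution with result reuse enabled."""
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                # Reuse a stored result only if it is at most this old.
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical query and workgroup.
    kwargs = reuse_query_kwargs("SELECT region, SUM(amount) FROM trades GROUP BY region", "bi-app")
    print(kwargs["ResultReuseConfiguration"])
    # With credentials configured, the query could be submitted with:
    #   boto3.client("athena").start_query_execution(**kwargs)
```

Because the change is one extra parameter on queries the BI application already issues, it adds no infrastructure and no data conversion step.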
Option A is not the best solution, as configuring an Amazon S3 Lifecycle policy to move
data to the S3 Glacier Deep Archive storage class after 1 day would not cost optimize the
company’s use of Amazon Athena, but rather increase the cost and complexity. Amazon
S3 Lifecycle policies are rules that you can define to automatically transition objects
between different storage classes based on specified criteria, such as the age of the
object [2]. S3 Glacier Deep Archive is the lowest-cost storage class in Amazon S3, designed
for long-term data archiving that is accessed once or twice in a year [3]. While moving data to
S3 Glacier Deep Archive can reduce the storage cost, it would also increase the retrieval
cost and latency, as it takes up to 12 hours to restore the data from S3 Glacier Deep
Archive [3]. Moreover, Athena does not support querying data that is in S3 Glacier or S3
Glacier Deep Archive storage classes [4]. Therefore, using this option would not meet the
requirements of running on-demand SQL queries on the dataset.
Option C is not the best solution, as adding an Amazon ElastiCache cluster between the BI
application and Athena would not cost optimize the company’s use of Amazon Athena, but
rather increase the cost and complexity. Amazon ElastiCache is a service that offers fully
managed in-memory data stores, such as Redis and Memcached, that can improve the
performance and scalability of web applications by caching frequently accessed data.
While using ElastiCache can reduce the latency and load on the BI application, it would not
reduce the amount of data scanned by Athena, which is the main factor that determines the
cost of running Athena. Moreover, using ElastiCache would introduce additional infrastructure costs and operational overhead, as you would have to provision, manage,
and scale the ElastiCache cluster, and integrate it with the BI application and Athena.
Option D is not the best solution, as changing the format of the files that are in the dataset
to Apache Parquet would not cost optimize the company’s use of Amazon Athena without
adding any additional infrastructure costs, but rather increase the complexity. Apache
Parquet is a columnar storage format that can improve the performance of analytical
queries by reducing the amount of data that needs to be scanned and providing efficient
compression and encoding schemes. However, changing the format of the files that are in
the dataset to Apache Parquet would require additional processing and transformation
steps, such as using AWS Glue or Amazon EMR to convert the files from their original
format to Parquet, and storing the converted files in a separate location in Amazon S3. This
would increase the complexity and the operational overhead of the data pipeline, and also
incur additional costs for using AWS Glue or Amazon EMR. References:
1: Query result reuse
2: Amazon S3 Lifecycle
3: S3 Glacier Deep Archive
4: Storage classes supported by Athena
What is Amazon ElastiCache?
Amazon Athena pricing
Columnar Storage Formats
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 20
A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling. Which solution will meet this requirement?
A. Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups. B. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster. C. Turn on concurrency scaling in the settings during the creation of a new Redshift cluster. D. Turn on concurrency scaling for the daily usage quota for the Redshift cluster.
Answer: B
Explanation: Concurrency scaling is a feature that allows you to support thousands of
concurrent users and queries, with consistently fast query performance. When you turn on
concurrency scaling, Amazon Redshift automatically adds query processing power in
seconds to process queries without any delays. You can manage which queries are sent to
the concurrency-scaling cluster by configuring WLM queues. To turn on concurrency
scaling for a queue, set the Concurrency Scaling mode value to auto. The other options are
either incorrect or irrelevant, as they do not enable concurrency scaling for the existing
Redshift cluster on RA3 nodes. References:
Working with concurrency scaling - Amazon Redshift
Amazon Redshift Concurrency Scaling - Amazon Web Services
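For Question # 20, the queue-level setting lives in the `wlm_json_configuration` parameter of the cluster's parameter group, where each queue object can carry a `concurrency_scaling` property. The sketch below adds that property to a list of queue definitions; the queue names and concurrency values are made up for illustration.

```python
import json


def wlm_with_concurrency_scaling(queues):
    """Return a wlm_json_configuration string with concurrency scaling set to auto per queue."""
    configured = [dict(queue, concurrency_scaling="auto") for queue in queues]
    return json.dumps(configured)


# Hypothetical manual WLM queues.
manual_queues = [
    {"query_group": ["reports"], "query_concurrency": 5},
    {"query_concurrency": 5},  # default queue
]
print(wlm_with_concurrency_scaling(manual_queues))
```

Queues left at `off` keep their queries on the main cluster, so the setting can be limited to the queues that actually see bursts of demand.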