[Dec 27, 2021] Step by Step Guide to Prepare for Professional-Data-Engineer Exam BrainDumps [Q144-Q167]

Google Cloud Certified Professional-Data-Engineer Real Exam Questions and Answers FREE Updated on 2021


This course is typically taken by data scientists, data analysts, and business analysts who work with Google Cloud. It is a good way to prepare for the final exam because its 7 modules cover all the Professional Data Engineer exam objectives:

  • Data Analytics on the Cloud
  • Data Processing Architectures
  • Additional Resources
  • Compute and Storage Fundamentals
  • Introducing Google Cloud Platform
  • Machine Learning
  • Scaling Data Analytics

 

NEW QUESTION 144
Which of the following statements about Legacy SQL and Standard SQL is not true?

  • A. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
  • B. Standard SQL is the preferred query language for BigQuery.
  • C. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
  • D. You need to set a query language for each dataset and the default is Standard SQL.

Answer: D

Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
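
The separator difference described above can be sketched as a small conversion helper. This is only an illustration: the function name `legacy_to_standard_table` is hypothetical, and it handles just the bracketed table reference, not a full query.

```python
import re


def legacy_to_standard_table(ref: str) -> str:
    """Rewrite a legacy SQL table reference, e.g. [project:dataset.table],
    as a standard SQL reference, e.g. `project.dataset.table`."""
    # Optional "project:" part, then "dataset.table", all inside square brackets.
    match = re.fullmatch(r"\[(?:([^:\]]+):)?([^.\]]+)\.([^\]]+)\]", ref.strip())
    if match is None:
        raise ValueError(f"not a legacy SQL table reference: {ref!r}")
    project, dataset, table = match.groups()
    # Drop the project part when the reference did not include one.
    parts = [part for part in (project, dataset, table) if part]
    return "`" + ".".join(parts) + "`"


standard_ref = legacy_to_standard_table("[bigquery-public-data:samples.shakespeare]")
print(standard_ref)
```

The colon after the project name becomes a period, and the square brackets become backticks.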

 

NEW QUESTION 145
You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?

  • A. cos(X)
  • B. X^2+Y^2
  • C. X^2
  • D. Y^2

Answer: B

 

NEW QUESTION 146
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers.
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
* The report must include telemetry data from all 50,000 installations for the most recent 6 weeks (sampling once every minute).
* The report must not be more than 3 hours delayed from live data.
* The actionable report should only show suboptimal links.
* Most suboptimal links should be sorted to the top.
* Suboptimal links can be grouped and filtered by regional geography.
* User response time to load the report must be <5 seconds.

Which approach meets the requirements?

  • A. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show
    only suboptimal links in a table.
  • B. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to
    your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a
    table.
  • C. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates
    the metric, and shows only suboptimal rows in a table in Google Sheets.
  • D. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries
    all rows, applies a function to derive the metric, and then renders results in a table using the Google
    charts and visualization API.

Answer: B
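
A back-of-the-envelope calculation of the report's data volume can help when weighing the options. The sketch below uses only the figures stated in the requirements (50,000 installations, 6 weeks, one sample per minute); the variable names are illustrative.

```python
INSTALLATIONS = 50_000
WEEKS = 6
SAMPLES_PER_MINUTE = 1

# Minutes in six weeks: 6 weeks * 7 days * 24 hours * 60 minutes.
minutes_in_window = WEEKS * 7 * 24 * 60
total_rows = INSTALLATIONS * minutes_in_window * SAMPLES_PER_MINUTE

print(f"{minutes_in_window:,} samples per installation")
print(f"{total_rows:,} rows in the reporting window")
```

Roughly three billion rows must be scanned and filtered within the 5-second response-time requirement, which is far more than a spreadsheet is designed to hold.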

 

NEW QUESTION 147
You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.)

  • A. Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable
  • B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions
  • C. Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable
  • D. Configure your Cloud Dataflow pipeline to use local execution
  • E. Increase the number of nodes in the Cloud Bigtable cluster

Answer: B,E

 

NEW QUESTION 148
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics.
Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded.
The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

  • A. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
  • B. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
  • C. Add capacity (memory and disk space) to the database server by the order of 200.
  • D. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

Answer: B
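
The normalization described in option B can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and are not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per entity: patients are stored once, visits reference them.
conn.executescript("""
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        clinic     TEXT NOT NULL
    );
    CREATE TABLE visits (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
        visit_date TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "Ada", "North"), (2, "Lin", "South")],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(10, 1, "2021-01-05"), (11, 1, "2021-02-09"), (12, 2, "2021-01-20")],
)

# The report joins two narrow tables instead of self-joining one wide table.
visits_per_patient = conn.execute("""
    SELECT p.name, COUNT(v.visit_id)
    FROM patients AS p
    JOIN visits AS v ON v.patient_id = p.patient_id
    GROUP BY p.patient_id
    ORDER BY p.patient_id
""").fetchall()
print(visits_per_patient)
```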

 

NEW QUESTION 149
Which of the following are examples of hyperparameters? (Select 2 answers.)

  • A. Weights
  • B. Number of nodes in each hidden layer
  • C. Biases
  • D. Number of hidden layers

Answer: B,D

Explanation:
If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many "hidden" layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all.
They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.
Reference: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
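
The distinction can be made concrete with a small sketch in plain Python. The function `init_network` and its arguments are hypothetical: the number of hidden layers and the nodes per layer are fixed before training (hyperparameters), while the weights and biases they give rise to are the values a training job would adjust (parameters).

```python
import random


def init_network(n_inputs, n_outputs, num_hidden_layers, nodes_per_layer, seed=0):
    """Create untrained weights and biases for a fully connected network.

    num_hidden_layers and nodes_per_layer are hyperparameters: they decide
    the shape of the network and stay constant during a training job.
    """
    rng = random.Random(seed)
    sizes = [n_inputs] + [nodes_per_layer] * num_hidden_layers + [n_outputs]
    # One weight matrix and one bias vector per connection between layers.
    weights = [
        [[rng.uniform(-1, 1) for _ in range(fan_in)] for _ in range(fan_out)]
        for fan_in, fan_out in zip(sizes, sizes[1:])
    ]
    biases = [[0.0] * fan_out for fan_out in sizes[1:]]
    return weights, biases


weights, biases = init_network(4, 1, num_hidden_layers=2, nodes_per_layer=8)
print(len(weights), [len(b) for b in biases])
```

Changing either hyperparameter changes how many weights and biases exist, which is why they are chosen before training rather than learned from the data.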

 

NEW QUESTION 150
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?

  • A. Grant the consultant the Cloud Dataflow Developer role on the project.
  • B. Create a service account and allow the consultant to log on with it.
  • C. Grant the consultant the Viewer role on the project.
  • D. Create an anonymized sample of the data for the consultant to work with in a different project.

Answer: D

 

NEW QUESTION 151
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.

  • A. primary key
  • B. row key
  • C. unique key
  • D. master key

Answer: B

Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview

 

NEW QUESTION 152
You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to query all of the tables for the past 30 days in legacy SQL. What should you do?

  • A. Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
  • B. Use the TABLE_DATE_RANGE function
  • C. Use SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)
  • D. Use the WHERE _PARTITIONTIME pseudo column

Answer: B

Explanation:
Explanation/Reference: https://cloud.google.com/blog/products/gcp/using-bigquery-and-firebase-analytics-to-understand-your-mobile-app?hl=am
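
As a rough sketch of how the TABLE_DATE_RANGE function is used, the helper below builds a legacy SQL query over the daily tables. The helper name `last_n_days_query` and the dataset name `my_firebase_dataset` are assumptions for illustration; only the `app_events_` prefix comes from the question.

```python
def last_n_days_query(dataset: str, prefix: str = "app_events_", days: int = 30) -> str:
    """Build a legacy SQL query that spans the daily tables of the last N days."""
    return (
        "SELECT COUNT(*) AS event_rows\n"
        f"FROM TABLE_DATE_RANGE([{dataset}.{prefix}],\n"
        f"                      DATE_ADD(CURRENT_TIMESTAMP(), -{days}, 'DAY'),\n"
        "                      CURRENT_TIMESTAMP())"
    )


# "my_firebase_dataset" is a placeholder dataset name.
query = last_n_days_query("my_firebase_dataset")
print(query)
```

TABLE_DATE_RANGE takes the table-name prefix and two timestamps, and matches every table whose date suffix falls between them.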

 

NEW QUESTION 153
You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?

  • A. cos(X)
  • B. X^2+Y^2
  • C. X^2
  • D. Y^2

Answer: B
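
The graphic is not reproduced here, so the sketch below assumes it shows one class clustered around the origin and the other class in a ring around it. Under that assumption, the synthetic feature X^2+Y^2 turns the problem into a single threshold, which a linear model can learn. The radii, sample counts, and threshold are illustrative values.

```python
import math
import random

rng = random.Random(0)


def sample_ring(r_min, r_max, n):
    """Sample n points whose distance from the origin lies in [r_min, r_max]."""
    points = []
    for _ in range(n):
        r = rng.uniform(r_min, r_max)
        theta = rng.uniform(0, 2 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


inner = sample_ring(0.0, 1.0, 200)  # assumed class 0: cluster near the origin
outer = sample_ring(2.0, 3.0, 200)  # assumed class 1: surrounding ring


def synthetic_feature(x, y):
    """Squared distance from the origin."""
    return x * x + y * y


# A single cut on the synthetic feature is a linear decision boundary.
threshold = 1.5 ** 2
correct = sum(synthetic_feature(x, y) <= threshold for x, y in inner)
correct += sum(synthetic_feature(x, y) > threshold for x, y in outer)
accuracy = correct / (len(inner) + len(outer))
print(accuracy)
```

No straight line in the original X and Y coordinates separates such data, whereas one threshold on the new feature does.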

 

NEW QUESTION 154
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)

  • A. Only allow a service account to access the datasets.
  • B. Load data into different partitions.
  • C. Use the appropriate identity and access management (IAM) roles for each client's users.
  • D. Load data into a different dataset for each client.
  • E. Put each client's BigQuery dataset into a different table.
  • F. Restrict a client's dataset to approved users.

Answer: C,D,F

 

NEW QUESTION 155
Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
- 8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
- 3 physical servers
- Cassandra - metadata, tracking messages
- 10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
- 60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
* Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) - image storage, logs, backups
* 10 Apache Hadoop /Spark servers
- Core Data Lake
- Data analysis workloads
* 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

  • A. Cloud Pub/Sub, Cloud SQL, and Cloud Storage
  • B. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
  • C. Cloud Dataflow, Cloud SQL, and Cloud Storage
  • D. Cloud Pub/Sub, Cloud Dataflow, and Local SSD
  • E. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Answer: B

 

NEW QUESTION 156
What is the recommended way to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

  • A. run parallel instances where one is HDD and the other is SSD
  • B. create a third instance and sync the data from the two storage types via batch jobs
  • C. the selection is final and you must resume using the same storage type
  • D. export the data from the existing instance and import the data into a new instance

Answer: D

Explanation:
When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice-versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
Reference: https://cloud.google.com/bigtable/docs/choosing-ssd-hdd-

 

NEW QUESTION 157
When a Cloud Bigtable node fails, ____ is lost.

  • A. all data
  • B. the time dimension
  • C. the last transaction
  • D. no data

Answer: D

Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
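
The pointer-based design described above can be mimicked with a toy model. This is not how Cloud Bigtable is implemented; the dictionaries, names, and the `fail_node` helper are purely illustrative stand-ins for durable storage and for the nodes that point at it.

```python
# Stand-in for durable storage (the role Colossus plays): tablet id -> rows.
storage = {
    "tablet-0": {"a": 1, "b": 2},
    "tablet-1": {"m": 3},
    "tablet-2": {"x": 4, "z": 5},
}

# Nodes hold only pointers (tablet ids), never the rows themselves.
nodes = {
    "node-1": ["tablet-0", "tablet-1"],
    "node-2": ["tablet-2"],
}


def fail_node(nodes, failed):
    """Remove a node and hand its tablet pointers to the surviving nodes."""
    orphaned = nodes.pop(failed)
    survivors = sorted(nodes)
    for i, tablet in enumerate(orphaned):
        # Only metadata moves; the rows in storage are never touched.
        nodes[survivors[i % len(survivors)]].append(tablet)


rows_before = sum(len(rows) for rows in storage.values())
fail_node(nodes, "node-1")
rows_after = sum(len(rows) for rows in storage.values())
print(rows_before, rows_after, nodes)
```

Because the failed node never held any rows, recovery amounts to reassigning pointers, and the row count is the same before and after.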

 

NEW QUESTION 158
Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

  • A. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
  • B. Put the data into Google Cloud Storage.
  • C. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
  • D. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.

Answer: B

Explanation:
Explanation/Reference: https://cloud.google.com/dataproc/

 

NEW QUESTION 159
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

  • A. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
  • B. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
  • C. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
  • D. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.

Answer: A

 

NEW QUESTION 160
You are migrating your data warehouse to BigQuery. You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership. How should you set user permissions?

  • A. Assign the users/groups data viewer access at the table level for each table
  • B. Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views
  • C. Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views
  • D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside

Answer: A

 

NEW QUESTION 161
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

  • A. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
  • B. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
  • C. Use BigQuery UPDATE to further reduce the size of the dataset.
  • D. Denormalize the data as much as possible.
  • E. Preserve the structure of the data as much as possible.

Answer: A,B

 

NEW QUESTION 162
Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage.
You want to minimize the storage cost of the migration. What should you do?

  • A. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
  • B. Put the data into Google Cloud Storage.
  • C. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
  • D. Tune the Cloud Dataproc cluster so that there is just enough disk for all data.

Answer: B

Explanation:
Explanation/Reference:
Reference: https://cloud.google.com/dataproc/

 

NEW QUESTION 163
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?

  • A. Use SSDs on the worker nodes so that the job can run faster
  • B. Migrate the workload to Google Cloud Dataflow
  • C. Use pre-emptible virtual machines (VMs) for the cluster
  • D. Use a higher-memory node so that the job runs faster

Answer: C

 

NEW QUESTION 164
When running a pipeline that has a BigQuery source, on your local machine, you continue to get permission denied errors. What could be the reason for that?

  • A. Your gcloud does not have access to the BigQuery resources
  • B. Pipelines cannot be run locally
  • C. You are missing gcloud on your machine
  • D. BigQuery cannot be accessed from local machines

Answer: A

Explanation:
When reading from a Dataflow source or writing to a Dataflow sink using DirectPipelineRunner, the Cloud Platform account that you configured with the gcloud executable will need access to the corresponding source/sink.
Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner

 

NEW QUESTION 165
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

  • A. Field promotion
  • B. Randomization
  • C. Hashing
  • D. Salting

Answer: A

Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference:
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotti
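
Field promotion means moving a field from the column data into the row key. The sketch below contrasts a timestamp-first key with a key that promotes a device identifier to the front. The key layout, the `#` separator, and the sample events are assumptions chosen for illustration.

```python
def timestamp_first_key(device_id: str, timestamp: str) -> str:
    """Row key that starts with the timestamp (prone to hotspotting)."""
    return f"{timestamp}#{device_id}"


def promoted_key(device_id: str, timestamp: str) -> str:
    """Row key with the device id promoted in front of the timestamp."""
    return f"{device_id}#{timestamp}"


# Hypothetical events arriving within a few seconds of each other.
events = [
    ("device-a", "20211227120000"),
    ("device-b", "20211227120001"),
    ("device-c", "20211227120002"),
]

ts_keys = [timestamp_first_key(d, t) for d, t in events]
field_keys = [promoted_key(d, t) for d, t in events]

# Timestamp-first keys share one prefix, so new writes land in one key range.
ts_prefixes = {key[:12] for key in ts_keys}
# Promoted keys begin with different device ids, spreading writes out.
field_prefixes = {key.split("#")[0] for key in field_keys}
print(len(ts_prefixes), len(field_prefixes))
```

With the promoted key, all readings for one device also sit in a contiguous key range, which makes per-device scans straightforward.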

 

NEW QUESTION 166
An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?

  • A. Cloud BigTable
  • B. Cloud Datastore
  • C. BigQuery
  • D. Cloud SQL

Answer: D

Explanation:
Explanation/Reference: https://cloud.google.com/solutions/business-intelligence/

 

NEW QUESTION 167
......
