
Google Professional-Data-Engineer Real Exam Questions and Answers FREE
The Google Professional Data Engineer certification is designed to equip individuals with the knowledge and skills needed to enable data-driven decision-making by collecting, transforming, and publishing data. To earn the certificate, candidates must pass a single exam that measures their skills in leveraging, deploying, and continuously training pre-existing machine learning models. The exam also evaluates the ability to design, build, operationalize, monitor, and secure data processing systems.
To become a Google Certified Professional Data Engineer, candidates must have a strong foundation in data engineering concepts and technologies. They must also possess excellent problem-solving skills and have a deep understanding of data analysis and interpretation. Candidates can prepare for the exam by taking online courses, attending training programs, and practicing using real-world data scenarios.
The Google Professional Data Engineer certification exam is comprehensive and requires detailed knowledge of data engineering concepts and technologies. It assesses the candidate's ability to apply this knowledge to real-world scenarios and to design and implement solutions that meet the needs of a wide range of users. The exam is intended for professionals who have experience working with data engineering technologies and who want to advance their careers in this field.
NEW QUESTION # 61
What are two of the benefits of using denormalized data structures in BigQuery?
- A. Reduces the amount of storage required, increases query speed
- B. Increases query speed, makes queries simpler
- C. Reduces the amount of data processed, reduces the amount of storage required
- D. Reduces the amount of data processed, increases query speed
Answer: B
Explanation:
Denormalization increases query speed for tables with billions of rows because BigQuery's performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don't have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses. Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.
Reference:
https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
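The trade-off described in this explanation can be sketched with plain Python data structures. This is a toy model and not BigQuery code; the names `customers`, `orders`, and `orders_denormalized` and all values are made up for illustration.

```python
# Toy illustration of normalized vs. denormalized data (not BigQuery code).
# All table names and values are hypothetical.

# Normalized: customer details live in a separate table keyed by customer_id.
customers = {
    1: {"name": "Ada", "country": "UK"},
    2: {"name": "Lin", "country": "SG"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 30},
    {"order_id": 101, "customer_id": 1, "amount": 20},
    {"order_id": 102, "customer_id": 2, "amount": 50},
]

# Denormalized: customer fields are copied into every order row.
orders_denormalized = [
    {**order, **customers[order["customer_id"]]} for order in orders
]


def total_by_country_normalized(country):
    """Needs a join-like lookup into `customers` for every order row."""
    return sum(
        o["amount"] for o in orders
        if customers[o["customer_id"]]["country"] == country
    )


def total_by_country_denormalized(country):
    """Single-table filter; no join is needed."""
    return sum(
        o["amount"] for o in orders_denormalized if o["country"] == country
    )


def stored_values(rows):
    """Count stored field values as a rough proxy for storage size."""
    return sum(len(row) for row in rows)
```

Both functions return the same total, but the denormalized version reads one table only, at the cost of storing the customer fields once per order.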
NEW QUESTION # 62
Which of the following are examples of hyperparameters? (Select 2 answers.)
- A. Number of nodes in each hidden layer
- B. Biases
- C. Number of hidden layers
- D. Weights
Answer: A,C
Explanation:
If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many "hidden" layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all. They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.
Reference: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
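The distinction can be made concrete with a small sketch in plain Python (no TensorFlow). The function names `build_network` and `count_parameters` are hypothetical; the point is only that the layer counts are fixed up front, while the weights and biases are the values a training job would adjust.

```python
import random


def build_network(n_inputs, n_outputs, hidden_layers, nodes_per_layer, seed=0):
    """Create randomly initialized weights and biases for a dense network.

    hidden_layers and nodes_per_layer are hyperparameters: they are chosen
    before training and fix the shape of the model. The returned weights
    and biases are parameters: training would adjust their values.
    """
    rng = random.Random(seed)
    sizes = [n_inputs] + [nodes_per_layer] * hidden_layers + [n_outputs]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(fan_in)]
            for _ in range(fan_out)
        ]
        biases = [0.0] * fan_out
        layers.append({"weights": weights, "biases": biases})
    return layers


def count_parameters(layers):
    """Count every trainable weight and bias in the network."""
    return sum(
        len(layer["biases"]) + sum(len(row) for row in layer["weights"])
        for layer in layers
    )
```

Changing `hidden_layers` or `nodes_per_layer` changes how many parameters exist; training only changes the values those parameters hold.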
NEW QUESTION # 63
Which methods can be used to reduce the number of rows processed by BigQuery?
- A. Splitting tables into multiple tables; using the LIMIT clause
- B. Splitting tables into multiple tables; putting data in partitions
- C. Putting data in partitions; using the LIMIT clause
- D. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
Answer: B
Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
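The difference between limiting output and limiting the data scanned can be sketched with a toy model in plain Python. This is not how BigQuery is implemented; `ROWS`, `query_with_limit`, and `query_partition` are hypothetical names, and "rows scanned" stands in for the rows the service has to process.

```python
from collections import defaultdict

# Hypothetical event rows: (day, value).
ROWS = [("2024-01-01", i) for i in range(5)] + [
    ("2024-01-02", i) for i in range(3)
]


def query_with_limit(rows, day, limit):
    """Scan every row, filter, and only then truncate the result."""
    scanned = 0
    matches = []
    for row_day, value in rows:
        scanned += 1
        if row_day == day:
            matches.append(value)
    return matches[:limit], scanned


def partition_by_day(rows):
    """Group rows into one partition per day."""
    partitions = defaultdict(list)
    for row_day, value in rows:
        partitions[row_day].append(value)
    return partitions


def query_partition(partitions, day):
    """Read only the partition for the requested day."""
    values = partitions.get(day, [])
    return list(values), len(values)
```

In this model the LIMIT-style query touches all 8 rows to return one value, while the partitioned query touches only the 3 rows stored for the requested day.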
NEW QUESTION # 64
Which Google Cloud Platform service is an alternative to Hadoop with Hive?
- A. BigQuery
- B. Cloud Dataflow
- C. Cloud Bigtable
- D. Cloud Datastore
Answer: A
Explanation:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse.
Reference: https://en.wikipedia.org/wiki/Apache_Hive
NEW QUESTION # 65
You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?
- A. Use Cloud Vision AutoML, but reduce your dataset twice.
- B. Use Cloud Vision API by providing custom labels as recognition hints.
- C. Use Cloud Vision AutoML with the existing dataset.
- D. Train your own image recognition model leveraging transfer learning techniques.
Answer: C
NEW QUESTION # 66
Which TensorFlow function can you use to configure a categorical column if you don't know all of the possible values for that column?
- A. categorical_column_with_vocabulary_list
- B. categorical_column_with_hash_bucket
- C. sparse_column_with_keys
- D. categorical_column_with_unknown_values
Answer: B
Explanation:
If you know the set of all possible feature values of a column and there are only a few of them, you can use categorical_column_with_vocabulary_list. Each key in the list will get assigned an auto-incremental ID starting from 0.
What if we don't know the set of possible values in advance? Not a problem. We can use categorical_column_with_hash_bucket instead. What will happen is that each possible value in the feature column occupation will be hashed to an integer ID as we encounter them in training.
Reference: https://www.tensorflow.org/tutorials/wide
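The two strategies can be sketched in plain Python. This is a conceptual illustration only: the helper names are hypothetical, and the SHA-256 hash used here is an assumption for the sketch, not the hash function TensorFlow uses internally.

```python
import hashlib


def vocabulary_id(value, vocabulary):
    """Map a known value to its index in a fixed vocabulary list.

    Raises ValueError for a value that is not in the vocabulary.
    """
    return vocabulary.index(value)


def hash_bucket_id(value, num_buckets):
    """Map any string to a bucket ID in the range [0, num_buckets).

    No vocabulary is needed, so unseen values still get an ID. Distinct
    values may collide in the same bucket.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

A larger `num_buckets` lowers the chance that two different values share an ID, at the cost of a larger feature space.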
NEW QUESTION # 67
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?
- A. Load the data every 30 minutes into a new partitioned table in BigQuery.
- B. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
- C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.
- D. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.
Answer: C
NEW QUESTION # 68
Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?
- A. You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.
- B. You will not use the data to back a user-facing or latency-sensitive application.
- C. You need to integrate with Google BigQuery.
- D. You expect to store at least 10 TB of data.
Answer: C
Explanation:
For example, if you plan to store extensive historical data for a large number of remote-sensing devices and then use the data to generate daily reports, the cost savings for HDD storage may justify the performance tradeoff. On the other hand, if you plan to use the data to display a real-time dashboard, it probably would not make sense to use HDD storage, because reads would be much more frequent in this case, and reads are much slower with HDD storage.
NEW QUESTION # 69
As your organization expands its usage of GCP, many teams have started to create their own projects.
Projects are further multiplied to accommodate different stages of deployments and target audiences.
Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies. Which two steps should you take? Choose 2 answers.
- A. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
- B. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
- C. Use Cloud Deployment Manager to automate access provision.
- D. Introduce resource hierarchy to leverage access control policy inheritance.
- E. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.
Answer: A,D
NEW QUESTION # 70
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?
- A. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
- B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
- C. Increase the cluster size with more non-preemptible workers.
- D. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
Answer: A
Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
NEW QUESTION # 71
Which of these rules apply when you add preemptible workers to a Dataproc cluster? (Select 2 answers.)
- A. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
- B. A Dataproc cluster cannot have only preemptible workers.
- C. Preemptible workers cannot use persistent disk.
- D. Preemptible workers cannot store data.
Answer: B,D
Explanation:
The following rules will apply when you use preemptible workers with a Cloud Dataproc cluster:
* Processing only: Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
* No preemptible-only clusters: To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
* Persistent disk size: As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms
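The rules quoted in this explanation can be written down as a small check in plain Python. This is a sketch of the stated rules only, not a Dataproc API; `preemptible_boot_disk_gb` and `validate_cluster` are hypothetical helper names.

```python
def preemptible_boot_disk_gb(primary_boot_disk_gb):
    """Default disk size for a preemptible worker, per the rule above:
    the smaller of 100 GB or the primary worker boot disk size."""
    return min(100, primary_boot_disk_gb)


def validate_cluster(primary_workers, preemptible_workers):
    """Reject preemptible-only clusters and report which nodes do what.

    Only primary workers store data; preemptible workers add processing
    capacity only.
    """
    if preemptible_workers > 0 and primary_workers == 0:
        raise ValueError("a cluster cannot consist only of preemptible workers")
    return {
        "storage_nodes": primary_workers,
        "processing_nodes": primary_workers + preemptible_workers,
    }
```

For example, `validate_cluster(2, 4)` describes a cluster in which two nodes hold data and six nodes process it.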
NEW QUESTION # 72
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour.
How should you design the solution?
- A. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
- B. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
- C. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
- D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
Answer: D
NEW QUESTION # 73
When a Cloud Bigtable node fails, ____ is lost.
- A. all data
- B. no data
- C. the time dimension
- D. the last transaction
Answer: B
Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
* Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
* Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
* When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
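The separation of storage from serving nodes can be sketched as a toy model in plain Python. This is a conceptual illustration, not Bigtable's implementation; the `ToyCluster` class and its round-robin reassignment are assumptions made for the sketch.

```python
class ToyCluster:
    """Nodes hold only pointers (tablet IDs); rows live in shared storage."""

    def __init__(self, node_names, tablets):
        # Shared storage: tablet_id -> rows (stands in for Colossus).
        self.storage = dict(tablets)
        # Each node owns a list of tablet IDs, never the rows themselves.
        self.assignments = {name: [] for name in node_names}
        for i, tablet_id in enumerate(sorted(self.storage)):
            name = node_names[i % len(node_names)]
            self.assignments[name].append(tablet_id)

    def fail_node(self, name):
        """Drop a node and hand its pointers to the surviving nodes.

        Assumes at least one node survives. No rows are copied or deleted.
        """
        orphaned = self.assignments.pop(name)
        survivors = sorted(self.assignments)
        for i, tablet_id in enumerate(orphaned):
            self.assignments[survivors[i % len(survivors)]].append(tablet_id)

    def read(self, tablet_id):
        """Serve a tablet through whichever node currently points at it."""
        for tablet_ids in self.assignments.values():
            if tablet_id in tablet_ids:
                return self.storage[tablet_id]
        raise KeyError(tablet_id)
```

After `fail_node` runs, every tablet is still readable because only the pointer lists changed; `storage` was never touched.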
NEW QUESTION # 74
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?
- A. Dropout Methods
- B. Dimensionality Reduction
- C. Serialization
- D. Threading
Answer: A
Explanation:
Reference: https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505
Topic 1, Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
  * 8 physical servers in 2 clusters
    * SQL Server - user data, inventory, static data
  * 3 physical servers
    * Cassandra - metadata, tracking messages
  * 10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
  * 60 virtual machines across 20 physical servers
    * Tomcat - Java services
    * Nginx - static content
    * Batch servers
* Storage appliances
  * iSCSI for virtual machine (VM) hosts
  * Fibre Channel storage area network (FC SAN) - SQL server storage
  * Network-attached storage (NAS) - image storage, logs, backups
* Apache Hadoop/Spark servers
  * Core Data Lake
  * Data analysis workloads
* 20 miscellaneous servers
  * Jenkins, monitoring, bastion hosts,
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
NEW QUESTION # 75
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data. What should you do?
- A. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
- B. Specify a customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
- C. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
- D. Specify a customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
Answer: C
NEW QUESTION # 76
When you design a Google Cloud Bigtable schema it is recommended that you _________.
- A. Create schema designs that are based on a relational database design
- B. Avoid schema designs that require atomicity across rows
- C. Create schema designs that require atomicity across rows
- D. Avoid schema designs that are based on NoSQL concepts
Answer: B
Explanation:
All operations are atomic at the row level. For example, if you update two rows in a table, it's possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows.
Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys
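The consequence of row-level atomicity can be sketched with a toy table in plain Python. This is not the Bigtable client library; `ToyTable`, `mutate_row`, `update_two_rows`, and the row keys are hypothetical, and a `None` value is used here only to simulate a mutation that fails.

```python
class ToyTable:
    """Toy table in which each single-row mutation is all-or-nothing."""

    def __init__(self):
        self.rows = {}

    def mutate_row(self, row_key, cells):
        """Apply all cell updates to one row, or none of them."""
        # Validate every cell first so a bad cell rejects the whole mutation.
        for column, value in cells.items():
            if value is None:
                raise ValueError(f"invalid value for column {column!r}")
        self.rows.setdefault(row_key, {}).update(cells)


def update_two_rows(table, first_key, second_key, amount, fail_second=False):
    """Two separate row mutations: nothing ties them together."""
    table.mutate_row(first_key, {"delta": -amount})
    table.mutate_row(second_key, {"delta": None if fail_second else amount})
```

If the second mutation fails, the first row stays updated. Putting the values that must change together into one row (for example a single `transfer#1` row with `debit` and `credit` columns) keeps the change inside one atomic mutation.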