Creating Google Cloud Platform (GCP) Dataproc Workflow Templates via Terraform
Introduction
A brief explanation of the tools and cloud services used is given below, along with links to the detailed documentation. Feel free to skip these sections if you are already familiar with them.
What is Terraform?
Terraform is an open-source infrastructure-as-code tool from HashiCorp that lets you define and provision cloud resources declaratively. Detailed documentation: https://www.terraform.io/docs
What is Dataproc?
Dataproc is Google Cloud's managed service for running Apache Spark and Apache Hadoop workloads. Detailed documentation: https://cloud.google.com/dataproc/docs
What are workflow templates?
A Dataproc workflow template is a reusable workflow configuration that defines a graph of jobs and the cluster they run on; instantiating the template runs the workflow. Detailed documentation: https://cloud.google.com/dataproc/docs/concepts/workflows/overview
Writing the Terraform code
For our use case, we wanted an ephemeral cluster: the workflow creates a managed cluster, runs the jobs that process our data, and then deletes the cluster when the workflow is finished. Ephemeral (managed) clusters are easier to configure since they run a single workload.
The Terraform file contains the configuration of the workflow's managed Dataproc cluster (master and worker nodes) as well as the jobs to be run on it, including their arguments and .jar files.
Below is a sample resource definition:
resource "google_dataproc_workflow_template" "template" {
name = "template-example"
location = "us-central1"
placement {
managed_cluster {
cluster_name = "my-cluster"
config {
gce_cluster_config {
zone = "us-central1-a"
tags = ["foo", "bar"]
}
master_config {
num_instances = 1
machine_type = "n1-standard-1"
disk_config {
boot_disk_type = "pd-ssd"
boot_disk_size_gb = 15
}
}
worker_config {
num_instances = 3
machine_type = "n1-standard-2"
disk_config {
boot_disk_size_gb = 10
num_local_ssds = 2
}
}
secondary_worker_config {
num_instances = 2
}
software_config {
image_version = "2.0.35-debian10"
}
}
}
}
jobs {
step_id = "someJob"
spark_job {
main_class = "SomeClass"
}
}
jobs {
step_id = "otherJob"
prerequisite_step_ids = ["someJob"]
presto_job {
query_file_uri = "someuri"
}
}
}
A detailed explanation of each component of this snippet can be found in the official documentation: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataproc_workflow_template
Additionally, run-time parameters can be passed to the jobs using the syntax below:
parameters {
  name   = "PARAMETER_NAME"
  fields = [
    "jobs['job_name1'].pysparkJob.args[0]",
    "jobs['job_name2'].pysparkJob.args[0]",
    "jobs['job_name3'].pysparkJob.args[0]",
  ]
}
parameters {
  name   = "PARAMETER_NAME_2"
  fields = [
    "jobs['job_name1'].pysparkJob.args[1]",
    "jobs['job_name2'].pysparkJob.args[1]",
    "jobs['job_name3'].pysparkJob.args[1]",
  ]
}
PARAMETER_NAME is the name of the parameter that is supplied by the user at run time and passed to the job running on Dataproc.
jobs['job_name1'].pysparkJob.args[0]
The above field path means that the value of PARAMETER_NAME will be passed to job_name1 as its 0th argument.
Similarly, for PARAMETER_NAME_2, the field path below means that its value will be passed to job_name1 as its 1st argument.
jobs['job_name1'].pysparkJob.args[1]
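To see how these pieces fit together, here is a minimal sketch of a parameterised PySpark job inside the google_dataproc_workflow_template resource; the job name, script location, and placeholder argument values are hypothetical. Note that the field paths use the API's camelCase field names (pysparkJob), while the corresponding HCL block is written as pyspark_job:
jobs {
  step_id = "job_name1"
  pyspark_job {
    # Hypothetical script location; the placeholder args below are replaced at instantiation time
    main_python_file_uri = "gs://my-bucket/scripts/job_name1.py"
    args                 = ["placeholder-arg-0", "placeholder-arg-1"]
  }
}
parameters {
  name   = "PARAMETER_NAME"
  fields = ["jobs['job_name1'].pysparkJob.args[0]"]
}
parameters {
  name   = "PARAMETER_NAME_2"
  fields = ["jobs['job_name1'].pysparkJob.args[1]"]
}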
Once this Terraform file is created and applied, you can view the created template in the GCP Console under Dataproc → Workflows, which lists all Dataproc workflow templates along with their configurations.
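For reference, a minimal command sequence could look like the following, assuming the Terraform CLI and the gcloud SDK are installed and authenticated; the region and template name are taken from the sample resource above:
# Apply the Terraform configuration that defines the workflow template
terraform init
terraform apply
# List the workflow templates in the region and inspect the newly created one
gcloud dataproc workflow-templates list --region=us-central1
gcloud dataproc workflow-templates describe template-example --region=us-central1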
Instantiating the templates
Once the template is created, it can be instantiated manually with the following gcloud command:
gcloud dataproc workflow-templates instantiate MY_TEMPLATE \
    --region=us-central1 \
    --parameters="param1=value1,param2=value2"
where MY_TEMPLATE is the name of the workflow template, and run-time parameters can be passed via the --parameters flag.
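Tying this back to the sample template and the parameters defined earlier, a concrete invocation might look like this (the parameter values are placeholders):
gcloud dataproc workflow-templates instantiate template-example \
    --region=us-central1 \
    --parameters="PARAMETER_NAME=gs://my-bucket/input,PARAMETER_NAME_2=gs://my-bucket/output"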
Official documentation link: https://cloud.google.com/sdk/gcloud/reference/dataproc/workflow-templates/instantiate
Alternatively, the template can also be instantiated programmatically, for example through the Dataproc REST API or its client libraries.
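As a rough sketch, the Dataproc REST API's workflowTemplates.instantiate method can be called directly; PROJECT_ID and the parameter values below are placeholders, and the request body is simplified:
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"parameters": {"PARAMETER_NAME": "gs://my-bucket/input", "PARAMETER_NAME_2": "gs://my-bucket/output"}}' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/us-central1/workflowTemplates/template-example:instantiate"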
Thank you.
Developer Contribution → Harshdeep Mishra and Aamir Khan