Creating Google Cloud Platform (GCP) Dataproc Workflow Templates via Terraform

Skuad Engineering
Mar 6, 2023 · 4 min read


Introduction

A basic explanation of the tools and cloud services used is given below, along with links to the detailed documentation. Feel free to skip these sections if you are already familiar with them.

What is Terraform?

Terraform is an infrastructure-as-code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features.
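
For instance, a minimal provider configuration for Google Cloud can look like the sketch below; the project ID and region here are placeholders:

# Minimal sketch of a Google Cloud provider block; values are placeholders.
provider "google" {
  project = "my-project-id"
  region  = "us-central1"
}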

What is Dataproc?

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them.

What are workflow templates?

A Workflow Template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs.

Writing the Terraform code

For our use case, we wanted to create an ephemeral cluster, run our jobs to process the data, and then delete the Dataproc cluster.

The workflow will create an “ephemeral” cluster to run workflow jobs, and then delete the cluster when the workflow is finished. Ephemeral (managed) clusters are easier to configure since they run a single workload.

The Terraform file describes the Dataproc workflow: the managed cluster configuration (master and worker nodes) and the jobs to run on it, along with their arguments and .jar files.

Below is a sample configuration:

resource "google_dataproc_workflow_template" "template" {
name = "template-example"
location = "us-central1"
placement {
managed_cluster {
cluster_name = "my-cluster"
config {
gce_cluster_config {
zone = "us-central1-a"
tags = ["foo", "bar"]
}
master_config {
num_instances = 1
machine_type = "n1-standard-1"
disk_config {
boot_disk_type = "pd-ssd"
boot_disk_size_gb = 15
}
}
worker_config {
num_instances = 3
machine_type = "n1-standard-2"
disk_config {
boot_disk_size_gb = 10
num_local_ssds = 2
}
}
secondary_worker_config {
num_instances = 2
}
software_config {
image_version = "2.0.35-debian10"
}
}
}
}
jobs {
step_id = "someJob"
spark_job {
main_class = "SomeClass"
}
}
jobs {
step_id = "otherJob"
prerequisite_step_ids = ["someJob"]
presto_job {
query_file_uri = "someuri"
}
}
}

A detailed explanation of each component of this snippet is given in the official documentation: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataproc_workflow_template
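
The parameters in the next section target pysparkJob arguments, so a pyspark_job entry with arguments and a .jar dependency might look roughly like the sketch below; the bucket paths, file names, and argument values are placeholders:

# Hypothetical PySpark job with arguments and a .jar dependency.
jobs {
  step_id = "job_name1"
  pyspark_job {
    main_python_file_uri = "gs://my-bucket/scripts/job.py"
    args                 = ["first-arg", "second-arg"]
    jar_file_uris        = ["gs://my-bucket/jars/dependency.jar"]
  }
}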

Additionally, run-time parameters can be passed to the jobs using the following syntax:

parameters {
  name = "PARAMETER_NAME"
  fields = [
    "jobs['job_name1'].pysparkJob.args[0]",
    "jobs['job_name2'].pysparkJob.args[0]",
    "jobs['job_name3'].pysparkJob.args[0]",
  ]
}

parameters {
  name = "PARAMETER_NAME_2"
  fields = [
    "jobs['job_name1'].pysparkJob.args[1]",
    "jobs['job_name2'].pysparkJob.args[1]",
    "jobs['job_name3'].pysparkJob.args[1]",
  ]
}

PARAMETER_NAME is the name of the parameter that is taken from the user at run time and passed to the job that will run on Dataproc.

jobs['job_name1'].pysparkJob.args[0]

The above line means that the value of PARAMETER_NAME will be passed to job_name1 as its 0th argument.

Similarly, for PARAMETER_NAME_2, the line below means that its value will be passed to job_name1 as its 1st argument.

jobs['job_name1'].pysparkJob.args[1]
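
The template is created by applying this file with the standard Terraform workflow (a minimal sketch, assuming the Google provider and credentials are already configured):

terraform init    # download the required providers
terraform plan    # preview the workflow template to be created
terraform apply   # create the workflow template in the project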

Once the template has been created, you can view it in the Google Cloud console by following the steps below:

Step 1: Go to the Dataproc console
Step 2: Select the Workflows option
Step 3: Select Workflow templates

In this section, you can find the list of all Dataproc workflow templates along with their configurations.
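
The same list can also be retrieved from the command line (a sketch, assuming the us-central1 region used above):

gcloud dataproc workflow-templates list --region=us-central1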

Instantiating the templates

Once the template is created, it can be instantiated manually with the following gcloud command:

gcloud dataproc workflow-templates instantiate MY_TEMPLATE \
  --region=us-central1 \
  --parameters="param1=value1,param2=value2"

where MY_TEMPLATE is the name of the template. Run-time parameter values can be passed through the --parameters flag, as shown above.

Official documentation link: https://cloud.google.com/sdk/gcloud/reference/dataproc/workflow-templates/instantiate

Alternatively, the template can also be triggered programmatically through the Dataproc API or client libraries.
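
For example, one option is to call the Dataproc REST API directly. A rough sketch using curl is shown below, where PROJECT_ID, REGION, and MY_TEMPLATE are placeholders and the request body mirrors the --parameters flag:

# Sketch: instantiate the workflow template via the Dataproc REST API.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"parameters": {"param1": "value1", "param2": "value2"}}' \
  "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/workflowTemplates/MY_TEMPLATE:instantiate"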

Thank you.

Developer Contribution → Harshdeep Mishra and Aamir Khan
