Production Mode¶
Let's take a deep dive into how a user can call the terraglue module to deploy their own Glue job in AWS.
For this task, let's suppose we want to:
- Deploy a Glue job using a Spark application already available
- Create and associate an IAM role to the job
- Use an already available KMS key from the AWS account to create a Security Configuration
- Define some custom job arguments
Structuring a Terraform Project¶
The first step is to set up a Terraform project. Everyone is free to structure a Terraform project however they prefer. To keep things as simple as possible, the project structure used in this demo consists of:
- An app/ folder to store the Spark application, additional Python files and unit tests
- A policy/ folder to store a JSON file that will be used to create an IAM role for the job
- A main.tf Terraform file to call the terraglue module
Here is the same structure as a tree:
├───app
│   ├───src
│   │       main.py
│   │       utils.py
│   │
│   └───tests
│           test_main.py
│
├───policy
│       glue-job-policy.json
│
│   main.tf
Do I need to follow this exact project structure in order to work with terraglue?
No, you don't, and that's one of the coolest terraglue features. You can take a Terraform project with any structure and call terraglue from it.
You will just need to pay attention to the module variables you pass during the call. To see a full list of all acceptable variables, check the Variables section. The Validations section is also a good page to read in order to be aware of some input variable conditions based on specific scenarios.
If you need more information about how to structure a Terraform project, check the official HashiCorp documentation on the topic.
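Whatever layout is chosen, main.tf (or a sibling file) usually also needs a terraform block and an AWS provider configuration before any data source or module can be evaluated. A minimal sketch follows; the version constraints and the region are assumptions made for illustration, not requirements stated by terraglue.

```hcl
# Hypothetical version constraints: adjust them to what your project actually uses
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"
    }
  }
}

# The region is an assumption for this example. Credentials are resolved
# through the usual AWS mechanisms (environment variables, shared credentials file, etc.)
provider "aws" {
  region = "us-east-1"
}
```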
Collecting Terraform Data Sources¶
With the project structured, let's collect some Terraform data sources that will be used throughout the project. Data sources let us look up values such as the account ID, the region and existing keys instead of hard coding them. This step is not required, but it is good practice whenever the resources and configurations being declared depend on such values.
So, let's take our main.tf file and declare the three Terraform data sources listed below:
- An aws_caller_identity data source to extract the account ID
- An aws_region data source to get the target AWS region
- An aws_kms_key data source to get a KMS key by its alias (assuming that there is a KMS key alias in the target AWS account)
Collecting Terraform data sources
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}
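As a quick illustration of how these data sources are consumed later on, the sketch below reads the attributes used in the rest of this page. The local and output names are hypothetical and are not required by terraglue. The block assumes it lives in the same main.tf as the data sources above.

```hcl
# Hypothetical helper names: they only illustrate how the data sources are referenced
locals {
  account_id  = data.aws_caller_identity.current.account_id
  region_name = data.aws_region.current.name
}

# Exposing the key ARN as an output makes it easy to confirm that the alias was resolved
output "glue_kms_key_arn" {
  description = "ARN of the existing KMS key found through the alias/kms-glue alias"
  value       = data.aws_kms_key.glue.arn
}
```

After terraform apply, the output shows the resolved key ARN, which is a simple way to confirm that the alias exists in the target account.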
And now we are ready to call the terraglue module and start customizing it through its variables.
Configuring Terraglue¶
To keep things clear for users, this demo is divided into multiple sections. The idea is to deliver a step-by-step guide showing every customization applied to the terraglue module call, covering the following topics:
- Calling the module from GitHub
- Setting up IAM variables
- Setting up KMS variables
- Setting up S3 scripts location
- Setting up the Glue job
- Setting up job arguments
By following all demos from each topic, users will be able to fully understand terraglue and all its different ways to deploy Glue jobs.
Calling The Source Module¶
This section is all about showing how to call the terraglue module directly from GitHub.
Calling the terraglue module directly from GitHub
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"
}
There are more things to set up before deploying terraglue
As stated before in this documentation, terraglue has a lot of variables and most of them have default values. Still, there are some things to configure and customize before deploying it in a target AWS account.
Optionally, users can initialize the declared terraglue module with the terraform init command to get a simple but very handy feature: autocompletion of the module's variable names. This can make things a lot easier when configuring terraglue in the next sections.
Setting Up IAM Variables¶
So, let's start customizing terraglue by setting some IAM variables that tell the module how to handle the IAM role that the Glue job will assume.
For this demo, let's set the following configurations:
- Inform terraglue that we want to create an IAM role in this project
- Inform terraglue that the IAM policies that will be part of this role are located in the policy/ folder
- Inform terraglue of the name of the IAM role to be created
Setting up IAM variables on terraglue
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # Setting up IAM variables
  flag_create_iam_role = true
  glue_policies_path   = "policy"
  glue_role_name       = "terraglue-demo-glue-role"
}
To see more about all IAM configuration variables available on terraglue, check this link.
Setting Up KMS Variables¶
The next step in this demo handles the KMS key configuration that affects our Glue job. In this project, we will apply the following KMS configurations on terraglue:
- Inform terraglue to not create a KMS key during project deploy (we will use an existing key)
- Inform terraglue of the ARN of the existing KMS key (collected from the aws_kms_key Terraform data source declared at the beginning of the project)
Setting up KMS variables on terraglue
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # Setting up IAM variables
  flag_create_iam_role = true
  glue_policies_path   = "policy"
  glue_role_name       = "terraglue-demo-glue-role"

  # Setting up KMS variables
  flag_create_kms_key = false
  kms_key_arn         = data.aws_kms_key.glue.arn
}
To see more about all KMS configuration variables available on terraglue, check this link.
Setting Up S3 Scripts Location¶
After successfully configuring the IAM and KMS variables, it's time to set up the bucket reference that terraglue will use to store all the Glue script files in the project.
Basically, this is the step where users provide the name of the bucket that will host the files located in the app/ project folder so they can be used in the Glue job.
In this demo, we will use the aws_caller_identity and aws_region data sources collected at the beginning of the project to build a bucket name without hard coding information such as the account ID and the AWS region.
Setting up an S3 bucket name to store script files
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # Setting up IAM variables
  flag_create_iam_role = true
  glue_policies_path   = "policy"
  glue_role_name       = "terraglue-demo-glue-role"

  # Setting up KMS variables
  flag_create_kms_key = false
  kms_key_arn         = data.aws_kms_key.glue.arn

  # Setting up S3 scripts location
  glue_scripts_bucket_name = "datadelivery-glue-assets-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
}
To see more about all S3 configuration variables available on terraglue, check this link.
Setting Up A Glue Job¶
And here we have what is probably the most important set of configurations in a terraglue module call: the Glue job setup.
The idea with this block of variables is to:
- Inform terraglue to associate a name to the Glue job
- Inform terraglue to associate a description to the Glue job
- Inform terraglue to use G.1X workers
- Inform terraglue to use 5 workers
Setting up a Glue job
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # Setting up IAM variables
  flag_create_iam_role = true
  glue_policies_path   = "policy"
  glue_role_name       = "terraglue-demo-glue-role"

  # Setting up KMS variables
  flag_create_kms_key = false
  kms_key_arn         = data.aws_kms_key.glue.arn

  # Setting up S3 scripts location
  glue_scripts_bucket_name = "datadelivery-glue-assets-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"

  # Setting up Glue
  glue_job_name              = "terraglue-sample-job"
  glue_job_description       = "A sample job using terraglue with production mode"
  glue_job_worker_type       = "G.1X"
  glue_job_number_of_workers = 5
}
To see more about all Glue configuration variables available on terraglue, check this link.
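If the worker settings need to change between environments, one option is to expose them as root-level input variables and pass them to the module. This is a sketch only: the variable names and the allowed worker types below are assumptions made for this example, not rules enforced by terraglue.

```hcl
# Hypothetical root-level variables; the names are not part of terraglue
variable "worker_type" {
  description = "Glue worker type used by the job"
  type        = string
  default     = "G.1X"

  # Example project rule: only these two worker types are accepted here
  validation {
    condition     = contains(["G.1X", "G.2X"], var.worker_type)
    error_message = "The worker type must be G.1X or G.2X in this project."
  }
}

variable "number_of_workers" {
  description = "Number of workers allocated to the job"
  type        = number
  default     = 5
}

# Inside the module call, the literals would then become:
#   glue_job_worker_type       = var.worker_type
#   glue_job_number_of_workers = var.number_of_workers
```

The defaults match the values used in this demo, so the deployed job stays the same unless a different value is passed at plan or apply time.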
Setting Up Job Arguments¶
Finally, it's important to show how users can provide their own Glue job arguments to terraglue. This is done through the glue_job_args module variable, which accepts a map with all the user arguments used to customize the Glue job.
The key points about the job arguments declared in this demo are:
- Set --job-bookmark-option to disable job bookmarks for the job
- Set --additional-python-modules to use the sparksnake Python package as an additional Python module
- Set --extra-py-files to add the utils.py file uploaded in this same project as an extra Python file to be used in the job
In this step, users are free to set all Glue acceptable arguments. A full list can be found in the AWS official documentation about job parameters.
Setting up Glue job arguments
💻 Terraform code:
# Collecting data sources
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_kms_key" "glue" {
  key_id = "alias/kms-glue"
}

# Calling terraglue module in production mode
module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # Setting up IAM variables
  flag_create_iam_role = true
  glue_policies_path   = "policy"
  glue_role_name       = "terraglue-demo-glue-role"

  # Setting up KMS variables
  flag_create_kms_key = false
  kms_key_arn         = data.aws_kms_key.glue.arn

  # Setting up S3 scripts location
  glue_scripts_bucket_name = "datadelivery-glue-assets-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"

  # Setting up Glue
  glue_job_name              = "terraglue-sample-job"
  glue_job_description       = "A sample job using terraglue with production mode"
  glue_job_worker_type       = "G.1X"
  glue_job_number_of_workers = 5

  # Setting up job args
  glue_job_args = {
    "--job-language"                     = "python"
    "--job-bookmark-option"              = "job-bookmark-disable"
    "--enable-metrics"                   = true
    "--enable-continuous-cloudwatch-log" = true
    "--enable-spark-ui"                  = true
    "--encryption-type"                  = "sse-s3"
    "--enable-glue-datacatalog"          = true
    "--enable-job-insights"              = true
    "--additional-python-modules"        = "sparksnake"
    "--extra-py-files"                   = "s3://datadelivery-glue-assets-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}/jobs/terraglue-sample-job/app/src/utils.py"
  }
}
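The bucket name appears twice in the call above, once in glue_scripts_bucket_name and once inside --extra-py-files. One possible way to avoid the repetition is to build it once in a locals block. The sketch below is a variation of the same module call, not an additional one: the local names are hypothetical, the module variables are the ones already shown on this page, and the remaining arguments are omitted for brevity.

```hcl
# Hypothetical locals: build the repeated values only once
locals {
  scripts_bucket = "datadelivery-glue-assets-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
  job_name       = "terraglue-sample-job"
}

module "terraglue" {
  source = "git::https://github.com/ThiagoPanini/terraglue?ref=main"

  # ... IAM, KMS and the remaining Glue variables as configured above ...

  glue_scripts_bucket_name = local.scripts_bucket
  glue_job_name            = local.job_name

  glue_job_args = {
    "--job-bookmark-option"       = "job-bookmark-disable"
    "--additional-python-modules" = "sparksnake"
    # Same S3 path as in the full example, now derived from the locals
    "--extra-py-files" = "s3://${local.scripts_bucket}/jobs/${local.job_name}/app/src/utils.py"
  }
}
```

With this layout, renaming the bucket or the job only requires changing a single line, and the path in --extra-py-files follows automatically.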
And with this subsection we reach the end of the demos related to terraglue module configuration.
Running Terraform Commands¶
After all this configuration journey, we now just need to plan and apply the deployment using the respective Terraform commands.
Terraform plan¶
With the terraform plan command, we can see all the resources that will be deployed with the configuration we chose.
Terraform apply¶
And now we can finally deploy the declared infrastructure using the terraform apply command.
Deployed Resources¶
To finish this demo page, let's see all the resources that were deployed by the terraglue module call. In essence, we are talking about:
- An IAM role with permissions specified by a JSON file provided by users in the Terraform project
- Python scripts in a given S3 bucket to be used in a Glue job
- A Glue job with parameters and arguments chosen by users
✅ I hope these demos help you use terraglue to configure and deploy your own Glue job in AWS. Keep reading the docs to become a master terraglue user!