Glue

Amazon Web Services
Analytics
AWS Glue is a fully managed extract, transform, and load (ETL) service for moving data between data stores. It automatically discovers and categorizes your data, suggests schemas for it, and tracks how the data changes over time. You can create and run an ETL job with a few clicks in the AWS Management Console, and the service handles provisioning, monitoring, and maintenance of the resources the job needs. AWS Glue is designed to work with Amazon S3 and Amazon Redshift, but it can also be used with other data stores. Because the service is serverless, you pay only for the resources you use and do not provision or manage infrastructure.
Terraform Name
aws_glue_catalog_database
Glue attributes:

The following arguments are supported:

  • catalog_id - (Optional) ID of the Glue Catalog to create the database in. If omitted, this defaults to the AWS Account ID.
  • create_table_default_permission - (Optional) Creates a set of default permissions on the table for principals. See create_table_default_permission below.
  • description - (Optional) Description of the database.
  • location_uri - (Optional) Location of the database (for example, an HDFS path).
  • name - (Required) Name of the database. The acceptable characters are lowercase letters, numbers, and the underscore character.
  • parameters - (Optional) Map of key-value pairs that define parameters and properties of the database.
  • target_database - (Optional) Configuration block for a target database for resource linking. See target_database below.
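
As a rough illustration of the optional arguments above, the sketch below sets description, location_uri, and parameters on a database. The resource label, database name, S3 path, and parameter keys are hypothetical placeholders chosen for this example, not values from a real account. Because catalog_id is omitted, the database would be created in the catalog of the current AWS account.

```hcl
# Hypothetical example: every name and value below is a placeholder.
resource "aws_glue_catalog_database" "analytics" {
  # Lowercase letters, numbers, and underscores only
  name         = "analytics_raw"
  description  = "Raw event data for the analytics team"
  location_uri = "s3://example-bucket/analytics/raw/"

  # Free-form key-value properties stored with the database
  parameters = {
    owner       = "data-platform"
    environment = "dev"
  }
}
```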

target_database

  • catalog_id - (Required) ID of the Data Catalog in which the database resides.
  • database_name - (Required) Name of the catalog database.
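
A resource link built with target_database might look like the following minimal sketch. The account ID 111122223333, the link name, and the target database name are assumptions made up for illustration; in practice the target database would typically live in another account's Data Catalog and would need to be shared with yours first.

```hcl
# Hypothetical example: a local database entry that links to a database
# in another Data Catalog. All identifiers are placeholders.
resource "aws_glue_catalog_database" "sales_link" {
  name = "sales_link"

  target_database {
    catalog_id    = "111122223333" # ID of the catalog that owns the target database
    database_name = "sales"        # name of the database in that catalog
  }
}
```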

create_table_default_permission

  • permissions - (Optional) The permissions that are granted to the principal.
  • principal - (Optional) The principal who is granted permissions. See principal below.

principal

  • data_lake_principal_identifier - (Optional) An identifier for the Lake Formation principal.

Create Glue via Terraform:
The following HCL creates a Glue Catalog Database resource with default table permissions.
Syntax:

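resource "aws_glue_catalog_database" "aws_glue_catalog_database" {
  # Lowercase letters, numbers, and underscores only
  name = "my_catalog_database"

  # Default permissions applied to tables created in this database
  create_table_default_permission {
    permissions = ["SELECT"]

    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}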

Create Glue via CLI:
Parameters:

create-database
[--catalog-id <value>]
--database-input <value>
[--tags <value>]
[--cli-input-json | --cli-input-yaml]
[--generate-cli-skeleton <value>]
[--debug]
[--endpoint-url <value>]
[--no-verify-ssl]
[--no-paginate]
[--output <value>]
[--query <value>]
[--profile <value>]
[--region <value>]
[--version <value>]
[--color <value>]
[--no-sign-request]
[--ca-bundle <value>]
[--cli-read-timeout <value>]
[--cli-connect-timeout <value>]
[--cli-binary-format <value>]
[--no-cli-pager]
[--cli-auto-prompt]
[--no-cli-auto-prompt]

Example:

aws glue create-database \
   --database-input "{\"Name\":\"tempdb\"}" \
   --profile my_profile \
   --endpoint-url https://glue.us-east-1.amazonaws.com

Costs
The cost of using Glue depends on several factors, including the amount of data processed, the number of ETL jobs run, and the number of Data Catalog API requests made. For data processing, you are charged for the Data Processing Units (DPUs) used to run your ETL jobs. Each DPU provides a fixed amount of compute and memory, and usage is billed at a per-DPU-hour rate. For the Data Catalog, you are charged for the number of API requests made and the number of objects stored in the catalog.
Direct Cost

$1 per 1,000,000 AWS Glue Data Catalog requests

$ per Request for Catalog-Request:Request in <Region>
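
The pricing model described above can be written as a rough formula. This is only a sketch: the per-DPU-hour rate r is not stated on this page and is left symbolic, and any free tier or minimum billing duration is ignored as a simplifying assumption.

```latex
% Job cost: DPUs used, times runtime in hours, times the per-DPU-hour rate r
\text{job cost} = n_{\text{DPU}} \times t_{\text{hours}} \times r

% Data Catalog request cost at the listed rate of $1 per 1,000,000 requests
\text{request cost} = \frac{N_{\text{requests}}}{1{,}000{,}000} \times \$1

% Worked example: 3,000,000 requests
\frac{3{,}000{,}000}{1{,}000{,}000} \times \$1 = \$3
```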
