Data Lake Storage

Microsoft Azure
Storage
Azure Data Lake Storage is a cloud-based storage solution optimized for big data analytics workloads. It stores and processes large amounts of structured and unstructured data at scale, and it suits a range of data processing scenarios such as batch processing, real-time stream processing, and interactive querying.
Azure Data Lake Storage is built on top of Azure Blob Storage and adds capabilities aimed specifically at big data workloads. It supports the Hadoop Distributed File System (HDFS) interface, which makes it compatible with existing Hadoop-based big data tools and applications. It also provides a hierarchical namespace for more efficient data access and management, and it supports fine-grained access control and auditing.
Terraform Name
azurerm_storage_data_lake_gen2_filesystem
Data Lake Storage attributes:

The following arguments are supported:

  • name - (Required) The name of the Data Lake Gen2 File System which should be created within the Storage Account. Must be unique within the Storage Account in which the File System is located. Changing this forces a new resource to be created.
  • storage_account_id - (Required) Specifies the ID of the Storage Account in which the Data Lake Gen2 File System should exist. Changing this forces a new resource to be created.
  • properties - (Optional) A mapping of Key to Base64-Encoded Values which should be assigned to this Data Lake Gen2 File System.
  • ace - (Optional) One or more ace blocks as defined below to specify the entries for the ACL for the path.
  • owner - (Optional) Specifies the Object ID of the Azure Active Directory User to make the owning user of the root path (i.e. /). Possible values also include $superuser.
  • group - (Optional) Specifies the Object ID of the Azure Active Directory Group to make the owning group of the root path (i.e. /). Possible values also include $superuser.

NOTE:

The Storage Account requires account_kind to be either StorageV2 or BlobStorage. In addition, is_hns_enabled has to be set to true.

An ace block supports the following:

  • scope - (Optional) Specifies whether the ACE represents an access entry or a default entry. Default value is access.
  • type - (Required) Specifies the type of entry. Can be user, group, mask or other.
  • id - (Optional) Specifies the Object ID of the Azure Active Directory User or Group that the entry relates to. Only valid for user or group entries.
  • permissions - (Required) Specifies the permissions for the entry in rwx form. For example, rwx gives full permissions but r-- only gives read permissions.

More details on ACLs can be found here: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-access-control#access-control-lists-on-files-and-directories
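
The ace, owner and group arguments described above can be combined as in the following minimal sketch. It is illustrative rather than definitive: the variables storage_account_id and analyst_object_id, the resource label acl_example and the file system name are hypothetical, and listing the base user, group, other and mask entries alongside the named entry is an assumption made for completeness, not a requirement stated in this page.

```hcl
provider "azurerm" {
  features {}
}

# Hypothetical input: ID of a Storage Account that already has
# is_hns_enabled = true (see the NOTE above).
variable "storage_account_id" {
  type = string
}

# Hypothetical input: Object ID of an Azure Active Directory user
# who should receive read/execute access to the root path.
variable "analyst_object_id" {
  type = string
}

# The identity Terraform runs as; used here as the owning user.
data "azurerm_client_config" "current" {}

resource "azurerm_storage_data_lake_gen2_filesystem" "acl_example" {
  name               = "acl-example"
  storage_account_id = var.storage_account_id

  # Owning user and owning group of the root path (/).
  owner = data.azurerm_client_config.current.object_id
  group = "$superuser"

  # Owning user entry: no id, full permissions.
  ace {
    scope       = "access" # the default value, shown for clarity
    type        = "user"
    permissions = "rwx"
  }

  # Named user entry: id is only valid for user or group entries.
  ace {
    type        = "user"
    id          = var.analyst_object_id
    permissions = "r-x"
  }

  # Owning group entry.
  ace {
    type        = "group"
    permissions = "r-x"
  }

  # Mask entry: upper bound for named user and group entries.
  ace {
    type        = "mask"
    permissions = "r-x"
  }

  # Everyone else: no access.
  ace {
    type        = "other"
    permissions = "---"
  }
}
```

Omitting scope on the later blocks relies on the documented default of access.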

Associating resources with a Data Lake Storage
Resources do not "belong" to a Data Lake Storage. Rather, one or more Security Groups are associated to a resource.
Create Data Lake Storage via Terraform:
The following HCL manages a Data Lake Gen2 File System within an Azure Storage Account.
Syntax:

resource "azurerm_resource_group" "example" {
 name     = "example-resources"
 location = "West Europe"
}

resource "azurerm_storage_account" "example" {
 name                     = "examplestorageacc"
 resource_group_name      = azurerm_resource_group.example.name
 location                 = azurerm_resource_group.example.location
 account_tier             = "Standard"
 account_replication_type = "LRS"
 account_kind             = "StorageV2"
 # Hierarchical namespace must be enabled for Data Lake Gen2 (boolean, not a string)
 is_hns_enabled           = true
}

resource "azurerm_storage_data_lake_gen2_filesystem" "example" {
 name               = "example"
 storage_account_id = azurerm_storage_account.example.id

 properties = {
   hello = "aGVsbG8="
 }
}
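
Because properties expects Base64-encoded values, it can be convenient to keep the plain-text values in a local and encode them with Terraform's built-in base64encode function. The sketch below assumes a hypothetical storage_account_id variable, a hypothetical plain_properties local and a hypothetical file system name. Encoding "hello" this way yields the same "aGVsbG8=" value used in the example above.

```hcl
provider "azurerm" {
  features {}
}

# Hypothetical input: ID of a Storage Account with is_hns_enabled = true.
variable "storage_account_id" {
  type = string
}

locals {
  # Hypothetical plain-text properties, kept readable in source control.
  plain_properties = {
    hello = "hello"
    team  = "data-platform"
  }
}

resource "azurerm_storage_data_lake_gen2_filesystem" "encoded" {
  name               = "encoded-example"
  storage_account_id = var.storage_account_id

  # Encode every value; the keys are passed through unchanged.
  properties = {
    for key, value in local.plain_properties : key => base64encode(value)
  }
}
```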

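Create Data Lake Storage via CLI:
Note: the az dls command group shown here manages Data Lake Storage Gen1 accounts and file systems. Gen2 file systems, which the Terraform resource above manages, are handled by the az storage fs command group.
Parameters: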

az dls fs create --account
                --path
                [--content]
                [--folder]
                [--force]

Example:

az dls fs create --account {account} --folder --path {path}
