AWS Glue
Edited March 1, 2022 by Pankaj Dange and Suresh Kumar Balasundaramsivaprakash
What is Glue and what functionality does it provide for us?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio [3].
When Should You Use It?
- When you need data integration, that is, preparing and combining data for analytics, application development, and machine learning.
- When you want seamless integration with AWS services. AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines running in your Amazon VPC. This reduces the effort of onboarding.
- When you want to focus on the business use case rather than on operational complexity. AWS Glue is serverless, so there are no compute resources to configure or manage. It handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running, which keeps costs down.
How Do I Learn About It?
- AWS digital training offers free, self-paced training
- https://aws.amazon.com/training/digital/
- AWS Training and Certification Blogs
- https://aws.amazon.com/blogs/training-and-certification/?nc2=sb_bl
- AWS Glue FAQs
- https://aws.amazon.com/glue/faqs/
Terminologies & Configurations
DPUs: Data Processing Units
- 1 DPU provides 4 vCPUs and 16 GB of memory
- Job types that run on DPUs: Apache Spark, Python shell, Spark Streaming
Security: supports encryption at rest.
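As a rough sketch of how the DPU figures above translate into job capacity and cost, the helper below multiplies out the per-DPU resources and estimates a charge. The function and class names are hypothetical, and the hourly rate is a caller-supplied assumption: actual prices vary by region and job type, and billing minimums are ignored here.

```python
from dataclasses import dataclass

# Per-DPU resources as stated above: 4 vCPUs and 16 GB of memory.
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


@dataclass(frozen=True)
class JobCapacity:
    dpus: int
    vcpus: int
    memory_gb: int


def capacity_for(dpus: int) -> JobCapacity:
    """Return the total vCPUs and memory for a given number of DPUs."""
    if dpus <= 0:
        raise ValueError("dpus must be positive")
    return JobCapacity(dpus, dpus * VCPUS_PER_DPU, dpus * MEMORY_GB_PER_DPU)


def estimate_cost(dpus: int, runtime_minutes: float, price_per_dpu_hour: float) -> float:
    """Estimate cost as DPUs x hours x hourly rate (billing minimums ignored)."""
    dpu_hours = dpus * (runtime_minutes / 60.0)
    return round(dpu_hours * price_per_dpu_hour, 2)


if __name__ == "__main__":
    # Hypothetical job: 10 DPUs for 30 minutes at an assumed 0.44 USD per DPU-hour.
    print(capacity_for(10))
    print(estimate_cost(10, 30, 0.44))
```

Check the current AWS Glue pricing page for the rate that applies to your region before relying on an estimate like this.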
Main Components in Glue
- AWS Glue Data Catalog - A central repository for your metadata. It holds information in metadata tables, with each table pointing to a single data store. In other words, it acts as an index to the schema, location, and runtime metrics of your data, which are then used to identify the sources and targets of your ETL (extract, transform, load) jobs.
- Job Scheduling System - Helps you automate and chain your ETL pipelines. It is a flexible scheduler that can set up event-based triggers and job execution schedules.
- ETL Engine - The component that handles ETL code generation. It generates the code automatically in Python or Scala and lets you customize the result.
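To make the scheduling and chaining described above more concrete, here is a minimal sketch that builds the request parameters for two kinds of Glue triggers: one that runs on a cron schedule and one that starts a job after another job succeeds. The helper names and the job names (`extract-orders`, `load-orders`) are hypothetical. The dictionary keys follow the Glue `CreateTrigger` request shape as exposed by boto3, but the sketch only builds the dictionaries and makes no AWS calls.

```python
from typing import Any, Dict, List


def scheduled_trigger(name: str, cron: str, job_names: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for a time-based (SCHEDULED) trigger."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job} for job in job_names],
        "StartOnCreation": True,
    }


def chained_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Build keyword arguments for a CONDITIONAL trigger that starts
    downstream_job once upstream_job has finished successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    # Hypothetical pipeline: extract at 02:00 UTC every day, then load.
    nightly = scheduled_trigger("nightly-extract", "0 2 * * ? *", ["extract-orders"])
    chain = chained_trigger("load-after-extract", "extract-orders", "load-orders")
    print(nightly)
    print(chain)
```

Assuming boto3 is installed and credentials are configured, either dictionary could be passed as `boto3.client("glue").create_trigger(**kwargs)` to create the trigger.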
Strengths
- Serverless - As a serverless data integration service, AWS Glue saves you the trouble of building and maintaining infrastructure. Amazon provisions and manages the servers.
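- Automatic schema discovery - Crawlers scan your data stores, infer their schemas, and record them as tables in the AWS Glue Data Catalog.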
- Automatic ETL code - AWS Glue can automatically generate ETL pipeline code in Scala or Python based on your data sources and destinations. This streamlines data integration and lets you parallelize heavy workloads.
- Increased data visibility - By acting as the metadata repository for information on your data sources and stores, the AWS Glue Data Catalog helps you keep tabs on all your data assets.
- Development endpoints - For users who prefer to write and test their own custom ETL scripts, AWS Glue supports the development process through what it calls “development endpoints.”
- Job scheduling - AWS Glue provides easy-to-use tools for creating and tracking jobs that run on a schedule, on event triggers, or on demand.
- Pay-as-you-go - The service does not require a long-term subscription commitment. You pay only for what you use.
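The idea behind automatic schema discovery can be illustrated with a small, self-contained sketch that infers column names and types from sample records. This is a toy illustration only and not the algorithm AWS Glue uses. The function names are hypothetical, and the output merely imitates the `Name`/`Type` column shape used in Data Catalog table definitions.

```python
from typing import Any, Dict, Iterable, List, Optional


def _type_name(value: Any) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_columns(records: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Infer a column list from sample records.

    Columns keep first-seen order. When types conflict, bigint and double
    widen to double; any other conflict falls back to string. None values
    carry no type information; a column that is always None becomes string.
    """
    inferred: Dict[str, Optional[str]] = {}
    for record in records:
        for key, value in record.items():
            current = inferred.get(key)
            if value is None:
                inferred.setdefault(key, None)
                continue
            seen = _type_name(value)
            if current is None or current == seen:
                inferred[key] = seen
            elif {current, seen} == {"bigint", "double"}:
                inferred[key] = "double"
            else:
                inferred[key] = "string"
    return [{"Name": name, "Type": typ or "string"} for name, typ in inferred.items()]


if __name__ == "__main__":
    # Hypothetical sample rows, such as parsed lines from a JSON file.
    sample = [
        {"id": 1, "price": 10, "active": True},
        {"id": 2, "price": 9.5, "active": False, "note": None},
    ]
    for column in infer_columns(sample):
        print(column)
```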
Limitations
- Although it is a serverless, managed service, customizing jobs requires more Apache Spark expertise than tools such as Talend or Informatica.
- AWS Glue is a good match for AWS services (S3, Redshift, RDS), but it is harder to use with non-AWS services or in multicloud environments.
- It is not ideal for real-time data use cases.
Alternative Options
- AWS Batch
- AWS DMS (AWS Database Migration Service)
- Databricks
- Informatica PowerCenter
- Alteryx Designer
- Talend Data Fabric
- Oracle GoldenGate
- Splunk
References
- Amazon’s official documentation - https://docs.aws.amazon.com/glue/index.html
- https://www.gartner.com/reviews/market/data-integration-tools/vendor/amazon-web-services/product/aws-glue/alternatives
- https://aws.amazon.com/glue/