AWS Glue API Essential Guide

Aug 7, 2024 · 6 minute read

What type of API does AWS Glue provide?

AWS Glue does not offer a REST, GraphQL, or SOAP API in the usual sense. It exposes an HTTPS web service API whose operations take JSON payloads and are signed with AWS credentials; in practice you call it through AWS SDKs and tools rather than directly. The key points about AWS Glue's API are:

  1. AWS Glue API: Glue resources such as jobs, crawlers, triggers, and Data Catalog objects can be created and managed programmatically.

  2. Access methods: There are three main ways to interact with AWS Glue programmatically:

    • Language SDK libraries
    • AWS CLI
    • AWS CloudFormation
  3. Web API Reference: The AWS Glue Web API Reference documents the underlying operations that the SDKs, the AWS CLI, and other tools call to communicate with AWS.

  4. API Operations: AWS Glue provides a range of API operations, including GetDataCatalogEncryptionSettings, PutDataCatalogEncryptionSettings, PutResourcePolicy, and more.

  5. Data Types: AWS Glue API uses various data types and structures, such as DataCatalogEncryptionSettings, EncryptionAtRest, and ConnectionPasswordEncryption.

  6. Security: AWS Glue API includes security-related operations and structures to manage encryption settings and resource policies.
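
As a minimal sketch of the encryption-related structures named above, the helper below builds a DataCatalogEncryptionSettings dictionary with its EncryptionAtRest and ConnectionPasswordEncryption members. The function name and the `kms_key_id` parameter are hypothetical, chosen only for illustration.

```python
from typing import Optional


def build_catalog_encryption_settings(kms_key_id: Optional[str]) -> dict:
    """Build a DataCatalogEncryptionSettings structure.

    kms_key_id is a hypothetical parameter for this sketch: pass a KMS key
    ID or ARN to enable encryption, or None to leave encryption disabled.
    """
    if kms_key_id is None:
        return {
            "EncryptionAtRest": {"CatalogEncryptionMode": "DISABLED"},
            "ConnectionPasswordEncryption": {
                "ReturnConnectionPasswordEncrypted": False
            },
        }
    return {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_id,
        },
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_id,
        },
    }
```

With boto3, the result would typically be passed as `boto3.client("glue").put_data_catalog_encryption_settings(DataCatalogEncryptionSettings=settings)`, assuming the caller has permission for that operation.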

Does the AWS Glue API have webhooks?

AWS Glue API and Webhooks

The official AWS Glue API does not have traditional webhooks. However, AWS Glue integrates with Amazon EventBridge (formerly known as CloudWatch Events) to provide event-driven automation and notifications.

Types of Events You Can Subscribe To

AWS Glue generates several types of events that you can subscribe to using Amazon EventBridge. Here are the main event types:

  1. Glue Job State Change Events:

    • Generated for job states: SUCCEEDED, FAILED, TIMEOUT, and STOPPED
  2. Glue Job Run Status Events:

    • Generated for job run statuses: RUNNING, STARTING, and STOPPING
    • These events are generated only when a job run exceeds the job delay notification threshold
  3. Glue Crawler State Change Events:

    • Generated for crawler states: Started, Succeeded, and Failed
  4. Glue Data Catalog Database State Change Events:

    • Generated for operations: CreateDatabase, DeleteDatabase, CreateTable, DeleteTable, and BatchDeleteTable
  5. Glue Data Catalog Table State Change Events:

    • Generated for operations: UpdateTable, CreatePartition, BatchCreatePartition, UpdatePartition, DeletePartition, BatchUpdatePartition, and BatchDeletePartition
  6. Glue Data Quality Events:

    • Generated for data quality evaluation results, such as FAILED state
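
To illustrate how a subscription to one of these event types might look, here is a sketch of an EventBridge event pattern that selects failed or timed-out job runs. The `matches` helper is a deliberately simplified local matcher written for this example (it is not the EventBridge matching engine), and the job name in the sample event is made up.

```python
import json

# Event pattern selecting Glue job runs that ended in FAILED or TIMEOUT.
FAILED_JOB_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local check: supports only exact-value lists and nested
    dicts, a small subset of the real EventBridge pattern language."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "example-job", "state": "FAILED"},
}

# EventBridge expects the pattern as a JSON string.
pattern_json = json.dumps(FAILED_JOB_PATTERN)
```

The JSON string could then be supplied as the `EventPattern` argument of `boto3.client("events").put_rule(Name=..., EventPattern=pattern_json)`, with a target such as an SNS topic or Lambda function attached separately.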

Key Points to Consider

  1. Event-driven architecture: AWS Glue can serve as both an event producer and consumer in an event-driven architecture.

  2. Near real-time delivery: Events from AWS services are delivered to EventBridge in near real-time.

  3. Customizable rules: You can write rules to specify which events are of interest and what automated actions to take when an event matches a rule.

  4. IAM role configuration: To use EventBridge with AWS Glue, you need to create an IAM role with appropriate permissions.

  5. No guaranteed delivery: AWS Glue does not guarantee delivery of EventBridge messages, so an event may be missed or delivered more than once. Make your event handlers idempotent as your use case requires.
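
One way to handle the idempotency point is to skip events whose `id` has already been processed. The sketch below keeps seen IDs in an in-memory set, which is an assumption made for brevity; a real handler would more likely use a durable store so that deduplication survives restarts. The function names are hypothetical.

```python
from typing import Callable, Set


def make_idempotent_handler(
    process: Callable[[dict], None],
) -> Callable[[dict], bool]:
    """Wrap `process` so each EventBridge event ID is handled at most once.

    Returns a handler that reports True when the event was processed and
    False when it was recognized as a duplicate and skipped.
    """
    seen_ids: Set[str] = set()  # assumption: in-memory only, not durable

    def handle(event: dict) -> bool:
        event_id = event["id"]  # every EventBridge event carries an "id"
        if event_id in seen_ids:
            return False
        process(event)
        # Record the ID only after processing succeeds, so a failed
        # attempt can be retried when the event is redelivered.
        seen_ids.add(event_id)
        return True

    return handle
```

Recording the ID after processing favors at-least-once handling: if `process` raises, the event is not marked as seen.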

Best Practices

  1. Configure EventBridge rules correctly to avoid sending unwanted events.

  2. Use the appropriate IAM roles and policies to ensure secure event handling.

  3. Consider using Amazon EventBridge for alerts and notifications: it requires only a one-time rule setup and is more flexible than CloudWatch alarms alone.

  4. When working with data quality events, enable the "Publish metrics to Amazon CloudWatch" checkbox when starting an AWS Glue Data Quality run to ensure metrics are published.

In summary, while AWS Glue doesn't offer traditional webhooks, it provides robust event integration through Amazon EventBridge, allowing you to subscribe to a wide range of events related to jobs, crawlers, data catalog changes, and data quality evaluations.

Rate Limits and other limitations

Here are the key points regarding API rate limits for AWS Glue:

API Rate Limiting

AWS Glue API requests are throttled on a per-Region basis for each AWS account to help maintain service performance [5]. When rate limits are exceeded, you may receive errors such as:

  • ThrottlingException
  • Rate exceeded
  • Error Code: ThrottlingException
  • Glue.AWSGlueException with "Rate exceeded" message

These errors indicate that you have exceeded the allowed API request rate for AWS Glue in that particular Region [5].

Best Practices to Avoid Rate Limiting

To mitigate ThrottlingException or rate exceeded errors, AWS recommends the following best practices [5]:

  1. Reduce the frequency of API calls
  2. Stagger the intervals between API calls so they don't all run simultaneously
  3. Use APIs that return multiple values in a single call (e.g. GetPartitions supports 1000 values per call)
  4. Implement error retries and exponential backoff when making API calls
  5. Use AWS CloudTrail console to check which and how many API calls are sent during a given time period
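
The retry recommendation can be sketched as a small wrapper that applies exponential backoff with random jitter. Everything here is illustrative: the function name, the `is_throttled` predicate, and the default delays are assumptions, not values published by AWS.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    is_throttled: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on throttling errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # The cap doubles each attempt; a random delay under the cap
            # keeps many clients from retrying in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise RuntimeError("max_attempts must be at least 1")
```

With boto3, `is_throttled` might check that the exception is a `botocore.exceptions.ClientError` whose `response["Error"]["Code"]` equals `"ThrottlingException"`. The SDK also has built-in retry modes that can be configured through `botocore.config.Config(retries={...})`, which may make a hand-written wrapper unnecessary.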

Requesting Quota Increases

If you still encounter rate exceeded errors after implementing the best practices, you can request a service quota increase [5]. Before submitting a request, identify the specific API call causing the error and the current call rate.

Specific Limits

Exact numbers for most API rate limits are not listed here, but one specific quota is:

  • Number of jobs per trigger: 50 [4]

Additional Information

  • AWS Glue endpoints and quotas are documented in the AWS General Reference [1][2]
  • Rate limits are Region-specific unless otherwise noted [2]
  • You can contact AWS Support to request quota increases for many service quotas [2]

It's important to note that these limits may change over time, so it's always best to consult the official AWS documentation for the most up-to-date information on API rate limits for AWS Glue.

Latest API Version

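The AWS Glue web API itself carries a single, fixed version identifier (2017-03-31). What changes over time is the AWS Glue version, meaning the job runtime. Here are the key points about the most recent AWS Glue version: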

Latest Version

As of this writing (August 2024), the most recent version of AWS Glue is AWS Glue 4.0 [1][5].

Key Features of AWS Glue 4.0

  • Updated engines:

    • Python 3.10
    • Apache Spark 3.3.0 [1][5]
  • New and updated features:

    • Native support for open-data lake frameworks like Apache Hudi, Delta Lake, and Apache Iceberg [3]
    • Native support for the Amazon S3-based Cloud Shuffle Storage Plugin for Apache Spark, which helps scale disk usage [3][5]
    • Adaptive Query Execution for dynamic query optimization [5]
  • Performance and reliability improvements:

    • Optimized Spark runtime that can be 2-3 times faster than the basic open source version [5]
    • Bug fixes and performance enhancements in both Python and Spark engines [5]

Other Important Points

  • AWS Glue versions are tied to specific versions of Apache Spark and Python [1]
  • Users must select a particular Glue version when creating a job to ensure compatibility [5]
  • Each new version of Glue includes performance and reliability benefits in addition to new features [5]
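
Because the Glue version is chosen per job, a job definition usually states it explicitly. The sketch below builds the arguments for a Spark ETL job pinned to a given version; the helper name, the worker settings, and the default of "4.0" are assumptions for illustration.

```python
def build_job_definition(
    name: str,
    role_arn: str,
    script_location: str,
    glue_version: str = "4.0",
) -> dict:
    """Build keyword arguments for creating a Spark ETL job.

    The worker type and worker count below are placeholder choices for
    this sketch, not recommendations.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": glue_version,  # pins the Spark/Python runtime
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }
```

The resulting dictionary could be passed to boto3 as `boto3.client("glue").create_job(**definition)`. Upgrading a job later then amounts to changing `GlueVersion` and re-validating the script.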

Best Practices

  • Plan to upgrade Glue jobs over time to take advantage of new features and improvements [5]
  • Validate AWS Glue jobs before migrating across major AWS Glue version releases [1]

In summary, AWS Glue 4.0 is the latest version as of this writing, offering updated engines, native support for open data lake frameworks, and various performance improvements. Upgrading jobs to this version is recommended to benefit from these enhancements, provided you first validate compatibility with existing workflows.

How to get an AWS Glue developer account and API Keys?

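AWS Glue has no separate developer account or service-specific API keys: you use a standard AWS account and authenticate with IAM credentials. To get access to AWS Glue and create an API integration, follow these steps: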

1. Set up an AWS account

If you don't already have an AWS account, you'll need to create one. This will give you access to AWS services including AWS Glue.

2. Enable AWS Glue in your account

  • Log into the AWS Management Console
  • Navigate to the AWS Glue service
  • Follow the prompts to enable AWS Glue for your account if it's not already enabled

3. Set up IAM permissions

  • Create an IAM user or role with the necessary permissions to use AWS Glue
  • At minimum, you'll need permissions for AWS Glue, as well as related services like S3, CloudWatch, etc.

4. Create an AWS Glue development endpoint (optional)

  • In the AWS Glue console, go to "Dev endpoints" and create a new endpoint
  • This gives you an environment to interactively develop ETL code

5. Use the AWS Glue API

  • You can interact with AWS Glue programmatically using:
    • AWS SDKs for various programming languages
    • AWS CLI
    • AWS CloudFormation
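
As a sketch of SDK usage, the function below starts a job run and polls until it finishes. It takes the Glue client as a parameter so it can be exercised without AWS access; the function name, the polling interval, and the exact set of terminal states are assumptions to adjust for your case.

```python
import time
from typing import Callable

# Assumption: states treated as final for this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED", "ERROR"}


def run_job_and_wait(
    glue,
    job_name: str,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a Glue job run and block until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

A typical call would be `run_job_and_wait(boto3.client("glue"), "my-job")`, where the job name is a placeholder. Polling consumes API calls, so for long-running jobs an EventBridge rule is often the gentler option.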

6. Create API integration

  • Determine which external API you want to integrate with
  • Use AWS Lambda to make API calls and process responses
  • Store API credentials securely in AWS Secrets Manager
  • Use AWS Glue jobs to transform and load the API data
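
For the credentials step, a Lambda function or Glue job might read the external API's credentials from Secrets Manager at run time. This sketch assumes the secret is stored as a JSON string; the function name and the secret layout are hypothetical.

```python
import json


def load_api_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a secret and parse it as JSON.

    Assumes the secret value is a JSON object stored as a string, for
    example {"api_key": "..."}; adjust if your secret is stored differently.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

With boto3 the client would be `boto3.client("secretsmanager")`. Keeping the client as a parameter makes the function easy to test with a stub and avoids hard-coding a Region.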

7. Orchestrate the pipeline

  • Use AWS Step Functions to orchestrate the end-to-end API data pipeline
  • This allows you to coordinate Lambda functions, Glue jobs, and other AWS services
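
The orchestration step could be expressed as a two-state Step Functions definition: a Lambda task that fetches data from the external API, followed by a Glue job that transforms it. The state names and the helper below are made up for illustration, and the definition omits retries and error handling.

```python
import json


def build_pipeline_definition(function_name: str, job_name: str) -> dict:
    """Build a minimal state machine: Lambda fetch, then Glue transform."""
    return {
        "Comment": "Sketch of an API-to-Glue pipeline",
        "StartAt": "FetchFromApi",
        "States": {
            "FetchFromApi": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": function_name,
                    "Payload.$": "$",  # forward the execution input
                },
                "Next": "TransformWithGlue",
            },
            "TransformWithGlue": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job run.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            },
        },
    }


definition_json = json.dumps(
    build_pipeline_definition("fetch-api-data", "transform-api-data"),
    indent=2,
)
```

The JSON string is what `boto3.client("stepfunctions").create_state_machine(name=..., definition=definition_json, roleArn=...)` expects; the role must be allowed to invoke the function and start the job.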

What can you do with the AWS Glue API?

The AWS Glue API exposes several components of the service. Here is a list of the key components and what is possible with each:

AWS Glue Data Catalog

  • Store, index, and search across multiple data sources and sinks [4]
  • Automatically discover and infer schema information using AWS Glue crawlers [4]
  • Manage schemas and permissions for databases and tables [4]
  • Store table definitions, physical locations, and business-relevant attributes for datasets [5]
  • Track how data has changed over time [5]
  • Serve as a drop-in replacement for Apache Hive Metastore [5]
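
As an example of reading from the Data Catalog, the generator below lists the table names in a database, following `NextToken` across pages. The function name is hypothetical, and the client is passed in so the logic can be tested with a stub.

```python
from typing import Iterator


def iter_table_names(glue, database_name: str) -> Iterator[str]:
    """Yield the name of every table in a Data Catalog database."""
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            yield table["Name"]
        token = page.get("NextToken")
        if not token:
            return  # no further pages
        kwargs["NextToken"] = token
```

Usage might look like `list(iter_table_names(boto3.client("glue"), "my_database"))`, where the database name is a placeholder. boto3 also offers `get_paginator("get_tables")`, which handles the token loop for you.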

ETL Jobs

  • Define, schedule, and run ETL jobs [3]
  • Visually create, run, and monitor ETL workflows using AWS Glue Studio [5]
  • Generate Scala or Python code for ETL jobs [3]
  • Run jobs on demand or based on specified triggers (time-based or event-based) [3]
  • Clean and transform streaming data in transit [4]
  • Use built-in machine learning for data deduplication and cleansing (FindMatches feature) [4]

Crawlers

  • Define crawlers to populate the AWS Glue Data Catalog with metadata table definitions [3]
  • Automatically infer schema information from various data sources [4]
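
Running an existing crawler from code might look like the sketch below, which starts the crawler and waits for it to return to the READY state. It assumes the crawler leaves READY as soon as it is started; the function name and polling interval are illustrative.

```python
import time
from typing import Callable


def crawl_and_wait(
    glue,
    crawler_name: str,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a crawler, wait until it is READY again, and return the
    status of its last crawl (or "UNKNOWN" if none is reported)."""
    glue.start_crawler(Name=crawler_name)
    while True:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        sleep(poll_seconds)
```

Once the crawl completes, the tables it created or updated are visible in the Data Catalog and can be queried or used as job sources.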

Connections

  • Connect to a wide variety of data sources, both on-premises and on AWS [4]
  • Integrate with over 80 data sources, including databases in Amazon VPC and streaming sources [1]

AWS Glue DataBrew

  • Visually enrich, clean, and normalize data without writing code [5]
  • Use over 250 built-in transformations for data preparation [5]

AWS Glue Elastic Views

  • Create views over data stored in multiple types of AWS data stores [5]
  • Materialize views in a target data store using PartiQL queries [5]

AWS Glue Schema Registry

  • Manage and enforce schemas for data streams [5]

Sensitive Data Detection

  • Define, identify, and process sensitive data in data pipelines and data lakes [4]

Interactive Sessions and Notebooks

  • Use built-in job notebooks for serverless development environments [4]
  • Interactively explore, debug, and test ETL code using AWS Glue interactive sessions [4]

By using these components through the AWS Glue API, you can perform a wide range of data integration, preparation, and transformation tasks in a serverless environment, making it easier to work with data for analytics, machine learning, and application development.