Enhancing Code Reusability with Python Packages in AWS Glue

Code reusability is key to building efficient, maintainable, and scalable ETL pipelines. If you’re working with AWS Glue, leveraging Python packages can significantly reduce duplication, simplify maintenance, and improve the overall architecture of your data workflows. In this article, we’ll explore how to structure, create, and deploy Python packages for use in AWS Glue.
The Problem with Duplication
Imagine you’re managing several Glue jobs, each with similar logic for extracting, transforming, and loading (ETL) data. Initially, you might copy and paste code across jobs. However, when updates or bug fixes are needed, you’re forced to manually modify each job, which is time-consuming and error-prone. Using Python packages can solve this issue by centralizing shared logic into reusable components.
Structuring a Python Package
A well-structured Python package forms the backbone of your reusable code. When designing your package, consider the following:
- Base Class: Define shared parameters and functions for tasks like extraction, transformation, and loading.
- Child Classes: Customize extraction and transformation logic for specific use cases.
- Avoid Duplication: For example, if multiple jobs load data to BigQuery, implement the loading logic in the base class and reuse it.
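As a rough illustration of the points above, here is a minimal sketch of a base class with one child. All names (BaseEtl, OrdersEtl, destination_table) are hypothetical, and the load step writes to an in-memory list as a stand-in for a real target such as BigQuery.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Rows = List[Dict]


class BaseEtl(ABC):
    """Shared parameters and the shared load step live here (hypothetical names)."""

    def __init__(self, destination_table: str):
        self.destination_table = destination_table
        self.loaded_rows: Rows = []  # in-memory stand-in for the real target

    @abstractmethod
    def extract(self) -> Rows:
        """Each job decides where its rows come from."""

    @abstractmethod
    def transform(self, rows: Rows) -> Rows:
        """Each job decides how its rows are reshaped."""

    def load(self, rows: Rows) -> int:
        # Written once and reused by every job. A real implementation would
        # write to the warehouse instead of appending to a list.
        self.loaded_rows.extend(rows)
        return len(rows)

    def run(self) -> int:
        # The fixed extract -> transform -> load sequence shared by all jobs.
        return self.load(self.transform(self.extract()))


class OrdersEtl(BaseEtl):
    """A child class that customizes only extraction and transformation."""

    def extract(self) -> Rows:
        return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3"}]

    def transform(self, rows: Rows) -> Rows:
        return [{**row, "amount": float(row["amount"])} for row in rows]


job = OrdersEtl(destination_table="analytics.orders")
print(job.run())         # number of rows handed to load()
print(job.loaded_rows)   # the transformed rows
```

Marking extract and transform as abstract is one design choice among several: it makes Python refuse to instantiate a job that forgot to implement either step, while load and run stay in a single place.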
Example Use Case
Let’s say you have three ETL jobs:
- ETL 1 and ETL 2 extract data from S3.
- ETL 3 extracts data from HubSpot.
To avoid duplicating extraction logic for S3, create an intermediate class that inherits from the base class and specializes in extracting data from S3. Each ETL job can then inherit from either the base or the intermediate class, customizing only what’s necessary.
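The three-job scenario could be laid out roughly as follows. This is a sketch under stated assumptions: the class names are made up, FAKE_S3 is a dictionary standing in for a real bucket, and the HubSpot extraction is replaced by a hardcoded row so the example runs without any external service.

```python
from typing import Dict, List

Rows = List[Dict]

# Hypothetical stand-in for S3 so the sketch runs without AWS access.
FAKE_S3 = {
    "raw/orders.json": [{"id": 1, "amount": "10.5"}],
    "raw/customers.json": [{"id": 7, "name": " Ada "}],
}


class BaseEtl:
    def extract(self) -> Rows:
        raise NotImplementedError

    def transform(self, rows: Rows) -> Rows:
        return rows  # default: pass rows through unchanged

    def load(self, rows: Rows) -> Rows:
        return rows  # placeholder for the shared load step

    def run(self) -> Rows:
        return self.load(self.transform(self.extract()))


class S3Etl(BaseEtl):
    """Intermediate class: S3 extraction written once for ETL 1 and ETL 2."""

    def __init__(self, key: str):
        self.key = key

    def extract(self) -> Rows:
        return list(FAKE_S3[self.key])


class Etl1(S3Etl):
    def transform(self, rows: Rows) -> Rows:
        return [{**r, "amount": float(r["amount"])} for r in rows]


class Etl2(S3Etl):
    def transform(self, rows: Rows) -> Rows:
        return [{**r, "name": r["name"].strip()} for r in rows]


class Etl3(BaseEtl):
    """Inherits from the base directly and brings its own extraction."""

    def extract(self) -> Rows:
        return [{"contact_id": 42}]  # stand-in for a HubSpot API call


print(Etl1("raw/orders.json").run())
print(Etl2("raw/customers.json").run())
print(Etl3().run())
```

ETL 1 and ETL 2 never touch extraction code; a change to how S3 is read happens once, in S3Etl.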
Creating a Python Package
Once your code is structured, you need to package it. Here’s how:
1. Folder Structure:
   - packages
     - setup.py
     - package
       - __init__.py (can be empty; find_packages() only detects folders that contain it)
       - base.py
2. Setup File: Use the setup.py file to define package metadata and dependencies. For example:
   from setuptools import setup, find_packages

   setup(
       name="package",
       version="1.0",
       packages=find_packages(),
   )
3. Build the Package: Run the following command to generate the distribution files:
   python setup.py bdist_wheel
This creates a .whl file in the dist/ folder, which you can upload to S3.
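The name of the generated file matters later, because both the Glue job and the deployment pipeline refer to it. The helper below is a small sketch of how that name is derived for a pure-Python package, following the wheel naming convention (name, version, then the py3-none-any tags). The function name wheel_filename is made up for this example, and recent setuptools releases may apply further normalization, such as lowercasing the name.

```python
import re


def wheel_filename(name: str, version: str) -> str:
    """Expected file name for a pure-Python wheel built with bdist_wheel."""
    # Runs of characters other than letters, digits, underscores and dots
    # become a single underscore in the file name.
    safe_name = re.sub(r"[^\w\d.]+", "_", name)
    # py3-none-any: Python 3, no ABI requirement, any platform.
    return f"{safe_name}-{version}-py3-none-any.whl"


print(wheel_filename("package", "1.0"))           # package-1.0-py3-none-any.whl
print(wheel_filename("my-etl-package", "0.1.0"))  # my_etl_package-0.1.0-py3-none-any.whl
```

With the setup.py shown earlier (name "package", version "1.0"), the file to look for in dist/ would therefore be package-1.0-py3-none-any.whl.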
Configuring AWS Glue to Use Python Packages
To use your package in AWS Glue:
- Upload the Package to S3: Place the .whl file in a designated S3 bucket.
- Configure Glue Job: Add the S3 path of the package as a parameter in your Glue job under “Python library path”.
For example:
s3://your-bucket-name/packages/my_etl_package-0.1.0-py3-none-any.whl
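If you create jobs from code rather than the console, the same setting can be passed as a job argument. The sketch below assumes that the “Python library path” field corresponds to the --extra-py-files job parameter, and it only builds the keyword arguments for boto3's Glue create_job call. The function name, job name, role ARN, and script location are placeholders, not values from a real account.

```python
def glue_job_settings(job_name, role_arn, script_s3_uri, wheel_s3_uri):
    """Build keyword arguments for boto3's Glue create_job call."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",               # Spark ETL job type
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        # Assumed equivalent of the console's "Python library path" field.
        "DefaultArguments": {"--extra-py-files": wheel_s3_uri},
    }


settings = glue_job_settings(
    job_name="etl-1",                                           # placeholder
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",      # placeholder
    script_s3_uri="s3://your-bucket-name/scripts/etl_1.py",     # placeholder
    wheel_s3_uri="s3://your-bucket-name/packages/my_etl_package-0.1.0-py3-none-any.whl",
)
print(settings["DefaultArguments"])

# With AWS credentials configured, the job could then be created with:
#   import boto3
#   boto3.client("glue").create_job(**settings)
```

Inside the job script, the package is then imported by its top-level folder name, for example import package.base for the layout shown earlier.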
Automating Deployment with CI/CD
Manually updating your package for every change is tedious. Implementing a CI/CD pipeline can streamline this process. Here’s how:
Prerequisites:
- An AWS IAM user with permissions to upload to S3.
- Access to GitHub Actions (or a similar CI/CD tool).
GitHub Actions Workflow: Create a .github/workflows/deploy.yml file with the following steps:
- Trigger Workflow: Only run the pipeline when changes occur in the packages/ folder.
- Set Up AWS Credentials:
  - name: Setting up AWS Credentials
    run: |
      aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
      aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      aws configure set default.region us-east-2
- Build and Upload Package: ETL_PACKAGE_VERSION is an environment variable defined in the workflow; it must match the version in setup.py, because the wheel's file name is derived from that version.
  - name: Upload etl package
    run: |
      cd packages/
      python3 setup.py bdist_wheel
      aws s3 cp dist/package-${{ env.ETL_PACKAGE_VERSION }}-py3-none-any.whl s3://enabledata/packages/package-${{ env.ETL_PACKAGE_VERSION }}-py3-none-any.whl
- Secure Your Secrets: Store AWS credentials securely using GitHub Secrets.
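The build-and-upload step spells out the wheel's file name, so the version has to be kept in sync by hand. As one possible alternative (not part of the original workflow), a small Python script run from the packages/ folder could pick up whatever wheel the build produced and upload it. The function names and the "packages" prefix are assumptions made for this sketch; upload_file is boto3's standard S3 client method.

```python
from pathlib import Path


def find_built_wheel(dist_dir):
    """Return the most recently modified .whl file in dist_dir."""
    wheels = sorted(Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime)
    if not wheels:
        raise FileNotFoundError(f"no .whl file found in {dist_dir}")
    return wheels[-1]


def upload_wheel(wheel_path, bucket, prefix="packages"):
    """Upload the wheel to s3://<bucket>/<prefix>/<file name> and return that URI."""
    import boto3  # imported lazily so find_built_wheel works without boto3

    key = f"{prefix}/{wheel_path.name}"
    boto3.client("s3").upload_file(str(wheel_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage after running the build (requires AWS credentials):
#   wheel = find_built_wheel("dist")
#   print(upload_wheel(wheel, bucket="your-bucket-name"))
```

Because the file name is read from dist/ rather than assembled from a variable, a version bump in setup.py would not require a matching change in the workflow.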
Summary
Using Python packages with AWS Glue allows you to:
- Centralize and reuse code.
- Reduce maintenance overhead.
- Simplify updates with a CI/CD pipeline.
Following these steps keeps your shared ETL logic in one versioned package instead of scattered copies, which makes your Glue jobs easier to maintain as they grow in number.