Here we explain how to connect Amazon Glue to a Java Database Connectivity (JDBC) database.
The reason you would do this is to be able to run ETL jobs on data stored in various systems. For example, you could:
In this tutorial, we use PostgreSQL running on an EC2 instance. Glue supports Postgres, MySQL, Redshift, and Aurora databases. To use other databases, you would have to provide your own JDBC jar file.
Unfortunately, configuring Glue to crawl a JDBC database requires that you understand how to work with Amazon VPC (virtual private clouds). I say unfortunately because application programmers don’t tend to understand networking. Amazon requires this so that your traffic does not go over the public internet.
Fortunately, EC2 creates these network gateways (VPC and subnet) for you when you spin up virtual machines. All you need to do is set the firewall rules in the default security group for your virtual machine.
If you do this step wrong, or skip it entirely, you will get the error:
ERROR : At least one security group must open all ingress ports. To limit traffic, the source security group in your inbound rule can be restricted to the same security group
Glue can only crawl networks in the same AWS region—unless you create your own NAT gateway.
Look at the EC2 instance where your database is running and note the VPC ID and Subnet ID.
Go to Security Groups and pick the default one. You might have to clear out the filter at the top of the screen to find that.
Add an All TCP inbound firewall rule. Then attach the default security group ID.
Don't use your Amazon console root login. Use an IAM user. For all Glue operations they will need: AWSGlueServiceRole and AmazonS3FullAccess or some subset thereof.
Your Glue security rule will look something like this:
arn:aws:iam::(XXXX):role/service-role/AWSGlueServiceRole-S3IAMRole
In Amazon Glue, create a JDBC connection. It should look something like this:
Type JDBC JDBC URL jdbc:postgresql://xxxxxx:5432/inventory VPC Id vpc-xxxxxxx Subnet subnet-xxxxxx Security groups sg-xxxxxx Require SSL connection false Description - Username xxxxxxxx Created 30 August 2020 9:37 AM UTC+3 Last modified 30 August 2020 4:01 PM UTC+3
Create a Glue database. This is basically just a name with no other parameters, in Glue, so it's not really a database.
Next, define a crawler to run against the JDBC database. The include path is the database/table in the case of PostgreSQL.
For other databases, look up the JDBC connection string.
Then you run the crawler, it provides a link to the logs stored in CloudWatch. Look there for errors or success.
If you have done everything correctly, it will generate metadata in tables in the database. This is not data. It's just a schema for your tables.
For more tutorials like this, explore these resources: