Using Auto Loader on Azure Databricks with AWS S3
Problem
Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume from AWS S3 into our Azure hosted data platform. We opted for Auto Loader over other solutions because it is native to Databricks and lets us quickly ingest data from Azure Storage Accounts and AWS S3 buckets, while using Structured Streaming checkpoints to track which files have already been loaded. It also makes us less dependent on additional systems to provide that “what did we last load” context.
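For context, Auto Loader is exposed as a Structured Streaming source called cloudFiles. A minimal sketch (the paths and file format below are placeholders, not from our project) looks something like this, where the checkpoint location is what records which files have already been loaded:

```python
# Minimal Auto Loader stream (placeholder paths and format).
# The checkpoint location records which files have been ingested,
# so a rerun only picks up new files.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schema")
    .load("/mnt/landing/input")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/landing/_checkpoint")
    .trigger(once=True)
    .start("/mnt/landing/bronze")
)
```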
We followed the steps in the Microsoft Docs to load files from AWS S3 using Auto Loader, but we hit an error message that couldn’t be easily resolved in the Azure instance of Databricks:
shaded.databricks.org.apache.hadoop.fs.s3a.AWSClientIOException:
Instantiate shaded.databricks.org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on :
com.amazonaws.AmazonClientException: No AWS Credentials provided by InstanceProfileCredentialsProvider :
com.amazonaws.SdkClientException:
The requested metadata is not found at http://169.254.169.254/latest/meta-data/iam/security-credentials/:
No AWS Credentials provided by InstanceProfileCredentialsProvider :
com.amazonaws.SdkClientException: The requested metadata is not found at
http://169.254.169.254/latest/meta-data/iam/security-credentials/
Azure doesn’t have the notion of an InstanceProfile, but AWS does, so marrying the two cloud platforms was going to be a challenge.
Solution
We realised, through trial and error, that the role which had been provisioned for us in AWS would allow us to query data in S3 through Databricks using temporary credentials. The challenge for us would be to allow Databricks, and potentially other services, to use those temporary credentials in a secure and repeatable manner.
We initially went down the route of having the temporary credentials refreshed by an Azure Function and stored in Key Vault, so that other services, such as Azure Data Factory, could also access them. After implementing that approach, however, we realised we didn’t need it at all. If you do need credentials shared across other services, the full approach is written up over on UstDoes.Tech.
Because Databricks supports Python through PySpark, it was clear that once we had the method working in one place (Azure Functions), we could easily bring it over into Databricks. Doing it directly in Databricks is a much cleaner and more straightforward approach than the Functions route, especially if only Databricks needs those temporary credentials.
Code to get this working in your environment is below:
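We can’t reproduce the client notebook, so the following is a minimal sketch of the approach rather than our exact code: the secret scope, role ARN, bucket and lake paths are all placeholders, and it assumes a long-lived AWS access key pair (with permission to assume the role) is stored in a Databricks secret scope. Depending on your Databricks runtime, the credentials provider class name may need to be the shaded S3A variant seen in the error above rather than the stock Hadoop one.

```python
import boto3

# Placeholder names: adjust the secret scope, role ARN, bucket and paths
# to your own environment.
aws_access_key = dbutils.secrets.get(scope="aws", key="access-key-id")
aws_secret_key = dbutils.secrets.get(scope="aws", key="secret-access-key")
role_arn = "arn:aws:iam::123456789012:role/s3-reader-role"

# Exchange the long-lived key pair for temporary credentials by assuming the role.
sts = boto3.client(
    "sts",
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
)
creds = sts.assume_role(
    RoleArn=role_arn,
    RoleSessionName="databricks-autoloader",
    DurationSeconds=3600,
)["Credentials"]

# Hand the temporary credentials (including the session token) to the S3A
# filesystem, so it no longer falls back to InstanceProfileCredentialsProvider.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", creds["AccessKeyId"])
hadoop_conf.set("fs.s3a.secret.key", creds["SecretAccessKey"])
hadoop_conf.set("fs.s3a.session.token", creds["SessionToken"])

# Auto Loader can now stream files from the S3 bucket in directory-listing mode.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/datalake/_schemas/s3_landing")
    .load("s3a://my-source-bucket/landing/")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/s3_landing")
    .trigger(once=True)
    .start("/mnt/datalake/bronze/s3_landing")
)
```

Because the STS credentials expire (an hour in this sketch), the role is re-assumed at the start of every run, which pairs nicely with a trigger-once stream on a scheduled job; the Auto Loader checkpoint still guarantees each file is only loaded once across runs.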
Conclusion
Configuring Databricks Auto Loader to load data in from AWS S3 is not as straightforward as it sounds, particularly if you are hindered by AWS roles that only work with temporary credentials.
This is one way of getting it to work inside Databricks, and if those temporary credentials need to be used by other services as well, there are other approaches.
Cross-cloud was never going to be easy, but now it is a reality.