Get Spark to use your AWS credentials file for S3
Spark can access files in S3, even when running in local mode, given AWS
credentials. By default, with s3a URLs, Spark will search for credentials
in a few different places:
- Hadoop configuration properties
- Standard AWS environment variables
- EC2 instance profile, which picks up IAM roles
However, by default it will not pick up credentials from the AWS credentials
file, which is useful during local development if you are authenticating to
AWS through SAML federation instead of with an IAM user.
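For readers who have not looked inside the credentials file, here is a minimal sketch of its layout and of how a named profile might be chosen. It assumes the usual INI-style format (the file normally lives at ~/.aws/credentials) and the AWS_PROFILE environment variable that the AWS CLI uses. The profile names, key values, and the read_profile helper are all made up for illustration. This is not how the AWS SDK itself is implemented.

```python
import configparser
import os

# Placeholder contents in the layout of an AWS credentials file.
# Every value here is invented for illustration.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = EXAMPLE_DEFAULT_KEY_ID
aws_secret_access_key = EXAMPLE_DEFAULT_SECRET

[saml]
aws_access_key_id = EXAMPLE_SAML_KEY_ID
aws_secret_access_key = EXAMPLE_SAML_SECRET
aws_session_token = EXAMPLE_SAML_SESSION_TOKEN
"""


def read_profile(credentials_text, env=None):
    """Return the settings of the profile named by AWS_PROFILE, else "default"."""
    env = os.environ if env is None else env
    profile = env.get("AWS_PROFILE", "default")
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    return dict(parser[profile])
```

A SAML login tool would typically write temporary keys plus a session token into a profile like the hypothetical "saml" one above, which is why being able to read this file matters for local development.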
The way to make this work is to set the S3A credentials provider property to
com.amazonaws.auth.DefaultAWSCredentialsProviderChain, which works the same
way as the AWS CLI: it will honor the AWS environment variables as well as the
credentials file, including the environment variable used to select from
profiles.
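As a rough sketch of the setting described above, the snippet below builds a local-mode session from PySpark. It assumes the property in question is fs.s3a.aws.credentials.provider (the Hadoop S3A setting for credential providers) and that it is passed through Spark's spark.hadoop. prefix. It also assumes pyspark is installed and a hadoop-aws build based on the AWS SDK for Java v1 is on the classpath. The helper names and the app name are made up for illustration.

```python
# Hadoop S3A property that names the credentials provider class.
S3A_PROVIDER_KEY = "fs.s3a.aws.credentials.provider"
# AWS SDK for Java v1 provider chain named in the text.
DEFAULT_CHAIN = "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"


def s3a_credentials_conf():
    """Spark settings that point S3A at the default AWS provider chain.

    Spark copies any "spark.hadoop."-prefixed setting into the Hadoop
    configuration, so this ends up as fs.s3a.aws.credentials.provider.
    """
    return {"spark.hadoop." + S3A_PROVIDER_KEY: DEFAULT_CHAIN}


def build_local_session(app_name="s3a-local-test"):
    """Build a local-mode SparkSession with the settings above applied.

    Assumes pyspark and hadoop-aws are available; the app name is arbitrary.
    """
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in s3a_credentials_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from build_local_session(), reading an s3a:// URL should then go through the provider chain. The same key and value can also be passed on the command line with spark-submit's --conf option instead of being set in code.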
It’s not clear to me why this was not included by default, but it’s typically
only an issue for individual developers running Spark in local mode for testing.
Once you’re running on an EMR cluster it’s a non-issue because you will likely
be running with EMRFS anyway with s3: URLs, and even if you use s3a URLs,
Spark will pick up IAM roles from the EC2 instance profile.