Adding files from Amazon S3

6th April, 2022

Amazon S3 is an object storage service provided by Amazon Web Services (AWS). It can store files in any format. Dataedo provides a native connector for documenting files stored in S3 in the following formats:

  • JSON,
  • CSV,
  • Apache Avro,
  • Apache Parquet,
  • Apache ORC,
  • Delta Lake,
  • Microsoft Excel,
  • XML.

Prerequisites

IAM User

To document objects stored in S3 with Dataedo, you need an IAM user with S3 read access, which Dataedo will use to connect to the bucket. To create this user:

  1. Open the IAM service in the AWS Console,
  2. Open the Users tab,
  3. Click the Add Users button,
  4. Set the user name,
  5. In Select AWS Credential Type, check Access key – Programmatic access,
  6. In the Permissions section:
    • select Attach existing policies directly,
    • search for AmazonS3ReadOnlyAccess and check the policy,
  7. (Optional) Set tags,
  8. Review the options and, if everything is correct, click Create User,
  9. After creating the user, save the Access Key and the Secret Key, as you will need them later to authenticate to S3 when connecting with Dataedo.
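
The AWS-managed AmazonS3ReadOnlyAccess policy grants read access to every bucket in the account. If you prefer to limit the user to a single bucket, a customer-managed policy along the following lines is one possible alternative. This is only a sketch: the bucket name is a placeholder, the helper name `read_only_policy` is made up for illustration, and this article does not confirm that these two actions are everything Dataedo needs, so test the policy before relying on it.

```python
import json


def read_only_policy(bucket_arn: str) -> dict:
    """Build a read-only IAM policy document limited to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing objects is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading object contents is granted on the keys inside it.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "example-bucket" is a placeholder, not a value taken from this article.
policy = read_only_policy("arn:aws:s3:::example-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor in the IAM console and attached to the user instead of the managed policy.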

Amazon Resource Name - ARN

An Amazon Resource Name (ARN) is a unique identifier of an AWS resource. Dataedo uses it to connect to the selected S3 bucket. To find the ARN:

  1. Open the S3 service in the AWS Console,
  2. Open the bucket that contains the file(s) you want to document,
  3. Open the Properties tab,
  4. Copy the ARN value.
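
A bucket ARN in the standard AWS partition has the form `arn:aws:s3:::bucket-name`. Before pasting the value into Dataedo, it can help to check that what you copied is a bucket ARN and not an object ARN. The sketch below is an illustration only: the function name and the example bucket are assumptions, and it deliberately handles just the standard `aws` partition.

```python
def bucket_name_from_arn(arn: str) -> str:
    """Return the bucket name from an ARN such as arn:aws:s3:::my-bucket."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        raise ValueError(f"not an S3 ARN in the standard partition: {arn!r}")
    bucket = arn[len(prefix):]
    # An object ARN carries a key after the bucket name, separated by "/".
    if not bucket or "/" in bucket:
        raise ValueError(f"expected a bucket ARN, not an object ARN: {arn!r}")
    return bucket


# "example-bucket" is a placeholder, not a value taken from this article.
print(bucket_name_from_arn("arn:aws:s3:::example-bucket"))  # prints example-bucket
```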

Connecting Dataedo to Amazon S3

Dataedo provides two ways to document file(s) in an S3 bucket. You can either document an object stored in S3 as a structure in existing documentation or add new documentation.

Document an object stored in S3 as structure in existing documentation

Right-click Structures and select Add/Import File/Structure. In the window that opens, select Import from file.

Select the format of the file to import. If you select more than one file in the following steps, this format is used as the default, although you can still set the format for each file individually.

Select Amazon S3 as the provider.

In the Select file step, click the Connect button and provide the connection details for Amazon S3:

  • ARN - Amazon Resource Name that uniquely identifies the S3 bucket,
  • Access Key - access key ID assigned to the IAM user that Dataedo will use to connect to the S3 bucket,
  • Secret Key - secret access key paired with the Access Key.

How to obtain these connection details is described in the Prerequisites section. Click Next.
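
If the connection fails, it can be useful to check the same three values outside Dataedo. The following sketch lists a few object keys with boto3, the AWS SDK for Python, which is a third-party package and not part of Dataedo. The environment variable names `S3_BUCKET_ARN`, `S3_ACCESS_KEY`, and `S3_SECRET_KEY` and the helper `sample_keys` are assumptions made for this example, and this is not a description of how Dataedo itself connects.

```python
import os


def sample_keys(client, bucket: str, limit: int = 5) -> list[str]:
    """List up to `limit` object keys using a client that has list_objects_v2."""
    response = client.list_objects_v2(Bucket=bucket, MaxKeys=limit)
    # "Contents" is absent from the response when the bucket has no objects.
    return [obj["Key"] for obj in response.get("Contents", [])]


def main() -> None:
    import boto3  # third-party AWS SDK; install with `pip install boto3`

    client = boto3.client(
        "s3",
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    )
    # The bucket name is the last colon-separated field of a bucket ARN.
    bucket = os.environ["S3_BUCKET_ARN"].rsplit(":", 1)[-1]
    for key in sample_keys(client, bucket):
        print(key)


# Run the check only when all three (hypothetical) variables are set.
required = ("S3_BUCKET_ARN", "S3_ACCESS_KEY", "S3_SECRET_KEY")
if all(name in os.environ for name in required):
    main()
```

If this script prints object keys, the ARN and the key pair are valid and the user has at least list access to the bucket.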

In the next step, select one or more files to import.

If you selected only one file, Dataedo will try to read it and, if it succeeds, will open a window with the schema and fields where you can provide details for the structure.

For multiple files, Dataedo will try to detect the format of each file. If detection fails, you will see an error and will have to select the file type manually. You can also change the format of a file if it was recognized incorrectly.
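
To get a feel for which files are likely to need a manual format choice, you can run a rough check on the object keys yourself. The sketch below guesses a format from the file extension and falls back to a default, mirroring the default-format choice made earlier in the wizard. It is an illustration under stated assumptions, not Dataedo's actual detection logic: the extension table and function name are made up for this example, and Delta Lake is left out because a Delta table is a directory rather than a single file.

```python
from pathlib import PurePosixPath

# Extension-to-format table for the single-file formats listed in this article.
# The mapping itself is an assumption made for illustration.
FORMAT_BY_EXTENSION = {
    ".json": "JSON",
    ".csv": "CSV",
    ".avro": "Apache Avro",
    ".parquet": "Apache Parquet",
    ".orc": "Apache ORC",
    ".xlsx": "Microsoft Excel",
    ".xls": "Microsoft Excel",
    ".xml": "XML",
}


def guess_format(key: str, default: str) -> str:
    """Guess a format from the object key, falling back to the default."""
    suffix = PurePosixPath(key).suffix.lower()
    return FORMAT_BY_EXTENSION.get(suffix, default)


# Placeholder object keys; the last one has no extension and gets the default.
keys = ["sales/2022/orders.csv", "events/part-0001.parquet", "exports/report"]
for key in keys:
    print(key, "->", guess_format(key, default="JSON"))
```

Keys that fall through to the default, such as the extensionless one here, are the ones most worth double-checking in the file list.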

Add new connection to S3 bucket

To connect to S3 and create new documentation, click Add documentation and choose Database connection.

In the Add documentation window, choose Amazon S3.

Provide the connection details for Amazon S3:

  • ARN - Amazon Resource Name that uniquely identifies the S3 bucket,
  • Access Key - access key ID assigned to the IAM user that Dataedo will use to connect to the S3 bucket,
  • Secret Key - secret access key paired with the Access Key.

How to obtain these connection details is described in the Prerequisites section. Click Next.

The next screen allows you to change the name under which the documentation will be visible in the Dataedo repository.

Select the format of the file to import. If you select more than one file in the following steps, this format is used as the default, although you can still set the format for each file individually.

In the next step, select one or more files to import.

If you selected only one file, Dataedo will try to read it and, if it succeeds, will open a window with the schema and fields where you can provide details for the structure.

For multiple files, Dataedo will try to detect the format of each file. If detection fails, you will see an error and will have to select the file type manually. You can also change the format of a file if it was recognized incorrectly.

Outcome

Your S3 objects have been imported to the repository.

Data profiling

Dataedo does not support profiling objects stored in Amazon S3.
