Added a PDF illustrating the ER diagram for the database schema; it is placed under the fundamentals_task folder
All the queries for the problem statements mentioned in task 2 are added as separate SQL files under the basicQueries_task folder
The solution query for the dataset creation problem statement is added to the SQL file under the datasets_task folder
Please copy the following contents of the dags folder to <container_id>:/usr/local/airflow/dags:
- key folder - contains the public key used for encryption
- sql folder - contains the queries used in the DAGs for tasks 4 and 5
- task_4_5_dag.py file
In the dags folder you will find a DAG file named task_4_5_dag.py which, on execution, picks up the SQL from dags/sql/task_4.sql and performs the following tasks (a sketch of the SQL is shown after this list):
- Drop the table TASK_3_DS if it exists
- Execute the SELECT statement and generate the dataset
- Create the table TASK_3_DS from the generated dataset
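For orientation, here is a minimal sketch of the shape that SQL takes; the source table and columns below are hypothetical placeholders, and the actual query is the one in dags/sql/task_4.sql:

```sql
-- Sketch only: the real query lives in dags/sql/task_4.sql.
-- source_table and its columns are hypothetical placeholders.
DROP TABLE IF EXISTS TASK_3_DS;

-- CREATE TABLE AS runs the SELECT and materializes its result as TASK_3_DS.
CREATE TABLE TASK_3_DS AS
SELECT st.id,     -- assumed columns for illustration
       st.name,
       st.email
FROM   source_table st;
```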
I had the option of doing this task in two ways: symmetric encryption or asymmetric encryption. Because asymmetric encryption is more secure, I used it for the email column as required in task 5. Unlike symmetric encryption, where encryption and decryption use the same key, asymmetric encryption encrypts and decrypts the data using two separate yet mathematically connected cryptographic keys. I achieved this using the PGP_PUB_ENCRYPT and PGP_PUB_DECRYPT functions from the pgcrypto extension in Postgres.
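As an illustration of how pgp_pub_encrypt can be applied to the email column, here is a sketch with assumed table/column names; <public_key> stands for the ASCII-armored public key from the key folder:

```sql
-- Enable pgcrypto (once per database).
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Encrypt the email with the public key; armor() turns the bytea
-- ciphertext into text so it can be stored in a text column.
-- task_3_ds / email are assumed names for illustration.
SELECT armor(pgp_pub_encrypt(email, dearmor('<public_key>'))) AS encrypted_email
FROM   task_3_ds;
```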
For the purpose of this assignment I generated the key pair (public/private) on my end. Ideally, the client/data owner would generate it and share only the public key for encrypting the data, so that it can be decrypted with the private key they already hold. This way it is more secure, and there is no need to risk security by sharing passwords.
Steps to decrypt the encrypted email:
- Open decryption_query.sql placed under the pipeline_encryption_task folder
- Unzip the attached secret.zip and use the contents of secret.key for decryption
- Replace the <private_key> placeholder in the SQL with the contents of the secret.key file and execute the query (a sketch of its general shape is shown below)
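The decryption query follows this general shape; the table/column names here are assumptions, and the actual statement is the one in decryption_query.sql:

```sql
-- Sketch of the decryption step: dearmor() reverses the ASCII armor and
-- pgp_pub_decrypt() decrypts with the matching private key.
-- task_3_ds / encrypted_email are assumed names for illustration.
SELECT pgp_pub_decrypt(dearmor(encrypted_email), dearmor('<private_key>')) AS email
FROM   task_3_ds;
```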