27 Mar 2023 - Ravi Pandit
Figure 1: Architecture Diagram
2. Dataset:
For this project, the COVID-19 dataset provided by AWS will be used. The structure of the dataset and related information can be studied through this blog:
Only selected datasets are downloaded, as shown in the figure:
Figure 2: Dataset
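To fetch the selected files programmatically, a sketch like the following could be used, assuming AWS credentials are configured and the data comes from the public AWS COVID-19 data lake bucket; the bucket name and object keys below are assumptions for illustration, so replace them with the keys of the datasets shown in Figure 2.
import boto3

s3 = boto3.client("s3")

# Hypothetical source bucket and object keys; use the actual keys of the
# datasets selected in Figure 2.
SOURCE_BUCKET = "covid19-lake"
datasets = {
    "enigma_jhud.csv": "enigma-jhu/csv/enigma_jhud.csv",
    "us_county.csv": "rearc-covid-19-nyt-data-in-usa/csv/us-counties/us-counties.csv",
}

for local_name, key in datasets.items():
    s3.download_file(SOURCE_BUCKET, key, local_name)
    print(f"Downloaded s3://{SOURCE_BUCKET}/{key} -> {local_name}")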
3. Storing the downloaded Dataset in Amazon S3:
Figure 3: Storing the downloaded Dataset in Amazon S3
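Assuming the files were downloaded locally as above, a minimal boto3 sketch for uploading them into a project bucket could look like the following; the bucket name and prefixes are hypothetical.
import boto3

s3 = boto3.client("s3")

PROJECT_BUCKET = "my-covid19-project-bucket"  # hypothetical; use your own bucket

# Keep each dataset under its own prefix so that a crawler can infer one
# table (schema) per dataset later on.
for file_name in ["enigma_jhud.csv", "us_county.csv"]:
    key = f"{file_name.rsplit('.', 1)[0]}/{file_name}"
    s3.upload_file(file_name, PROJECT_BUCKET, key)
    print(f"Uploaded {file_name} to s3://{PROJECT_BUCKET}/{key}")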
4. Analyzing the data (Schema Crawling and Table Building):
A crawler is created for each CSV data file; the crawler crawls the datasets stored in S3 and extracts their schema and metadata.
Figure 4: Crawler
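Crawlers can be created from the Glue console; as a rough equivalent, a sketch using the boto3 Glue API is shown below, where the crawler name, IAM role ARN, database name, and S3 path are all hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="covid19-enigma-jhud-crawler",                     # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role ARN
    DatabaseName="covid19_db",                              # Glue catalog database
    Targets={"S3Targets": [{"Path": "s3://my-covid19-project-bucket/enigma_jhud/"}]},
)

# Start the crawler; when it finishes, a table with the inferred schema appears
# in the Glue Data Catalog and becomes queryable from Athena.
glue.start_crawler(Name="covid19-enigma-jhud-crawler")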
After the crawler tasks are completed, Amazon Athena is used to analyze (study) each dataset extracted by the crawler.
Figure 5: Amazon Athena
5. Building Data Models:
After analyzing the data, a data model is created to understand the overall flow of the data.
Here, fips is the primary key in most of the tables, which can also be verified from the dataset's data source.
6. Schema:
Here, the fact table is factCovid and there are three dimension tables (i.e., dimHospital, dimRegion, dimDate). Much of the information is redundant across the tables in the data model above, so a star schema is created: analyzing data in a data warehouse is simpler when there are fewer tables.
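As an illustration only (the source table and column names below are assumptions, not the project's exact ones), the fact and dimension DataFrames could be assembled from the crawled tables roughly like this:
import pandas as pd

def build_star_schema(enigma_jhud: pd.DataFrame,
                      nyt_us_county: pd.DataFrame,
                      hospital_beds: pd.DataFrame):
    # Fact table: case metrics keyed by fips (columns are illustrative).
    factCovid = enigma_jhud[["fips", "province_state", "country_region",
                             "confirmed", "deaths", "recovered", "active"]].merge(
        nyt_us_county[["fips", "date", "cases"]], on="fips", how="inner")

    # Dimension tables: descriptive attributes keyed by fips (and date).
    dimRegion = enigma_jhud[["fips", "province_state", "country_region",
                             "latitude", "longitude"]].drop_duplicates()
    dimHospital = hospital_beds[["fips", "state_name", "hospital_name",
                                 "hospital_type", "hq_city"]].drop_duplicates()
    dimDate = nyt_us_county[["fips", "date"]].drop_duplicates()
    return factCovid, dimRegion, dimHospital, dimDate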
7. Use of Jupyter Notebook:
import boto3              # AWS SDK for Python, used to talk to Athena and S3
import pandas as pd       # DataFrames for analyzing and reshaping query results
from io import StringIO   # in-memory text buffer, used later to write CSVs to S3
This function takes a boto3 object and a dictionary, runs the query in Athena, and stores the output in S3 (i.e., in the staging_dir).
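A minimal sketch of such a helper, with hypothetical bucket, staging directory, database, and table names, might look like this:
import time
import boto3
import pandas as pd

S3_BUCKET_NAME = "my-covid19-project-bucket"      # hypothetical bucket
S3_OUTPUT_DIRECTORY = "athena-output"             # hypothetical staging prefix
S3_STAGING_DIR = f"s3://{S3_BUCKET_NAME}/{S3_OUTPUT_DIRECTORY}/"

def download_and_load_query_results(client, query_response):
    # Poll until the Athena query reaches a terminal state.
    while True:
        state = client.get_query_execution(
            QueryExecutionId=query_response["QueryExecutionId"]
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Athena writes results as <QueryExecutionId>.csv under the staging directory.
    temp_file = "athena_query_results.csv"
    boto3.client("s3").download_file(
        S3_BUCKET_NAME,
        f"{S3_OUTPUT_DIRECTORY}/{query_response['QueryExecutionId']}.csv",
        temp_file,
    )
    return pd.read_csv(temp_file)

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM enigma_jhud",               # table created by the crawler
    QueryExecutionContext={"Database": "covid19_db"},      # hypothetical database name
    ResultConfiguration={"OutputLocation": S3_STAGING_DIR},
)
enigma_jhud = download_and_load_query_results(athena, response)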
Similarly, this process is repeated for the other nine tables created using the crawler shown in Figure 4.
Changing the date dimension table (dimDate) properties as explained in the schema figure:
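A sketch of those transformations, assuming dimDate carries a date column, could be:
import pandas as pd

def add_date_properties(dimDate: pd.DataFrame) -> pd.DataFrame:
    # Assumed column name "date"; derive the properties shown in the schema figure.
    dimDate = dimDate.copy()
    dimDate["date"] = pd.to_datetime(dimDate["date"])
    dimDate["year"] = dimDate["date"].dt.year
    dimDate["month"] = dimDate["date"].dt.month
    dimDate["day_of_week"] = dimDate["date"].dt.dayofweek   # Monday=0 ... Sunday=6
    return dimDate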
Here, specify the S3 bucket name in place of bucket.
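One way to write each DataFrame to S3 as a CSV, using the StringIO buffer imported earlier (the helper name and output keys are hypothetical), is:
import boto3
from io import StringIO

bucket = "my-covid19-project-bucket"   # replace with your own S3 bucket name

def save_df_to_s3(df, bucket, key):
    # Serialize the DataFrame to CSV in memory, then upload it as an S3 object.
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    boto3.resource("s3").Object(bucket, key).put(Body=csv_buffer.getvalue())

# e.g. save_df_to_s3(factCovid, bucket, "output/factCovid.csv")
#      save_df_to_s3(dimDate, bucket, "output/dimDate.csv")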
Similarly, this process is repeated for the other nine tables created using the crawler shown in Figure 4.
Note: These schemas will be needed while creating the tables in Amazon Redshift.
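One way to capture such schemas from the DataFrames is pandas' get_schema helper; a small sketch follows, with example table names only.
import pandas as pd

def print_table_ddl(df: pd.DataFrame, table_name: str) -> None:
    # Print a CREATE TABLE statement inferred from the DataFrame's dtypes;
    # this DDL can be adapted for the Redshift CREATE TABLE statements later.
    print(pd.io.sql.get_schema(df.reset_index(), table_name))

# e.g. print_table_ddl(factCovid, "factCovid")
#      print_table_ddl(dimDate, "dimDate")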
8. Creating a Redshift Namespace (Cluster)
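The namespace/cluster can be created from the Redshift console; purely as an illustration, a provisioned cluster could also be created with boto3, where all identifiers and credentials below are placeholders.
import boto3

redshift = boto3.client("redshift")

redshift.create_cluster(
    ClusterIdentifier="covid19-redshift-cluster",      # placeholder identifier
    NodeType="dc2.large",
    ClusterType="single-node",
    DBName="covid19_dwh",                              # placeholder database name
    MasterUsername="awsuser",
    MasterUserPassword="ReplaceWithAStrongPassword1",  # placeholder credential
)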
9. AWS Glue Job
In AWS, Glue jobs are used for the ETL (Extract, Transform, Load) process of moving data from one location to another. AWS Glue is a fully managed ETL service that makes it easy to move data between different data stores and data lakes.
AWS Glue jobs run ETL scripts and perform data transformations on data stored in various data sources such as Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, and more. Glue jobs can be written in Python or Scala, and they run on a serverless infrastructure, which means they can scale automatically to handle large volumes of data.
Tables are created in Amazon Redshift with the CREATE TABLE command, and the COPY command is used to load the dimension tables and the fact table from S3 into Redshift. This entire process is done by creating a job in AWS Glue.
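As a rough sketch of what such a Glue (Python shell) job could run, using the redshift_connector library: the connection details, table definition, bucket, and IAM role ARN below are all placeholders, and the real DDL comes from the schemas generated in section 7.
import redshift_connector

conn = redshift_connector.connect(
    host="covid19-redshift-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    database="covid19_dwh",
    user="awsuser",
    password="ReplaceWithAStrongPassword1",
)
conn.autocommit = True
cursor = conn.cursor()

# Create one of the dimension tables (columns shown are illustrative only).
cursor.execute("""
    CREATE TABLE IF NOT EXISTS dimDate (
        fips INTEGER,
        "date" TIMESTAMP,
        year INTEGER,
        month INTEGER,
        day_of_week INTEGER
    );
""")

# Load the CSV written to S3 earlier into the table.
cursor.execute("""
    COPY dimDate FROM 's3://my-covid19-project-bucket/output/dimDate.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3AccessRole'
    CSV IGNOREHEADER 1;
""")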
Once the AWS Glue job's task is completed, the data can be queried and viewed in the Redshift query editor, as shown in the figure below:
Finally, this data can be used by data analysts and data scientists to derive insights and support decision making.
Code for Nepal would like to thank DataCamp Donates for providing Ravi, and several other fellows access to DataCamp, to learn and grow.