In Search of Happiness: A Quick ETL Use Case with AWS Glue + Redshift
**Note:** The development team also included Andres Palavicini and Luis Diego Carvajal.
In today’s world, AWS is becoming an essential development skill. If you already know a little about it, you can identify its biggest advantage: you focus on what’s important for your business while AWS takes care of where and how it runs. It’s as simple as that; you can host virtually any service you may need within AWS’s cloud service catalog. That is exactly what you often need when it comes to analytics, BI and the Big Data world: lots of clustered servers running scripts to transform huge data sets so you can process and visualize the data. Can you imagine running long, heavy ETL jobs on infrastructure you never have to worry about?
So, when we talk about Extract, Transform and Load (ETL) jobs, what service does AWS offer? Glue is the answer to your prayers.
AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console; after that, you simply point Glue to your data stored on AWS, and it stores the associated metadata (e.g. table definition and schema) in the Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes (as per the AWS Glue homepage).
A Gorilla Logic team took up the challenge of using, testing and gathering knowledge about Glue to share with the world. Here is what we stated as our use case:
What makes YOU happy vs. what Happy Planet Index says about your country
We started thinking: what is the relation between what makes people happy in their day-to-day activities and each country’s Happy Planet Index (HPI) regarding different topics?
Are the topics people are happier about in their daily life related to high-ranked topics in the HPI?
To answer this, we grouped our use case into the following phases:
- We got the Happiness Comments database (HappyDB) from Kaggle. It comes as CSV files, so in order to explore different types of data sources, we loaded the demographics data into a PostgreSQL DB; that way we could test both S3 and database sources during the later crawling.
- We went on to get the HPI data at http://happyplanetindex.org/countries. This is an Excel spreadsheet that we converted to CSV for consistent file typing.
- The data was set! Our next step was to crawl all of the data into AWS Glue catalogues.
- After the crawling was done, we created a Python script for transforming and loading the resulting data into a Redshift cluster.
- Finally, we queried the data from Redshift and explored the relations between what people said made them happy versus what the HPI says about their respective countries.
Kaggle’s Happiness Comments database
The first dataset we got was Kaggle’s Happiness Comments database. Each person listed in the database had been given the following prompt to respond to: “What made you happy today? Reflect on the past 24 hours, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence” (taken from Kaggle’s website). There are two files: cleaned_hm.csv (which contains all of the comments) and demographic.csv (which links every comment to the nationality of the person who expressed it, among other characteristics). In order to test both types of sources, we loaded the demographic.csv data into a PostgreSQL database for later use, and uploaded cleaned_hm.csv into an S3 bucket.
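For reference, staging the two files can also be scripted. The sketch below uses boto3 and pandas; the bucket name, file paths, table name and PostgreSQL connection string are all assumptions, not the exact values we used.

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Upload the comments file to S3 (bucket name and key are placeholders).
s3 = boto3.client("s3")
s3.upload_file("cleaned_hm.csv", "happiness-etl-demo", "happy_comments/cleaned_hm.csv")

# Load the demographics file into PostgreSQL (connection string and table name are placeholders).
engine = create_engine("postgresql://glue_user:glue_password@my-postgres-host:5432/happydb")
demographics = pd.read_csv("demographic.csv")
demographics.to_sql("demographic", engine, if_exists="replace", index=False)
```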
HPI data
As per the Happy Planet Index site, “The Happy Planet Index measures what matters: sustainable wellbeing for all. It tells us how well nations are doing at achieving long, happy, sustainable lives.” We took their Excel sheet with all of the HPI data and converted it into a CSV format for consistent file typing. We then uploaded this CSV file into an S3 bucket for later use.
The crawling process
Glue brings together the whole set of processes needed to design, build and maintain ETL jobs, all in one place. A typical Glue workflow illustrates this nicely:
We had already covered the first three steps of this workflow, so after getting the data into its repositories (S3 and PostgreSQL), the next step was to crawl it into the Glue catalogues. But before that, we needed to create the connections by going to Databases -> Connections, clicking on “Add connection” and following the wizard:
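If you prefer scripting over the console, the same kind of connection can be registered with boto3. This is only a sketch: the connection name, JDBC URL, credentials, subnet and security group below are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register a JDBC connection to the PostgreSQL database holding the demographics data.
# All values below (name, URL, credentials, subnet, security group) are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "happy-demographics-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-postgres-host:5432/happydb",
            "USERNAME": "glue_user",
            "PASSWORD": "glue_password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        },
    }
)
```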
The crawling process was done through the Crawlers menu:
At this point we had set up the following crawlers: HPI for reading the HPI file, Happy_Comments for reading the CSV file with the comments, Happy_Demographics for loading the PostgreSQL data, plus an additional Redshift data source for reading all of the data back at the end. There is also a wizard to set this up, which is very easy to follow. An important thing here is to make sure you use the correct IAM role when creating each crawler.
Once a crawler is set up, run it. It should create the database you specified during the wizard, and it automatically detects the tables and their data types.
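The console wizard is the easiest path, but for completeness here is a sketch of the equivalent boto3 calls; the crawler name, IAM role ARN, catalogue database and S3 path are assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler over the S3 prefix holding the comments CSV.
# The role ARN, database name and S3 path are placeholders.
glue.create_crawler(
    Name="happy_comments_crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="happiness",
    Targets={"S3Targets": [{"Path": "s3://happiness-etl-demo/happy_comments/"}]},
)

# Run it; Glue infers the table schema and data types automatically.
glue.start_crawler(Name="happy_comments_crawler")
```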
ETL script
Once the catalogue was defined and full of enough data, it was time to create the magic behind the data!
Under ETL -> Jobs, we were able to create the jobs that were going to consume the data from the catalogues. As with everything here, there is a wizard that helps you create a code template or add a code snippet to access a catalogue.
You can then add the data sources to use them for reading and writing purposes. An example of the code we used is as follows:
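If you want to build something similar yourself, a minimal sketch of a Glue PySpark job with this shape could look like the following. All database, table, column and connection names (happiness, cleaned_hm, demographic, hpi, wid, country, redshift-conn, the temp S3 path) are assumptions for illustration, not the exact values from our job.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled tables from the Glue catalogue.
# Database and table names are placeholders based on the crawlers described above.
comments = glueContext.create_dynamic_frame.from_catalog(
    database="happiness", table_name="cleaned_hm"
)
demographics = glueContext.create_dynamic_frame.from_catalog(
    database="happiness", table_name="demographic"
)
hpi = glueContext.create_dynamic_frame.from_catalog(
    database="happiness", table_name="hpi"
)

# Join comments to demographics on the worker id, then to the HPI data on country.
# Column names are assumptions.
joined = Join.apply(comments, demographics, "wid", "wid")
joined = Join.apply(joined, hpi, "country", "country")

# Write the result to Redshift through a catalogued JDBC connection.
# The connection name, target table and temporary S3 directory are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=joined,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "happiness_vs_hpi", "database": "analytics"},
    redshift_tmp_dir="s3://happiness-etl-demo/temp/",
)

job.commit()
```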
Glue offers its own set of classes for optimized data processing. They are built on top of Apache Spark (and the broader Hadoop ecosystem), so any big-data code you may already have in Spark or Hadoop can be easily adapted here, and even improved, by using the Glue classes.
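As a quick illustration of that interoperability, a Glue DynamicFrame can be turned into a plain Spark DataFrame and back, so existing Spark transformations drop right in. This continues from the sketch above; the country value and names are assumptions.

```python
from awsglue.dynamicframe import DynamicFrame

# Convert the Glue DynamicFrame into a regular Spark DataFrame to reuse existing Spark code.
joined_df = joined.toDF()
usa_df = joined_df.filter(joined_df["country"] == "United States of America")

# Convert back to a DynamicFrame to keep using Glue's writers and transforms.
usa_dynamic = DynamicFrame.fromDF(usa_df, glueContext, "usa_dynamic")
```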
Time to see the numbers!
We finally joined all of the data and wrote it to Redshift, so now we can query it and see which topics show a correlation. As an example, we looked at “ecological footprint” and “Gross Domestic Product (GDP) per capita” in the United States; two simple queries gave us the following results:
Query for GDP per capita:
Query for ecological footprint:
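To reproduce something similar from Python, a sketch with psycopg2 could look like this; the cluster endpoint, credentials, table and column names are all assumptions rather than our actual schema.

```python
import psycopg2

# Connection parameters and table/column names below are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="password",
)

with conn.cursor() as cur:
    # GDP per capita for the United States from the loaded HPI data.
    cur.execute(
        "SELECT country, gdp_per_capita FROM happiness_vs_hpi "
        "WHERE country = 'United States of America' LIMIT 1;"
    )
    print(cur.fetchone())

    # Ecological footprint for the same country.
    cur.execute(
        "SELECT country, ecological_footprint FROM happiness_vs_hpi "
        "WHERE country = 'United States of America' LIMIT 1;"
    )
    print(cur.fetchone())

conn.close()
```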
A quick analysis tells us two things: the USA does very well in average income (8th highest in the world) but can improve a lot in its ecological footprint (136th position worldwide). In addition, when comparing this data with the comments from people in this country, we can see that “achievement” was a strong category in what U.S. Americans said makes them happy; this is usually related to job promotions, business activities and personal goals reached, all of which can contribute to a very good GDP per capita.
On the other hand, U.S. Americans also indicated that “nature” does not make them as happy as many other topics, a fact that is reflected in their country’s low HPI ranking (136th) on this specific topic.
Conclusion
This little experiment showed us how easy, fast and scalable it is to crawl, merge and write data for ETL processes using Glue, a very good service provided by Amazon Web Services. Glue is able to discover a data set’s structure, load it into its catalogue with the proper typing, and make it available for processing with Python or Scala jobs. It also uses the Apache Spark libraries and its own Glue API to make the transformation process very robust. Furthermore, it gives you the ability to run such scripts and automatically scale out or in depending on the performance needed, so you don’t need to worry about infrastructure.
Glue works perfectly with Hadoop projects, and you can easily import any project that uses Spark into it. Finally, this service also computes the transformed data in a distributed fashion, just as Hadoop does, so any Hadoop consumer can easily pick up the different parts of the resulting data and use them at the developer’s convenience (hint: you can also use Glue for this and keep everything together).