SQL for Data Engineering

Are you a data engineer or seeking to become one? This blog post is for you. This is the first entry of a series of articles about skills you’ll need in your everyday life as a data engineer. So let’s begin with the first and, in my opinion, the most useful tool in your technical tool belt, SQL.

Listed as one of the top technologies on data engineer job listings in January 2020, Structured Query Language (SQL) is a standardized programming language used to manage relational databases (not exclusively) and perform various operations on the data in them. In our case, we would use it to communicate with the data storage system (which could be a large number of systems like data lakes, databases, or data warehouses) for, mainly, storing and querying data.

Basic SQL

Data cleansing and enrichment processes need to combine, filter, aggregate, and select different sets to answer questions we have. That translates into knowing how to perform different types of JOINs and WHERE clauses to simple SELECT statements. First and foremost, we need to get up to speed with JOINs.

Let’s imagine for a minute that we have customers and orders tables, which are related by the customers_id field:

Join types

INNER JOIN

You want all orders but also want to know the name of the customer who ordered.

 

SELECT a.order_id
     , a.customer_id
     , b.customer_name
FROM example.orders AS a
INNER JOIN example.customers AS b
   ON a.customer_id = b.customer_id;

LEFT “OUTER” JOIN

In this case, you want all orders, even if they have a customer associated or not (let’s not dig into data integrity issues here for the sake of simplicity of this example).

 
SELECT o.order_id
     , o.customer_id
     , c.customer_name
FROM example.orders AS o
LEFT [OUTER] JOIN example.customers AS c
   ON o.customer_id = c.customer_id;

Some database engines do not require you to specify OUTER.

RIGHT “OUTER” JOIN

All customers with or without any orders.

 
SELECT o.order_id
     , o.customer_id
     , c.customer_name
FROM example.orders AS o
RIGHT [OUTER] JOIN example.customers AS c
   ON o.customer_id = c.customer_id;

Order is important for LEFT and RIGHT join. This query will bring us all example.customers rows and whatever matches with example.orders.

LEFT ANTI SEMI JOIN and LEFT OUTER JOIN (with exclusion)

Let’s say you want all customers who haven’t ordered anything yet.

 
SELECT c.customer_id
     , c.customer_name
FROM example.customers AS c
WHERE NOT EXISTS (SELECT o.customer_id
                  FROM example.orders AS o
                  WHERE c.customer_id=o.customer_id)</pre>
<pre class="lang:javascript decode:true">SELECT c.customer_id
     , c.customer_name
FROM example.customers AS c
LEFT [OUTER] JOIN example.orders AS o
   ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

Always do performance tests on your queries and subqueries. Join performance depends on table structure, indexes, and generated query plans.

CROSS JOIN

Cross join returns the Cartesian product of both tables, meaning it will bring all possible combinations between all the elements of both sets.

 
SELECT c.customer_id
	, c.customer_name
	, o.order_id
FROM example.orders AS o
CROSS JOIN example.customers AS c

There are other ways to combine datasets, but both query parts need to have the same structure:

If you want to change granularity levels on your data and also summarize it, go here and here to start your learning path.

Many core concepts can be learned from websites like W3C. If you want to challenge yourself with practical exercises, you can take a look here too.

Advanced SQL

There are a lot of interesting functions to transform and add properties to your dataset and take advantage of our DBMS capabilities.

Window functions

Window functions are very useful if you want to run a calculation on a set of rows that are related in some way (ie. you want a running total of the order total price for a customer using order_date as the reference column for time).

SELECT a.order_id
	, a.customer_id
	, SUM(a.order_total) OVER (PARTITION BY a.customer_id ORDER BY a.order_date) AS running_order_total_price
FROM example.orders AS a;

Another example could be if you want a separate column that indicates the date of the customer’s previous order.

SELECT a.order_id
	, a.customer_id
	, a.order_date
	, LAG(a.order_date, 1) OVER (PARTITION BY a.customer_id ORDER BY a.order_date) AS previous_order_date
FROM example.orders AS a;

These are just two basic examples of what you can do with window functions. You can find a good jump start here and here, or, if you want something deeper, check this out.

With SQL, you can also work with complex data types like arrays and JSON objects. You’ll want to take into account that syntaxis and implementations of these functions may vary between different database engines or big data environments, such as SQL Server, Oracle, PostgreSQL, MySQL, Redshift, Hive, BigQuery, and so on.

CTE (Common Table Expression)

One great feature you can find in many SQL systems is CTEs (Common Table Expressions). This is very useful if you want to simplify complex subqueries and joins like recursive relationships grouped on the same dataset (think hierarchical data stored on a table related to itself).

Basically, CTE is a temporary result set that you can reference with another SELECT, INSERT, UPDATE, or DELETE statement. You can significantly increase readability and simplification by deconstructing queries into simple blocks and reusing them. Let me show you what I am talking about.

I want to extract some columns from the customers table from customers who had orders in 2021, with discounts equal to or greater than $100, and who have a registered address in the U.S.

Query in ordinary form:

SELECT DISTINCT c.customer_name
	, c.customer_id
	, c.customer_address
	, c.customer_phone
FROM example.orders AS o
INNER JOIN example.customer AS c
    ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN DATE '2021-01-01' AND DATE '2021-12-31'
AND (c.customer_address LIKE '%US%'
    OR c.customer_address LIKE '%USA%'
    OR c.customer_address LIKE '%United States%')
AND o.order_discount >= 100

Formatted query:

WITH customers_in_USA AS 
(
	SELECT c.customer_name
		, c.customer_id
		, c.customer_address
		, c.customer_phone
	FROM example.customer AS c
      WHERE c.customer_address LIKE '%US%'
      OR c.customer_address LIKE '%USA%'
      OR c.customer_address LIKE '%United States%'
),
orders_from_2021 AS 
(
      SELECT DISTINCT customer_id	
      FROM example.orders AS o
      WHERE o.order_date BETWEEN DATE '2021-01-01' AND DATE '2021-12-31'
      AND o.order_discount >= 100
)

SELECT a.customer_name
	, a.customer_id
	, a.customer_address
	, a.customer_phone
FROM customers_in_USA AS a
INNER JOIN orders_from_2021 AS b
    ON a.customer_id = b.customer_id;

In case you want to create a recursive query (a query that references itself), CTEs can be a real ally in this battle. Let’s say we have a reference program for our customers that is actually stored on our customer’s table, but we want to know who referenced who…well, let’s take a look:

WITH referred_customers
( 
      SELECT customer_id
	     , customer_name
	     , referred_by_id
	     , CAST(NULL AS STRING) AS referred_by_name
      FROM example.customers
	WHERE referred_by_id IS NULL
	
      UNION ALL

      SELECT c.customer_id
	     , c.customer_name
	     , c.referred_by_id
	     , r.customer_name AS referred_by_name
      FROM example.customers AS c
      INNER JOIN referred_customers AS r
	   ON c.referred_by_id = r.customer_id
)

SELECT customer_id
	, customer_name
	, referred_by_id
      , referred_by_name
FROM referred_customers;

Want to learn more about CTEs? Please check here, here, and here.

There are plenty of things that we can call Advanced SQL, but here is a brief list to consider:

Efficient Queries

After you have mastered a lot of SQL concepts and functions, you might like to write efficient queries to have great response times and make yourself proud while deploying your code on production environments.

This may not always be easy, but with time and patience, you can become proficient. Here are some recommendations to help you get started:

Always remember, your query performance will depend on table structures, indexes, data types, and query plans (all SQL systems have query optimizers, but behavior may vary from vendor to vendor).

I hope this post gives you an idea about where to start and the key elements behind writing queries on SQL, as this will become your bread and butter as a data engineer. My intention is to give you a glance of what you would need and tear down the first technical barrier to your learning experience. Thanks for reading and see you on the next one.