U-SQL Tutorial: A Practical Approach for Beginners (Part 1)

U-SQL is a query language that combines a declarative SQL-like syntax with C# programming. You can use U-SQL to process both structured and unstructured data in big data environments with Microsoft technologies. This U-SQL tutorial will be the first of a series of beginner tutorials, designed to introduce the basics of U-SQL in a local (free) sandbox environment that allows you to query all your data with a single language, where each data store maintains its own sovereignty, and to use design choices based on the need.

What Is U-SQL and What Are Its Advantages?

According to Microsoft, U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale with seamless integration with .NET code and with query capabilities for different data sources, so you can access the data in its current state and form without treatments. U-SQL uses familiar SQL concepts and language to scale out your custom code (.NET/C#/Python) from gigabyte to petabyte scale, offering the usual big data processing concepts such as “schema on reads,” custom processors, and reducers. The language lets you query and combine data from multiple data sources including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running on Azure VMs.

U-SQL offers many advantages, including:

1. Seamless integration: powerful extensibility of SQL capabilities with your own C# code.

2. Ability to query and merge data from a variety of distinct data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.

3 Elasticity: Scales easily and efficiently to any size data without requiring major changes  for parallel executions, multi-node optimizations, and scale-out topologies required by other big data languages like Hive.

4. Allows you to query data where it lives (instead of one single repository/location). And seamless integration with Azure SQL DB/DW and SQL Server in a VM, using federated queries “pushed down” to the data source and executed on that data source, with only the results being returned.

U-SQL and SQL: Similar, but With Differences

Most developers with a common SQL background will find U-SQL queries very similar to SQL. Fundamental concepts and some expressions of syntax are common between the languages. Nevertheless, U-SQL is a different language and some expectations from SQL will not be carried over to U-SQL.

How Can You Work With U-SQL?

You can run U-SQL scripts in one of several ways:

1. You can create your own local environment for U-SQL. Azure Data Lake Tools for Visual Studio is required, plus configuration of a local sandbox for runs (Free approach).

2. In Azure using Azure Data Lake Analytics account, this is called “U-SQL Cloud Execution” (note that this is a paid approach).

Getting Started: Your First U-SQL Script

Note: Before continuing, make sure ADLToolsForVS is installed. To verify ADLToolsForVS is installed, go to the Tools menu. If you see an item in that menu called Data Lake then ADLToolsForVS is installed.

Our Sample Script: For our tutorial, we’ll create a simple script to extract data from our input (csv file) using U-SQL extractors. Then we will calculate the average salary by state. Keep in mind, all data used in this tutorial is not real. When doing this project on your own, be sure to have accurate data before starting your project.

1. In Visual Studio, select from menu:  File > New > Project > U-SQL Project.

2. Once the project is created, an empty U-SQL script called “Script.usql” is created as well.

3. Navigate to the USQLDataRoot folder and create a folder called InputFiles. Copy the DataInput.csv file there. It can then be referenced by the U-SQL script.

4. Paste the following code in your script editor:

//Extract values from input
@employeedetails = EXTRACT EmployeeID int,
                      EmployeeName string,
                      State string,
                      Salary decimal,
                      JoiningYear int,
                      Title string
FROM "/InputFiles/DataInput.csv"
//Use in-built CSV extractor to read the CSV file and skip the first row
USING Extractors.Csv(skipFirstNRows:1);
//Query for calculating average on Salary field
@average = SELECT State,
AVG(Salary) AS AverageSalary  
FROM @employeedetails
GROUP BY State;
//specify output file and write the headers to the output file
OUTPUT @average TO "/OutputFiles/AverageSalaryResults.csv"
USING Outputters.Csv(outputHeader:true);

Understanding Your First U-SQL Script, Step-By-Step

Inputs and Outputs

U-SQL scripts transform inputs into outputs. There are different kinds of inputs and outputs but ultimately they resolve to one of two things:

• Files

• U-SQL tables

Inputs and outputs rely and depend on the running environment for the script. During U-SQL Cloud Execution, the inputs/outputs must all be in the cloud. Typically this means Azure Data Lake Store.

For U-SQL Local Execution, the inputs/outputs must all be on your own sandbox. There’s a special name for this location: The U-SQL Local Data Root. This is a local file system directory that can be found using Visual Studio Menu Options: Tools > Data Lake > Options and Settings. It is in the field called DataRoot at the top.

Execute the Script

1. Extract the data from the csv file and then perform the required transformations on the data. Once transformations are complete, the results are finally written to the output file. For extraction of the data, U-SQL provides several different extractors depending on the file type. Here, the input file is in csv format, so a csv extractor is required. To extract the data, supporting C# types are used depending on the type of data elements.

2. To collect the extracted values, the script uses a row-set variable @employeedetails. You might notice that the naming convention for variable declaration is similar to T-SQL naming convention with the @ sign. A row-set variable @employeedetails will store the extracted values.

3. A CSV extractor is used, and to help the extractor to detect that row headers are present on the first row of the file you have to set skipFirstNRows to 1.

4. For transformation, the SQL average function calculates the state-wide average. The results will then be collected into the @average variable.

5. The contents are written out to AverageSalaryResults.csv that will be created in the OutputFiles folder.

6. Outputters.Csv is specified with the formatting using the Outputter command:

7. Finally, after the script runs, you will see the compile summary shown below.

8. By running with the Submit button, you will see detailed information about the job. For instance, the compile view will show you how the data has been processed from step1 to step n. Also, it shows the compile time details and easy to navigate to script options.

The output data should be written to the file /OutputFiles/AverageSalaryResults.csv. You can now go and check the same root directory where you created the InputFiles folder. You can also access the file using the local run results window just by right-clicking the output result path and selecting the Preview option to view the file.

As you can see, the resulting file, AverageSalaryResults.csv, contains the state-wide average salary calculation as shown in the image above.

Wrapping Up the First Part of Our U-SQL Tutorial

In this first part of our U-SQL tutorial, you have successfully:

• Configured a local sandbox for U-SQL in your development environment with a minimal effort and no cost (for free).

• Learned what “Inputs/Outputs” means, how they work and where they must be located (path) for your sandbox.

• Wrote and executed a simple query to read and retrieve data from an input file (CSV format), developing and understanding of the U-SQL processing and execution model.

In the future, we’ll review how to create a live Data Lake in Azure Data Analytics and how to run your U-SQL script in the cloud.

 

Arnoldo Perozo

Arnoldo Perozo is an engineer, a former college professor, and a fan of .NET Technologies. His experience ranges from working with traditional client-server models to the web, mobile, and cloud apps. Currently, Arnoldo is a full-stack developer at Gorilla Logic working with various cloud technologies and .NET Core.

Related Articles

Ready to be Unstoppable?

Partner with Gorilla Logic, and you can be.