Tuesday, June 27, 2023

Using a Subquery in a SELECT statement

This article is part of Greg Larsen's continuing series on Learning T-SQL. To see all 10 items in the series, click here.

Did you know you can include a SELECT statement within another SELECT statement? When a SELECT statement is embedded within another statement it is known as a subquery. There are two types of subqueries: basic subquery and correlated subquery.

In this article I will be discussing both types of subqueries and will be providing examples of how to use a subquery in different places within in a SELECT statement.

Difference between a basic subquery and a correlated subquery

A basic subquery is a “stand alone” SELECT statement that is embedded inside another SQL statement. By “stand alone”, I mean the SELECT statement that is embedded in another statement can be run independently from the statement in which it is embedded. A correlated subquery on the other hand is also a SELECT statement embedded within another query that has a dependence on the statement in which it is embedded. The correlated subquery cannot be run independently from the statement it is embedded because of the dependency.

A subquery and correlated subquery can be used anywhere in a SQL command that an expression can be used. Both types of subqueries are also known as an “inner query” or “inner select”. Whereas the SQL statement that contains the embedded subquery is known as the outer query. Subqueries are easy to spot because they are contained within a set of parentheses.

Depending on how the inner query is used in relationship to the outer query, will determine how many columns and values a subquery can return. For example, when a subquery is used in a WHERE cause that contains a comparison operator (=, !=, <, >, or >=) the subquery needs to return a single column value.

If the subquery is used with a comparison operator that supports multiple values, like an IN expression, then the subquery can return multiple values. A subquery can also return multiple columns when used in an EXISTS expression or when used in a derived table in the FROM clause.

To better understand how to use these two different types of subqueries in a SELECT statement let me go through a few subquery examples in the following sections.

Test data

All the examples in this article will run against that AdventureWorks2019 OLTP sample database. If you want to follow along and run the examples in this article you will need to download and restore the backup file for AdventureWorks2019 databases. The download link for this backup can be found on this web page, or by downloading using this link.

Ways to use subqueries

In the following sections, I will demonstrate several ways to use a subquery to create richer queries that I have demonstrated before. Subqueries let you have dynamic, data-driven queries where instead of a literal value (or list of values), you can dynamically get a value or list of values from a query.

Using a subquery in a column list

Both types of subqueries can be used in a column list. When a basic or correlated subquery is used in a column list it can only return a single value. The value returned will be incorporated into the result set as a column value in the outer query. To see basic subquery in a column list in action consider the code in Listing 1.

SELECT TerritoryID FROM Sales.Customer
        WHERE CustomerID = 29974;

Listing 1: Code to return a TerritoryID

The code in Listing 1 will return a TerritoryID for CustomerID = 29964. This query will be used as a subquery in Listing 2.

USE AdventureWorks2019; 
GO 
SELECT CustomerID,
       DueDate,        
       TotalDue,        
       (SELECT TerritoryID 
        FROM Sales.Customer         
        WHERE CustomerID = 29974) AS TerritoryID 
FROM Sales.SalesOrderHeader 
WHERE CustomerID = 29974;

Listing 2: Using a subquery in a column list

By Reviewing the code in Listing 2 you can see the code from Listing 1 is embedded in the column list. In Listing 2 the first three column values CustomerID, DueDate and TotalDue are returned from the Sales.SalesOrderHeader table. But the 4th column returned is from the results of the subquery. Or in this case the TerritoryID from the Sales.Customer table.

When the code in Listing 2 is executed the results in Report 1 are produced.

Report 1: Results when Listing 2 is run

The subquery that returns the TerritoryID column value is only evaluated one time, and the values returned from the subquery are then placed on every row returned from the outer query. Additionally worth mentioning is the basic subquery code can be run independently of the code in which it is embedded. Since this code can be run independently it make this embedded SELECT statement a basic subquery, and not a correlated subquery.

A correlated subquery can also be included in a column list. To demonstrate this, refer to Listing 3.

USE AdventureWorks2019;
GO
SELECT TOP 5 C.CustomerID, 
       (SELECT SUM(TotalDue) FROM Sales.SalesOrderHeader H
        WHERE H.CustomerID = C.CustomerID) AS TotalDue
FROM Sales.Customer C
ORDER BY TotalDue DESC;

Listing 3: Correlated subquery in selection list

When the code is Listing 3 is executed the results in Report 2 is displayed.

Report 2: Results when Listing 3 is run

The difference between a basic subquery and a correlated subquery is the correlate subquery references columns from the outer query. In Listing 3 the correlated subquery referenced the column C.CustomerID. The correlated subquery is run once for every row returned from the outer query. The correlated subquery calculates the total amount due for each customer selected. Another difference between a basic query and a correlated subquery is the correlated subquery cannot be run independently of the outer query without getting an error.

When a subquery or correlated subquery is used in the column list the embedded code can only bring back a single column and a single column value. If you try to return multiple columns, or multiple values an error will occur.

To show what will happen when multiple columns are requested in a subquery within a column list, I will run the code in Listing 4.

USE AdventureWorks2019;
GO
SELECT CustomerID,
       DueDate,
       TotalDue,
       (SELECT TerritoryID, StoreID FROM Sales.Customer
        WHERE CustomerID = 29974) AS TerritoryID
FROM Sales.SalesOrderHeader
WHERE CustomerID = 29974;

Listing 4: Trying to bring back multiple column values in subquery code

In the subquery in Listing 4, the StoreID column was added to the subquery. When this code is executed the error in Report 3 is produced.

Msg 116, Level 16, State 1, Line 7
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.

Report 3: Error when more than one column is identified in a column list subquery.

In order to bring back the StoreID value a second subquery can be added to the original code, as was done in Listing 5.

USE AdventureWorks2019;
GO
SELECT CustomerID,
       DueDate,
       TotalDue,
       (SELECT TerritoryID FROM Sales.Customer
        WHERE CustomerID = 29974) AS TerritoryID,
       (SELECT StoreID FROM Sales.Customer
        WHERE CustomerID = 29974) AS StoreID
FROM Sales.SalesOrderHeader
WHERE CustomerID = 29974;

Listing 5: Having multiple subqueries in a column list.

I’ll leave it up to you to verify that both the TerritoryID and StoreID values are returned using the two different subqueries in Listing 5.

Using a Subquery in a WHERE clause

A basic and correlated subquery can also be used in a WHERE statement. Depending on the operators used in the WHERE constraint will determine the number of values a subquery can return. If the subquery uses one of these operators: “=, !=, <, >, or >=" then only a single value can be returned from the subquery in the WHERE constraint. To show an example of using a basic subquery in the WHERE constraint that uses one of these operators review the code in Listing 6.

USE AdventureWorks2019;
GO
SELECT TOP 5 CustomerID, TerritoryID, TotalDue 
FROM  Sales.SalesOrderHeader
WHERE TerritoryID = (SELECT TerritoryID 
                     FROM Sales.SalesTerritory 
                     WHERE Name = 'Northeast');

Listing 6: Using a subquery in the WHERE constraint.

In Listing 6 the subquery code is associated with the WHERE constraint. In this case the subquery returns the TerritoryID for the “Northeast” region. The TerritoryID returned is then used in conjunction the equals (“=”) operator. When the code in Listing 6 is executed, it produces the output in Report 4.

Report 4: Output when Listing 5 is executed.

The operator used in the WHERE constraint in the prior example can return a single value because the operator “=” was used. But when an operator used in conjunction with the subquery can handle multiple values, then a subquery can bring back more than a single value. To show how multiple values can be used in a correlated subquery the “IN” operator will be used in the subquery in Listing 6.

USE AdventureWorks2019;
GO
SELECT TOP 5 CustomerID, TerritoryID, TotalDue 
FROM Sales.SalesOrderHeader
WHERE TerritoryID IN (SELECT TerritoryID 
                     FROM Sales.SalesTerritory 
                     WHERE Name Like '%east');

Listing 6: Using a correlated subquery with the IN operator

The subquery in listing 6 returns multiple TerritoryID values. Multiple values can be returned because the IN operator supports multiple values. When the code in Listing 6 is executed the results in Report 4 is produced.

Report 4: Output produced when the in operator is used.

Here you can see TerritoryID values of 2 and 5 were returned from the subquery, but since it is not a correlated subquery, you can also simply execute the code in the IN expression, something I frequently do by just highlighting the code in SSMS:

A picture containing text, screenshot, font, line Description automatically generated
The output of this code will be the TerritoryID values, which are, as expected, 2 and 5.

Using a Subquery in a FROM clause

A subquery can also be used in a FROM clause. When a subquery is used in a FROM clause the set created by the subquery is commonly called a derived table, which is stored in memory. In Listing 7 I have used a subquery to return a subset of the SalesOrderHeader rows for a specific date. The derived table is then joined with the SalesOrderDetail, and the Production.Product table to build the final result set.

USE AdventureWorks2019;
GO 
SELECT Header.SalesOrderID, 
       Header.OrderDate,  
       Name, 
       StandardCost 
FROM 
   (SELECT SalesOrderID, 
           OrderDate
    FROM Sales.SalesOrderHeader 
    WHERE OrderDate = '2012-08-21 00:00:00.000') AS Header
     JOIN Sales.SalesOrderDetail AS Detail
       ON Header.SalesOrderID = Detail.SalesOrderID
     JOIN Production.Product AS Product
       ON Detail.ProductID = Product.ProductID;

Listing 7: Subquery in the FROM clause

When the SELECT statement in Listing 7 is execute the results in Report 5 are produced.

Report 5: Results when Listing 7 is executed.

The SalesOrderID and OrderDate are the columns that came from the subquery. Writing a subquery that is used in a FROM clause should be avoid if there is another way to write the query. This is because there are no indexes on a derived table so performance might suffer.

A correlated subquery cannot be used in a FROM clause. The reason is because the correlated subquery cannot be evaluated for every row of the outer query.

Using a Subquery in a HAVING clause

A subquery can also be used in a HAVING clause. To show how a subquery can be used in a HAVING clause let’s first consider the SELECT statement in Listing 8.

USE AdventureWorks2019;
GO
SELECT COUNT(*) AS NumOfOrders
        FROM [Sales].[SalesOrderHeader]
        WHERE OrderDate = '2014-05-01 00:00:00.000';

Listing 8: Number of orders created for a specific date

The SELECT statement in Listing 8 identifies the orders that were created on May 1, 2014, which in this case where 227 orders created on that date. Suppose now you want to identify which OrderDates have more orders than the number of orders created on May 1, 2014 (which is 227). To do that you could take the query in Listing 8 and place it in the HAVING clause as a subquery, as I have done in Listing 9.

USE AdventureWorks2019;
GO
SELECT COUNT(*) AS NumOfOrders, OrderDate 
FROM [Sales].[SalesOrderHeader]
GROUP BY OrderDate
HAVING COUNT(*) >
       (SELECT COUNT(*) AS NumOfOrders
        FROM [Sales].[SalesOrderHeader]
        WHERE OrderDate = '2014-05-01 00:00:00.000')
ORDER BY OrderDate;

Listing 9: Using a subquery in a HAVING clause

When Listing 9 is run the results are shown in Report 6

Report 6: Results when Listing 9 is executed.

By reviewing Report 6 you can see that there are 6 OrderDate values that have processed more orders than on May 1, 2014.

A correlated subquery can also be used in a HAVING clause. To demonstrate this, suppose you want to return the SalesOrderID values in the Sales.SalesOrderHeader table that have more than 70 detailed records. The code in Listing 10 accomplishes this by using a correlated subquery in the HAVING clause.

USE AdventureWorks2019
GO
SELECT H.SalesOrderID 
FROM Sales.SalesOrderHeader AS H
GROUP BY SalesOrderID
-- Having more that 70 detail rows
HAVING (SELECT COUNT(*) 
        FROM Sales.SalesOrderDetail AS D
        WHERE H.SalesOrderID = D.SalesOrderID) > 70;

Listing 10: Correlated subquery in HAVING clause.

When the code in Listing 10 is run the four rows in Report 7 were found.

Report 7: The results when the code in Listing 10 is executed.

Using a subquery in a Function call

A subquery can also be used as a parameter of a function call. The code in Listing 11 shows how a subquery can be used as a parameter to the DATEDIFF function.

USE AdventureWorks2019;
GO
SELECT TOP 3 SalesOrderID,
             CustomerID,
             OrderDate,
             DATEDIFF
               (dd,
                OrderDate, 
                (SELECT Max(OrderDate) 
                 FROM Sales.SalesOrderHeader 
                 WHERE CustomerID = 29610)
                ) NumOfDays
FROM Sales.SalesOrderHeader
WHERE CustomerID = 29610;

Listing 11: Using a subquery in a function call

In Listing 11 the subquery determines the maximum order date for CustomerID = 29610. Which in this case is 2013-02-08. The DATEDIFF function then determines the number of days between that date and other orders for CustomerID 29610, and only returns the TOP three orders. When the code in Listing 11 is executed, the results are shown in Report 8.

Report 8: Results when Listing 10 is executed

Which in this case are the oldest three orders for CustomerID 29610.

Performance Considerations for Basic Subqueries

In the Microsoft documentation about subqueries found here, the following statement regarding performance is mentioned:

In Transact-SQL, there’s usually no performance difference between a statement that includes a subquery and a semantically equivalent version that doesn’t.

To validate that a subquery performs the same as an equivalent version that doesn’t use a subquery let’s review the code in Listing 12.

USE AdventureWorks2019;
GO
-- Subqeury
SELECT OD.*  FROM Sales.SalesOrderDetail AS OD
WHERE ProductID = (SELECT ProductID 
                   FROM Production.Product
                WHERE Name = 'AWC Logo Cap'); 
GO
-- Join query 
SELECT OD.*
FROM Sales.SalesOrderDetail AS OD
INNER JOIN 
Production.Product AS P
ON OD.ProductID = P.ProductID
WHERE P.Name = 'AWC Logo Cap';
GO

Listing 12: Two different but similar SELECT queries

In Listing 12 there are two different SELECT statements. The first SELECT statement uses a subquery to identify which Sales.SalesOrderDetail records to return. The second SELECT query returns uses a JOIN to identify the same set of Sales.SalesOrderDetail records. Both queries are equivalent. Meaning they have the same execution plan and return the same set of records.

Figure 1 show the actual execution plan for the first query, the SELECT statement with the subquery. Whereas Figure 2 shows the execution plan for the SELECT statement that uses the JOIN logic.

Figure 1: Actual execution plan for SELECT statement that contains the subquery

Figure 2: Actual execution plan for the SELECT statement

By reviewing both execution plans you can see that other than the text of the query, the plans are both exactly the same. It should be noted that as your queries grow more and more complex there is a breaking point where it is too costly to find the perfect plan.

Performance Issues when using correlated subqueries

When using a correlated subquery you need to worry about the number of rows in the outer query. When the outer query contains a small number of rows a correlated subquery doesn’t perform too bad. But as volume of rows in the outer query gets larger you will find a correlated subquery start to have performance issues.

This is because the inner query (correlated subquery) has to be evaluated for every row in the outer query. Keep this scale issue in mind when you are testing out correlated subqueries. Make sure you test your code against production size tables prior to promoting a code with a correlated subquery to into your production environment.

Keep in mind that as rowcounts increase, complexity can be a concern as well. For example, you can next subqueries many levels deeps such as:

SELECT <Columns>
FROM    Table1
WHERE  ColumnName IN 
            (SELECT ColumnName
             FROM    Table2
             WHERE  EXISTS (SELECT *
                            FROM Table3
                            WHERE Name IN (
                                   SELECT Name
                                   FROM   Table4…

This may be necessary to answer a question your users have, but at a certain point a query with too many subqueries can become complex to optimize and execute reasonably. In some cases, it may be useful to break your queries up and use temporary tables to capture rows to eliminate some subqueries.

Using a subquery in a SELECT statement

A subquery is a SELECT statement that is embedded in another SQL statement. A subquery can be a stand-alone SELECT statement (meaning it can be run independently of the outer query) or it can be a correlated SELECT statement (meaning it cannot be run independently of the outer query).

Depending on the operator in which a subquery or correlated subquery is used with, the subquery might only be allowed to return a single value or more than one value, and/or column. Subquery performance is typically the same as an equivalent query that uses a JOIN. In this article I only showed how to use subqueries in a SELECT statement. but subqueries can also be used in other statements like INSERT, UPDATE and DELETE.

 

The post Using a Subquery in a SELECT statement appeared first on Simple Talk.



from Simple Talk https://ift.tt/V710pg8
via

No comments:

Post a Comment