SQL Server provides the GROUP
BY
clause as a means to take a set of detailed values and summarize them into aggregated values. The detailed records can be aggregated based on one column, as well as expressions, which are specified in the GROUP
BY
clause. This article explores how to return summarized data using the GROUP
BY
and HAVING
clauses.
Test data
In order to demonstrate how to use the GROUP
BY
and HAVING
clauses, some test data will be needed. All of the examples in this article will use a table called Invoice
that is created in tempdb
, using the code in Listing 1.
Listing 1: Code to create test data
USE tempdb; GO CREATE TABLE Invoice ( InvoiceID INT, InvDT DATE, InvAmt DECIMAL(10,2), PaymentDT Date, PaymentAmt DECIMAL(10,2), ProductName VARCHAR(20) ) GO INSERT INTO Invoice VALUES (1,'2021-02-14',121.11,'2021-03-01',121.11,'Thingamajig'), (2,'2021-02-14',159.16,'2021-03-13',159.16,'Watchamacallit'), (3,'2021-02-14',75.59,'2021-03-03',75.16,'Dohicky'), (4,'2021-02-14',121.11,'2021-03-01',121.11,'Thingamajig'), (5,'2021-02-14',75.59,'2021-02-21',75.16,'Dohicky'), (6,'2021-03-03',159.16,Null,null,'Watchamacallit'), (7,'2021-03-11',75.59,'2021-03-30',75.16,'Dohicky'), (8,'2021-02-14',121.11,'2021-03-01',121.11,'Thingamajig'), (9,'2021-03-11',159.16,null,null,'Watchamacallit'); GO
The code in Listing 1 created nine different invoice records in the Invoice
table. Each row contains a single invoice record for one of three different ProductName’s
: Thingamajig
, Watchamacallit
, and Dohicky
. Each row has an InvDT
, and an InvAmt
, where the InvDT
identifies the date the product was sold, and the InvAmt
identifies the amount the product sold for. Each product sells for a different price, but each sale of a specific product sells for the same amount. The PaymentDT
and PaymentAmt
identify the date and amount each payment was made. A couple of the invoices have not been paid, hence why their PaymentDT
and PaymentAmt
are null.
A few different examples will be provided that use the test data in the Invoice
table to demonstrate how the GROUP
BY
and HAVING
clauses work. If you would like to run the example code provided in this article, you can use the code in Listing 1 to create the Invoice table on your instance of SQL Server.
Summarizing based on a single column
This first example shows how to write a GROUP
BY
clause that aggregates the rows in the Invoice
table based on a single column grouping specification. This example calculates the total and average invoice amount for each product identified in the Invoice table by using the code in Listing 2.
Listing 2: Grouping based on a single column
USE tempdb; GO SELECT ProductName, SUM(InvAmt) TotalInvAmt, AVG(InvAmt) AvgInvAmt FROM Invoice GROUP BY ProductName;
Report 1 shows the results when Listing 2 is executed.
Report 1: Results when Listing 2 is run
In Listing 2, the GROUP
BY
clause identifies only a single column named ProductName
. This told SQL Server to group the rows by ProductName when calculating the TotalInvAmt
and the AvgInvAmt
. The TotalInvAmt
for each of the different products was calculated using the SUM
function, whereas the AvgInvAmt
was calculated using the AVG
function. Since there were only three different products sold in the Invoice table, only three summarized amounts were returned, one for each ProductName
.
When a SELECT
statement contains a GROUP
BY
clause, the columns in the select list can only be those columns identified in the GROUP
BY
clause, or column values calculated using an aggregate function. If there is no GROUP
BY
clause, then only aggregate functions can be added to the SELECT
list and ORDER
BY
clause. In this case, the query will return one row and provide summary calculations over the entire result set. If a non-group by column is identified in the select list or ORDER
BY
clause, and it is not contained in an aggregate function, an error similar to the error in Report 2 will be produced.
Report 2: Error received when a column listed that is not contained in an aggregate function
Msg 8120, Level 16, State 1, Line 31 Column 'Invoice.InvDT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
The code Listing 3 was run to generate the error in Report 2.
Listing 3: Code containing selection list column not contained in aggregate function
USE tempdb; GO SELECT ProductName, InvDT, SUM(InvAmt) TotalInvAmt, AVG(InvAmt) AvgInvAmt FROM Invoice GROUP BY ProductName;
Returning a subset of summarized values
If only a subset of the summarized rows needs to be returned, then a WHERE
constraint can be used to identify the subset. The code in Listing 4 uses the WHERE
clause to only return the summarized values for the ProductName
Dohicky.
Listing 4: Constraining result set using WHERE clause
USE tempdb; GO SELECT ProductName, SUM(InvAmt) TotalInvAmt, AVG(InvAmt) AvgInvAmt FROM Invoice WHERE ProductName = 'Dohicky' GROUP BY ProductName;
When adding the WHERE
constraint to a SQL statement that contains a GROUP
BY clause, the WHERE
clause needs to be placed prior to the GROUP
BY
clause. The results shown in Report 3 are produced when the code in Listing 4 is executed.
Report 3: Output produced when Listing 4 is executed
Grouping based on multiple columns
There are times when data summarization needs to produce summarized values with more granularity than a single column. When this is needed, multiple columns can be specified in the GROUP
BY
criteria, as shown in Listing 5.
Listing 5: Summarizing based on ProductName and InvDT
USE tempdb; GO SELECT ProductName, InvDT, SUM(InvAmt) TotalInvAmt, AVG(InvAmt) AvgInvAmt FROM Invoice GROUP BY ProductName, InvDT;
In Listing 5, two columns, ProductName
and InvDT
, were identified in the GROUP
BY
clause. This tells SQL Server to summarize the test data based on each unique combination of these two column values. Report 4 contains the output created when Listing 5 is executed.
Report 4: Output when Listing 5 is run
Sorting summarized data
You may have noticed that in the prior example, the output in Report 4 was not sorted by ProductName
and InvDT
, even though the data was grouped and summarized by ProductName
and InvDT
. The ordering of the output is based on the unique combination of the grouped columns, and not the individual values of the columns in the GROUP
BY
clause. The only way to guarantee the output’s order is to have an ORDER
BY
clause. The code in Listing 6 groups the data the same as in Listing 5, but the output is produced in a different order because an ORDER
BY
clause has been added in Listing 6.
Listing 6: Sorting summarized data
USE tempdb; GO SELECT ProductName, InvDT, SUM(InvAmt) TotalInvAmt, AVG(InvAmt) AvgInvAmt FROM Invoice GROUP BY ProductName, InvDT ORDER BY ProductName, InvDT;
When using the ORDER
BY
clause, it must be specified after the GROUP
BY
clause. Report 5 shows the output produced when the code in Listing 6 is executed.
Report 5: Output with Listing 6 is run
By using the ORDER
BY
clause, the output in Report 5 is sorted by ProductName
and InvDT
.
Using an expression in the GROUP BY clause
Expressions can also be used in the GROUP
BY
clause to identify the values to summarize on. To show how an expression can be used in the GROUP
BY
clause, look at the code in Listing 7.
Listing 7: Using Expression in GROUP BY clause
USE tempdb; GO SELECT CAST(YEAR(InvDT) as CHAR(4)) + '-' + RIGHT(MONTH(InvDT) + 100,2) YearMonth, SUM(InvAmt) TotalInvAmt , AVG(InvAmt) AvgInvAmt FROM Invoice GROUP BY CAST(YEAR(InvDT) as CHAR(4)) + '-' + RIGHT(MONTH(InvDT) + 100,2);
The code in Listing 7 summarizes the rows in the Invoice
table based on the year and month of the invoice. The year and month values are calculated from the InvDT
using the following expression:
CAST(YEAR(InvDT) as CHAR(4)) + ‘-‘ + RIGHT(MONTH(InvDT) + 100,2);
This expression was specified on the GROUP
BY
clause to identify the values for summarizing. Remember that any column specified in the selection list either needs to be used with an aggregate function or by one of the GROUP
BY
values. Therefore, the expression identified can be included in the selection list, so the year and month values could be displayed in the result set, as shown in Report 6.
Report 6: Output when Listing 7 is run
Grouping NULL values
When comparing two different null values by design are not equal, but when grouping on a column that contain null values, the nulls are grouped together as if they are equal. To demonstrate, run the code in Listing 8.
Listing 8: GROUP ON column with null values
USE tempdb; GO SELECT PaymentDT, MIN(PaymentAmt) MinPaymentAmt FROM Invoice GROUP BY PaymentDT;
Report 7 shows results when Listing 8 is run.
Report 7: Output when Listing 8 is executed
There is a row in Report 7 that has NULL
for PaymentDT
column. This row was created when grouping together the two rows that don’t have a PaymentDT
value. The value of NULL
under the MinPaymentAmt
column occurred because null values are ignored for all aggregate functions, except the COUNT(*)
function.
Because nulls are grouped together, and most aggregate functions ignore null values, summarizing results using a GROUP
BY
clause might create some summarized values that some might consider inaccurate. To see how aggregate functions ignore nulls, review the rows with null values in the sample data by running the code in Listing 9.
Listing 9: Displaying sample data
USE tempdb; GO SELECT ProductName, PaymentAmt FROM Invoice WHERE ProductName = 'Watchamacallit';
Report 8 shows the output created when Listing 9 is executed.
Report 8: Output when Listing 9 is run
In Report 8, the PaymentAmt
is null for two of the rows in the result set. To demonstrate how aggregate functions ignore nulls, consider the code in Listing 10.
Listing 10: Calculating average
USE tempdb; GO SELECT AVG(PaymentAmt) AvgPaymentAmt FROM Invoice WHERE ProductName = 'Watchamacallit';
When Listing 10 is executed, it returns the value 159.160000 for the AvgPaymentAmt
. Is that the correct average amount? If you expected the average to be calculated based on three rows with a ProductName
of Watchamacallit, then it is not correct. Because SQL Server ignores nulls when using the AVG
function, it doesn’t consider the two rows with nulls when determining the average amount. Since nulls are ignored when summarizing the data, the average value of 159.16 is correct.
Using the GROUP
BY
clause in conjunction with the AVG
function to calculate the average for each ProductName
also ignores the null values, as can be seen by reviewing the output created when the code in Listing 11 is executed.
Listing 11: Calculating the average for each ProductName
USE tempdb; GO SELECT ProductName, AVG(PaymentAmt) AvgPaymentAmt FROM Invoice GROUP BY ProductName;
Report 9 shows out output when Listing 11 is run.
Report 9: Output when Listing 10 is executed
If the rows with null values need to be considered when calculating the average, then the nulls need to be converted to 0.00 prior to the AVG
function being processed, as is shown in Listing 12.
Listing 12: Considering rows with a null value when calculating average
USE tempdb; GO SELECT ProductName, AVG(ISNULL(PaymentAmt, 0.00)) AvgPaymentAmt FROM Invoice GROUP BY ProductName;
Report 10 shows the output when Listing 12 is executed.
Report 10: Output when Listing 12 is run
The code in Listing 12 used the ISNULL
function to convert null values for the PaymentAmt
column to 0.00 prior to the AVG
function being applied to calculate the average amount. By doing this, the AVG
function considered all rows with ProductName
Watchamacallit rows when calculating the AvgPaymentAmt
.
HAVING clause
The HAVING
clause is used to filter out summarized rows created by the GROUP
BY
clause. The WHERE
clause is similar to the HAVING
clause. The difference is that the WHERE
clause filters out rows prior to them being summarized, whereas the HAVING
clause filters the rows after they have been summarized into the groups identified in the GROUP
BY
clause. To demonstrate how the HAVING
clause filters out summarized rows, consider the code in Listing 13.
Listing 13: Using the HAVING clause
USE tempdb; GO SELECT ProductName, AVG(ISNULL(PaymentAmt, 0.00)) AvgPaymentAmt FROM Invoice GROUP BY ProductName HAVING AVG(ISNULL(PaymentAmt, 0.00)) > 60;
Report 11 shows output when Listing 13 is run.
Report 11: Output when Listing 13 is executed
The HAVING
clause in Listing 13 only returned summarized rows that had an AvgPaymentAmt
greater than 60. By comparing Report 11 to Report 10, you can see the Whatchamacallit row was removed by the HAVING
clause.
Summarizing data using GROUP BY and HAVING clauses
This article covered how to take detail data and roll it up to produce summarized data using the GROUP
BY
and HAVING
clauses. Data can be summarized based on a single column or multiple columns, and even expressions can be used. Null values cannot be compared because they are unknown values. But when using the GROUP
BY
clause, null values are treated as if they are the same value for aggregation purposes. Using the GROUP
BY
clause does cause SQL Server to sort the detail rows to perform aggregation, but only by including the ORDER
BY
clause can the output order be guaranteed. Next time you need to summarize, filter, and display aggregated data, you will know how to use the GROUP
BY
and HAVING
clauses in conjunction with the WHERE
and ORDER
BY
clauses to produce the rolled-up data you require.
If you liked this article, you might also like The ins and outs of joining tables in SQL Server .
The post Summarizing data using GROUP BY and HAVING clauses appeared first on Simple Talk.
from Simple Talk https://ift.tt/cruK8DX
via
No comments:
Post a Comment