Wednesday, September 14, 2022

T-SQL Tuesday #154 – SQL Server 2022, IS DISTINCT FROM

Despite my recent job change from full time T-SQL code jockey/architect to website editor for Simple-Talk, I will always be at heart at T-SQL programmer. While I will mostly be writing professionally in the foreseeable future will in support of an article, I will also continue writing code for several databases that I use for multiple personal reason housed on a SQL Server Express server on the desk beside me.

The Invitation: T-SQL Tuesday #154 Invitation – SQL Server 2022 (glennsqlperformance.com)

When I saw the topic of this T-SQL Tuesday was to write about what I have been doing with SQL Server 2022, I figured I would note the two things I have been doing recently. First, I wrote a chapter about security changes in SQL Server 2002 for a forthcoming book entitled “SQL Server 2022 Administration Inside Out” for Pearson with a great group of people that should be out later this year, early next at the latest. There are a few things I found out writing that chapter that I am keen to use, one of them being the more granular UNMASK permissions for the Dynamic Data Masking feature, but since I won’t be writing multi-user production code in the future, I probably won’t be masking any data, much less need granular masking capabilities.

The other thing I have been doing is trying out some of the new features coming in SQL Server 2022 that I will absolutely be using even in my hobby databases. There are tons of new features in 2022, as there always is. But the ones that excite me are the T-SQL improvements. In this article I am going to highlight 1 feature that is immediately a standout.

IS (NOT) DISTINCT FROM

The MOST exciting change from a T-SQL standpoint is: IS NOT DISTINCT FROM. This feature solves an age-old issue for T-SQL programmers and is worth its weight in gold. It is basically an equals comparison operator like =, but treats NULL as an individual value. Unlike =, this new operator returns only TRUE or FALSE, but not UNKNOWN. Writing queries that compare to values that can contain NULL is tedious, mostly because of code like the following:

SELECT CASE WHEN 1 = NULL THEN 'True' Else 'False' end,
       CASE WHEN NOT(1 = NULL) THEN 'True' Else 'False' end

The fact that both of these comparisons return False is confusing at times even to me, and I have written on the whole NULL comparison and negating NULL values things about as many times as I have dealt with it in production code. But using IS DISTINCT FROM, this is no longer the case:

SELECT CASE WHEN 1 IS DISTINCT FROM NULL 
               THEN 'True' Else 'False' end,
       CASE WHEN NOT 1 IS DISTINCT FROM NULL
               THEN 'True' Else 'False' end

Where this is really important is doing a query where you are looking for differences between two sets of data (often for a merge type operation). So consider the following table (from WideWorldImporters, which you can get here) :

SELECT COUNT(*), 
       SUM(CASE WHEN LatestRecordedPopulation IS NULL 
           THEN 1 ELSE 0 END)
FROM Application.Cities;

This returns 37940 total rows and 11048 rows with a NULL population value. Now, let’s join the table to itself on the PK and the population value

SELECT *
FROM   Application.Cities
          JOIN Application.Cities AS C2
            ON C2.CityID = Cities.CityID
WHERE.LatestRecordedPopulation 
                         = Cities.LatestRecordedPopulation;

This returns 26892 rows, which you can do the math, is 37940-11048. Looking at this, without thinking about NULL values (who does initially?), this has to return every row in the table. But clearly not. Usually this becomes obvious when a few customers living in one of those cities isn’t showing up on a report (or maybe even not getting their shipments.)

The pre-SQL Server 2022 way of handling this properly this was to do something like this:

SELECT *
FROM   Application.Cities
          JOIN Application.Cities AS C2
               ON C2.CityID = Cities.CityID
                  AND C2.LatestRecordedPopulation = 
                                      Cities.LatestRecordedPopulation 
                    OR (C2.LatestRecordedPopulation IS NULL
                        AND Cities.LatestRecordedPopulation IS NULL);

Now we have checked the either they are the same value, or they both have a value of NULL. This query returns every row in the table, but it is kind of tricky code. And looking for differences is even more difficult, because you have to check to see if the values are different, if column 1 is null and column2 is not, and again vice versa. Another way this is often done is to change the population comparison to

AND COALESCE(C2.LatestRecordedPopulation.-100) = 
             COALESCE(Cities.LatestRecordedPopulation,-100)

Which is safe from a correctness standpoint (assuming you can coalesce your values to something that is 100% not possible), but not from a performance one. This eliminates index seek utilization for these columns and makes it slower. That isn’t always an issue, but for larger data sets, you may end up with more scans than you hoped.

Using the new syntax, we can simply write this as:

SELECT *
FROM   Application.Cities
         JOIN Application.Cities AS C2
           ON C2.CityID = Cities.CityID
              AND C2.LatestRecordedPopulation 
                    IS NOT DISTINCT FROM 
                           Cities.LatestRecordedPopulation;

The name of the operator might be a little bit confusing because of the words FROM and DISTINCT, it really makes sense. DISTINCT has a seemingly different usage here, but really it is the same meaning. If the value is the same, it is not distinct from one another, and if it is different, it is distinct. And the DISTINCT operator in the SELECT clause honors NULL values as a single bucket too. Now, go back and change the previous query to IS DISTINCT FROM and 0 rows will be returned.

This feature would have saved me many many hours over the years! Is this alone a reason to upgrade to SQL Server 2022 alone? Since I have never been the one to write those checks, and I use the free Express edition for my hobby databases… I can say an unqualified “Yes” to that!

 

The post T-SQL Tuesday #154 – SQL Server 2022, IS DISTINCT FROM appeared first on Simple Talk.



from Simple Talk https://ift.tt/KlvCY6d
via

No comments:

Post a Comment