
But the Database Worked in Development! Checking for Duplicates

Like all grownup database developers, you will do your best to ensure that there are unique constraints and unique indexes in place to prevent duplicates ever getting into the database. These checks are essential, and it takes the RDBMS very little effort to make them when data is inserted. Unless you are working with a very good business analyst, you’ll always miss a few when you create a database, so adding or altering these duplicate checks is an essential part of database maintenance.
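
As a reminder of the two forms these checks take, here is a minimal sketch, using a purely hypothetical dbo.Customer table: a UNIQUE constraint (which creates a unique index behind the scenes) and an explicitly declared unique index.

CREATE TABLE dbo.Customer
  (
  CustomerID INT IDENTITY CONSTRAINT PK_Customer PRIMARY KEY, --unique by definition
  Email NVARCHAR(256) NOT NULL
    CONSTRAINT AK_Customer_Email UNIQUE, --a unique constraint, backed by a unique index
  NationalIDNumber NVARCHAR(15) NOT NULL
  );
--...or the same sort of rule declared directly as a unique index
CREATE UNIQUE NONCLUSTERED INDEX UQ_Customer_NationalIDNumber
  ON dbo.Customer (NationalIDNumber);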

Where you were able to do development work on the actual production database, you could check for and remove existing duplicates before you added these constraints, and then provide a data-migration script to go with the deployment. Nowadays, it is more awkward. You’ll probably be required to work on masked or entirely generated data. You cannot check the data if it isn’t exactly what is on production, and it is difficult to predict what could be lurking in the production data.

Now, you have to do all your development work on the constraints and rely on someone else to run the code to remove the duplicates. This is potentially awkward, unless you change your habits to allow easier cooperation between teams. You can, however, very easily generate the code to do these checks, and, even if someone else runs the code, have enough detail about the offending rows to be able to correct the data in a migration script, which can be tested by Ops people on staging.

The same applies to an automated build process. When you’re building a database and then loading in data, it’s much better to check a dataset before constraints are enabled. Once again, what you need is a test script that runs the checks and reports on all the offending rows and only enables constraints if there is no data that would cause an error.
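
In outline, such a check is just a conditional enable. Here is a minimal sketch for a single unique index, reusing the hypothetical dbo.Customer table and AK_Customer_Email index from above; the rest of this article is about generating this kind of check automatically for every unique index in the database.

--A sketch of the 'check before you enable' pattern for one unique index
--(AK_Customer_Email on dbo.Customer is a hypothetical name, used for illustration)
IF NOT EXISTS
  (SELECT Email FROM dbo.Customer
     GROUP BY Email HAVING Count(*) > 1) --any duplicate email addresses?
  ALTER INDEX AK_Customer_Email ON dbo.Customer REBUILD; --safe to enable
ELSE
  RAISERROR ('Duplicates in dbo.Customer (Email); index left disabled', 16, 1);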

The first article in the series went over the principles of how you might do this as part of a deployment process, with check constraints. Now we move on to checking for duplicates, as defined by your unique indexes.

Reporting on what must be unique in a database

To determine what is defined as being unique in a database, you need to look at both unique constraints that cause a unique index to be created automatically, and unique indexes that are declared directly and explicitly. (All code samples in this series can be found on GitHub.)

DROP PROCEDURE IF EXISTS #ListAllUniqueIndexes
GO
CREATE PROCEDURE #ListAllUniqueIndexes
/**
Summary: >
  This creates a JSON list of all the unique indexes in the database. 
  This includes indexes in the database that have been explicitly 
  declared as well as those that have been automatically created 
  to enforce UNIQUE or PRIMARY KEY constraints
Author: Phil Factor
Date: 12/12/2019
Example:
   - DECLARE @OurListOfUniqueIndexes  NVARCHAR(MAX)
     EXECUTE #ListAllUniqueIndexes 
            @TheJsonList=@OurListOfUniqueIndexes OUTPUT
     SELECT @OurListOfUniqueIndexes AS theUniqueIndexes
   - DECLARE @OurListOfUniqueIndexes  NVARCHAR(MAX)
     EXECUTE #ListAllUniqueIndexes 
            @TheJsonList=@OurListOfUniqueIndexes OUTPUT
     SELECT * FROM OpenJson( @OurListOfUniqueIndexes) 
     WITH (columncount INT, indexname sysname, 
           thetable sysname, columnlist NVARCHAR(4000)
           ,delimitedlist nvarchar(4000))
Returns: >
  the JSON as an output variable
**/
@TheJSONList NVARCHAR(MAX) OUTPUT
AS 
SELECT @TheJSONList=
 (SELECT Count(*) AS columncount, IX.name AS indexname,
        QuoteName(Object_Schema_Name(IX.object_id)) + '.'
        + QuoteName(Object_Name(IX.object_id)) AS thetable,
        String_Agg(QuoteName(col.name), ',') AS columnlist,
        ''''+String_Agg(Replace(col.name,'''',''''''),''',''')+'''' 
               AS delimitedlist
        FROM sys.tables AS tabs
          INNER JOIN sys.indexes AS IX
            ON IX.object_id = tabs.object_id
          INNER JOIN sys.index_columns AS IC
            ON IC.index_id = IX.index_id AND IC.object_id = IX.object_id
          INNER JOIN sys.columns AS col
            ON col.column_id = IC.column_id 
            AND col.object_id = IC.object_id
        WHERE is_unique = 1 -- we only need the indexes that enforce
        -- uniqueness. We've chosen to test both the enabled ones
        -- and the disabled ones
        --AND Is_disabled=0
        GROUP BY IX.index_id, IX.object_id, IX.name FOR JSON AUTO)
GO

We can then execute it like this:

DECLARE @OurListOfUniqueIndexes NVARCHAR(MAX);
EXECUTE #ListAllUniqueIndexes 
       @TheJSONList = @OurListOfUniqueIndexes OUTPUT;
SELECT @OurListOfUniqueIndexes;

Note that we’ve chosen to comment out the line that delivers just the enabled constraints. It is a matter of choice, but if an index is there and disabled, I regard it with a certain suspicion.
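
If you share that suspicion, a quick query against sys.indexes will list any unique indexes (including those backing constraints) that are currently disabled; a small sketch:

--list unique indexes that are disabled, but are meant to enforce uniqueness
SELECT QuoteName(Object_Schema_Name(object_id)) + '.'
       + QuoteName(Object_Name(object_id)) AS thetable,
       name AS indexname, is_primary_key, is_unique_constraint
  FROM sys.indexes
  WHERE is_unique = 1 AND is_disabled = 1;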

This will produce a JSON report like this (I’ve just included a few rows):

[
 {
  "columncount": 1,
  "indexname": "PK_SalesTaxRate_SalesTaxRateID",
  "thetable": "[Sales].[SalesTaxRate]",
  "columnlist": "[SalesTaxRateID]",
  "delimitedlist": "'SalesTaxRateID'"
 },
 {
  "columncount": 2,
  "indexname": "AK_SalesTaxRate_StateProvinceID_TaxType",
  "thetable": "[Sales].[SalesTaxRate]",
  "columnlist": "[StateProvinceID],[TaxType]",
  "delimitedlist": "'StateProvinceID','TaxType'"
 },
 {
  "columncount": 1,
  "indexname": "AK_SalesTaxRate_rowguid",
  "thetable": "[Sales].[SalesTaxRate]",
  "columnlist": "[rowguid]",
  "delimitedlist": "'rowguid'"
 },
 {
  "columncount": 2,
  "indexname": "PK_PersonCreditCard_BusinessEntityID_CreditCardID",
  "thetable": "[Sales].[PersonCreditCard]",
  "columnlist": "[BusinessEntityID],[CreditCardID]",
  "delimitedlist": "'BusinessEntityID','CreditCardID'"
 },
 {
  "columncount": 3,
  "indexname": 
       "PK_PersonPhone_BusinessEntityID_PhoneNumber_PhoneNumberTypeID",
  "thetable": "[Person].[PersonPhone]",
  "columnlist": "[BusinessEntityID],[PhoneNumber],[PhoneNumberTypeID]",
  "delimitedlist": "'BusinessEntityID','PhoneNumber'
        ,'PhoneNumberTypeID'"
 }
]

This is one of the reports that I do routinely after a successful build and before data is loaded. This JSON file will usually become part of the build package during deployment. In a sense, it is part of the documentation of the database.

We need to check the data and report on all the tables and their duplicate rows. We do this after the data is loaded and before constraints are enabled. If we are doing a release by synchronizing with an existing database, we synchronize with a version of the database that has its constraints disabled, and run the checks before a post-deployment script enables them.

We can, of course, generate the report and then use it immediately. We do this if we have built the database, ensured that constraints are disabled and then loaded the data.
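
If you need to get the database into that state in the first place, here is a minimal sketch of generating the statements that disable every nonclustered unique index before the load (disabling a clustered index would make its table inaccessible, so those are left alone); the matching REBUILD statements re-enable them afterwards. Inspect the generated script before you run it.

--Generate ALTER INDEX ... DISABLE statements for nonclustered unique indexes
SELECT 'ALTER INDEX ' + QuoteName(ix.name) + ' ON '
       + QuoteName(Object_Schema_Name(ix.object_id)) + '.'
       + QuoteName(Object_Name(ix.object_id)) + ' DISABLE;'
  FROM sys.indexes AS ix
    INNER JOIN sys.tables AS tabs ON tabs.object_id = ix.object_id
  WHERE ix.is_unique = 1 AND ix.type_desc = 'NONCLUSTERED'
    AND ix.is_disabled = 0;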

Here is a batch that produces a JSON report. The batch not only checks for duplicates in the data but also checks that the table and columns mentioned in the constraint exist. If it can’t run the test because the columns aren’t there, then it reports the fact. It makes as few assumptions as possible.

Note that this script must be checked with your security team because actual data for the relevant columns of duplicate rows is stored in the resulting report. There is no way around this, but the risk of ‘leakage’ is low, and you can keep almost all sensitive data at arm’s length.

I’ve now got this all wrapped up in a temporary stored procedure, TestAllUniqueIndexes.sql, which is available on GitHub.

DECLARE @OurListOfUniqueIndexes NVARCHAR(MAX);
EXECUTE #ListAllUniqueIndexes 
        @TheJSONList = @OurListOfUniqueIndexes OUTPUT;
DECLARE @TheUniqueIndexes TABLE --the list of unique 
    --indexes in the database
  (
  TheOrder INT IDENTITY PRIMARY KEY,--needed to iterate 
         --through the table
  ColumnCount INT, --the number of columns used in the index
  IndexName sysname, --the name of the index
  TheTable sysname, --the quoted name of the table with the schema
  ColumnList NVARCHAR(4000), --the list of columns in the index
  DelimitedList NVARCHAR(4000) --the columns as a list of quoted, delimited string literals
 );
INSERT INTO @TheUniqueIndexes (ColumnCount, IndexName, 
        TheTable, ColumnList, DelimitedList)
  SELECT * FROM OpenJson(@OurListOfUniqueIndexes)
  WITH
    (columncount INT, indexname sysname, 
     thetable sysname, columnlist NVARCHAR(4000)
     ,delimitedlist nvarchar(4000)
  );
DECLARE @Breakers TABLE (TheObject NVARCHAR(MAX));
DECLARE @Errors TABLE ([Description] NVARCHAR(MAX));
DECLARE @Duplicates NVARCHAR(MAX); --list of duplicate rows
DECLARE @ExecString NVARCHAR(MAX); --the dynamic SQL that finds
                                   --the duplicates
DECLARE @indexName sysname; --to hold the value when iterating
                            --through the result
DECLARE @tablename sysname; --to hold the value when iterating
                            --through the result
DECLARE @CheckExecString NVARCHAR(MAX); --the dynamic SQL that
                                        --checks the cols exist
DECLARE @columnList NVARCHAR(4000); --to hold the value when iterating
                                    --through the result
DECLARE @columnDuplicates NVARCHAR(MAX);--the duplicate rows found for
                                        --the current index, as JSON
DECLARE @AllColumnsAndTableThere INT; --1 if the table and columns exist
DECLARE @iiMax INT = @@RowCount;
DECLARE @ii INT = 1;
WHILE (@ii <= @iiMax)
  BEGIN
    SELECT @CheckExecString ='SELECT @AllColumnsThere =
  CASE WHEN '+Convert(varchar(3),ColumnCount)+' =
  (
  SELECT Count(*) FROM sys.columns AS c
    WHERE c.name IN ('+delimitedList+')
      AND Object_Id('''+TheTable+''') = c.object_id
  ) THEN 1 ELSE 0 END;
', @ExecString =
      N'SET @duplicates=(SELECT top 50 Count(*) AS duplicatecount, '
      + ColumnList + N' FROM ' + TheTable + N' GROUP BY ' + ColumnList
      + N' HAVING Count(*) >1 FOR JSON auto)', @indexName = IndexName,
      @tablename = TheTable, @columnList = ColumnList
      FROM @TheUniqueIndexes
      WHERE TheOrder = @ii;
    EXECUTE sp_executesql @CheckExecString, 
              N'@AllColumnsThere int output',
      @AllColumnsThere = @AllColumnsAndTableThere OUTPUT;
    if @AllColumnsAndTableThere=1
      BEGIN
      EXECUTE sp_executesql @ExecString, 
              N'@duplicates NVARCHAR(MAX) output',
        @duplicates = @columnDuplicates OUTPUT;
       IF @columnDuplicates IS NOT NULL
        INSERT INTO @Breakers (TheObject)
          SELECT
            (
            SELECT @indexName AS indexName, 
              @tablename AS tablename,
              @columnList AS columnlist, 
              Json_Query(@columnDuplicates) AS duplicates
            FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
            );
      END
    ELSE
       INSERT INTO @errors(description) 
         SELECT 'Table ' + @Tablename
                + ' either didn''t exist or didn''t have the column(s) ' 
                + @columnlist + ', so index ' + @Indexname + ' was untested';
    SELECT @ii = @ii + 1; --do the next unique index in the list
  END;
SELECT (SELECT Json_Query(TheObject) AS duplicated 
        FROM @Breakers FOR  JSON auto) AS duplicatelist,
       (SELECT description FROM @Errors FOR JSON auto) AS errors 
       FOR JSON path;
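
To make the dynamic SQL a little less opaque, this is roughly what @ExecString expands to for a two-column index such as AK_SalesTaxRate_StateProvinceID_TaxType from the report above (the TOP 50 just keeps the report to a manageable size):

--what the generated duplicate check looks like for one two-column unique index
DECLARE @duplicates NVARCHAR(MAX);
SET @duplicates =
  (SELECT TOP 50 Count(*) AS duplicatecount, [StateProvinceID], [TaxType]
     FROM [Sales].[SalesTaxRate]
     GROUP BY [StateProvinceID], [TaxType]
     HAVING Count(*) > 1
     FOR JSON AUTO);
SELECT @duplicates; --NULL if there are no duplicates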

Testing it out: finding duplicate rows in AdventureWorks

Well, this is all pretty straightforward, but we need to be able to test it. We’ll take poor old AdventureWorks2016, which has some tables that are very handy for the purpose. The Sales.SalesTaxRate table has two unique indexes on top of its surrogate primary key. One of them, AK_SalesTaxRate_rowguid, ensures that the rowguid is unique. The other, AK_SalesTaxRate_StateProvinceID_TaxType, ensures that there is only one tax rate for any combination of tax type and state or province. We can try introducing a duplicate:

--disable all constraints
ALTER TABLE Sales.SalesTaxRate NOCHECK CONSTRAINT ALL;
--disable the unique constraints that aren't used as a primary key
ALTER INDEX AK_SalesTaxRate_StateProvinceID_TaxType 
      ON Sales.SalesTaxRate DISABLE;
ALTER INDEX AK_SalesTaxRate_rowguid ON Sales.SalesTaxRate DISABLE;
--deliberately create a duplicate 
INSERT INTO Sales.SalesTaxRate (StateProvinceID, TaxType, TaxRate, 
      Name, rowguid, ModifiedDate)
  SELECT StateProvinceID, TaxType, TaxRate, Name, rowguid, ModifiedDate 
  FROM Sales.SalesTaxRate
  WHERE SalesTaxRateID = 1;

This works because we’ve deliberately disabled the checks before we introduced the duplicate. We can confirm this by rebuilding the indexes, which re-enables them:

--enable the unique indexes
ALTER INDEX ALL ON Sales.SalesTaxRate REBUILD;
--at this point there is an error
/*
Msg 1505, Level 16, State 1
The CREATE UNIQUE INDEX statement terminated because a duplicate key 
was found for the object name 'Sales.SalesTaxRate' and the index name 
'AK_SalesTaxRate_StateProvinceID_TaxType'. 
The duplicate key value is (1, 1).
*/

So, we’ve now confirmed that this data will trigger an error in a build if we enable the unique constraints and indexes. We can now run the code to check all the tables in the database. Before we do this, however, here’s the code you can run after your tests to return AdventureWorks to its former state:

--we can do this to recover the table
DELETE FROM [Sales].[SalesTaxRate] WHERE SalesTaxRateID > 29;
--enable the unique indexes
ALTER INDEX ALL ON Sales.SalesTaxRate REBUILD;
--enable constraints
ALTER TABLE Sales.SalesTaxRate WITH CHECK CHECK CONSTRAINT ALL;

Running TestAllUniqueIndexes.sql produces a JSON report that tells us how to fix the problem.

[
  {
    "duplicatelist": [
      {
        "duplicated": {
          "indexName": "AK_SalesTaxRate_StateProvinceID_TaxType",
          "tablename": "[Sales].[SalesTaxRate]",
          "columnlist": "[StateProvinceID],[TaxType]",
          "duplicates": [
            {
              "duplicatecount": 2,
              "StateProvinceID": 1,
              "TaxType": 1
            }
          ]
        }
      },
      {
        "duplicated": {
          "indexName": "AK_SalesTaxRate_rowguid",
          "tablename": "[Sales].[SalesTaxRate]",
          "columnlist": "[rowguid]",
          "duplicates": [
            {
              "duplicatecount": 2,
              "rowguid": "683DE5DD-521A-47D4-A573-06A3CDB1BC5D"
            }
          ]
        }
      }
    ]
  }
]

This JSON report has two parts, the duplicate list and the errors. We can breathe a sigh of relief that there is no error list, which would have meant that the database had changed enough for columns or tables to have been renamed. There are two duplications in the list, because two unique indexes have been compromised by the row we inserted. The report tells you which values were duplicated, from the perspective of the unique constraint being checked.

This report is intended to give the developer enough detail to provide a script that corrects the data, but that script will need to be tested on a database with a good, realistic set of generated data, so you can run tests that ensure that only the duplicate rows are affected by your corrections.
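
As an illustration only, here is a sketch of the sort of correction such a report makes possible, assuming (and it is a big assumption) that the duplicate with the lowest SalesTaxRateID is the one to keep; in practice, the business rules decide which row survives, and the script must be tested first.

--Remove duplicates on (StateProvinceID, TaxType), keeping the lowest SalesTaxRateID
WITH Ranked AS
  (SELECT SalesTaxRateID,
          Row_Number() OVER (PARTITION BY StateProvinceID, TaxType
                             ORDER BY SalesTaxRateID) AS rn
     FROM Sales.SalesTaxRate)
DELETE FROM Ranked WHERE rn > 1;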

Automating the process

Because, by using JSON, we’ve kicked the problem of inputs to and outputs from procedures into touch, we can simplify the process. I’ve designed a process that runs as many setup and tear-down SQL files as you need, and which will do the following three essential processes:

  1. Check the new database after loading the data, but before the constraints and indexes are enabled. This runs a check that will tell you what you need to fix and put these into a report, for every database you specify on every server you specify. I realise that things start with just one database on one server but tend to escalate.
  2. Report on the constraints of a database. Check constraints, unique constraints and foreign key constraints are all reported on separately, in a file. The most important message is whether everything is OK, but all potential breaches of the proposed constraints are reported, along with any errors in the process.
  3. Use the report on the constraints to run checks on the data in the specified database.

Logically speaking, processes 2 and 3 are components of process 1. However, you are not always lucky enough, as a developer, to be in charge of loading and checking the data all in one process. Having them as two different processes gives you more flexibility to fit in with the deployment regime.

What we’ve done is to design stored procedures for testing, or gathering data for, the various constraints. Because they all take JSON input and JSON output, we can radically simplify the PowerShell script we use to do the checks. We must keep with a common convention for alerting the user to the presence of rows that would trigger constraint errors if constraints were enabled, and we need a common way of advising if we can’t even run the test.

Because there are a lot of files and code, I’ve created a GitHub site with all the code in it here.

Summary

There are two points in any build, test or deployment where you can get into difficulties with your data because you have duplicates, bad data or data that has lost its relational integrity: firstly, when you do a build, temporarily disable constraints and bulk-load (BCP) in the data; and secondly, when you synchronize with a version of the database that does more checking of the data. If you have existing bad data, you need a way of fixing it. To do that, you need to know about the data that would fail the tests that your constraints would apply if they were enabled.

We need slightly different ways of testing check constraints (bad data checks), unique constraints (duplicate checks) and foreign key constraints (relational integrity checks). We can store the list of constraints from the source database as a JSON file, and we can take this list as a source and store the result of our tests in a JSON report file.

To run all these tests in a flexible way that fits in with a wide range in methods of deployment, I’ve devised a general-purpose data-driven way of running these tests and reports in PowerShell.

In my final article, I’ll describe how to check foreign key constraints.

 
