Tuesday, December 18, 2018

Constraining and checking JSON Data in SQL Server Tables

So you have a database with JSON in it. Can you validate it? I don’t mean just to ensure that it is valid JSON, but ensure that the JSON contains values that are legitimate. Are NI values, postcodes or bank codes valid? Can the dates or GUIDs be successfully parsed? Are those integers really integers? Are any of those dates of birth possible for a person who could be alive today? Are those part numbers valid?

The short answer is that if the JSON document represents a list or a table, then it is easy. If it is a complex hierarchy, you are better off doing it in PowerShell using JSON Schema. In this article, we'll show how a mixture of ordinary constraints and table variables can be used to achieve clean data.

Document columns

In relational databases, each column holds an ‘atomic’ value, meaning that it isn’t a collection of data items such as a list of values. A geographic location may be considered a collection of coordinate values, but from the data perspective it is just a location: nobody is ordinarily interested in the latitude or longitude in isolation, so from the perspective of the database users, the geographic location can be treated as ‘atomic’. You check it as such.

Putting a data document into a column is a slippery slope. By ‘data document’, I’m referring to a structured string from which more than one data element can be stored or extracted. A list is a simple example; an XML document or fragment is another, and so is JSON. Whereas you can enforce constraints on what gets stored in an ordinary relational column, or in an XML document with an XML schema, this isn’t so with a JSON document or a list. If circumstances force you to store JSON data in a table and it is not ‘atomic’, it is your responsibility to check that the data within it is clean. In other words, if you have to store such a thing, then you have to know what sort of data is in there, and must prevent other data from being stored there. You must, in fact, ‘constrain’ the data.
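By way of contrast, constraining an ordinary relational column is trivial, because the constraint can be declared inline with the column. A minimal sketch (the table and column names here are invented for illustration):

```sql
-- An 'atomic' column is easy to police: the CHECK constraint is declared
-- right alongside the column it protects.
CREATE TABLE Prices (
  PriceID INT IDENTITY PRIMARY KEY,
  Price DECIMAL(10,2) NOT NULL CHECK (Price >= 0) -- no negative prices
);
```

A data document enjoys no such built-in protection, which is why we have to build our own.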

Why the need to constrain values?

Why? Because the data is very easily presented in a relational format in views. There is no ‘Chinese wall’ between JSON data and your well-checked and disciplined relational data. It is there, and it might be wrong. If you fail to enforce the rules that underlie any datatype, then you can get errors in all sorts of places, including, heaven help us, financial reporting. To take a silly example, if someone slips a new weekday into your innocuous list of days of the week, your daily revenues will be wrong. An inadvertent negative value in revenue figures can cause executives to run up and down corridors shouting. It is not only financial data that has to be checked. If your organisation gets a request to remove someone’s data, you can remove their record and the rows in the associated tables, but how do you know that their data hasn’t leaked into other tables? Maybe you have a data document that is saved whenever an employee interacts with that customer? What if you decide to mask your data for development work, and it turns out that a customer can be identified from a data document containing ‘associated details’ in an address table?

The managers of any organisation are legally obliged to know where personal data is stored, and to ensure that it is held responsibly and securely. They must allow customers, members, clients or whatever to inspect and amend their data, and to have it deleted. The organisation relies on you to ensure that this corresponds to reality.

Constraints in databases aren’t a luxury. Your job, and the health of the organisation that employs you, depend on ensuring that your data is as clean as automation can manage.

Validating a data-document column

Leaving JSON to one side for a moment, let’s take the simplest sort of data document: a list.

Let’s imagine that you have a list of postcodes that you want to insert into a table to represent the route of a carrier’s delivery van. (We leave out all the other columns because they’re unlikely to be relevant.)

CREATE TABLE deliveryRoutes (
  routeid INT IDENTITY,
  Driverid INT,
  DriverRoute VARCHAR(2000)
);

You can check a single postcode such as 'CB4 0WZ' like this:

SELECT 
  CASE 
   WHEN 'CB4 0WZ' LIKE '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
     OR 'CB4 0WZ' LIKE '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
     OR 'CB4 0WZ' LIKE '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
   THEN 1 ELSE 0 END;

You can create a function to test a list of these postcodes like this:

Select dbo.IsValidPostcodelist (
  'SK1 3AU,BN27 3D,BT7 3GQ,NR1 3SR,SE9 5LB,PO4 0PX,BT78 5LU,W1F 7JR,B46 2QP,BL8 4AA,NR2 4AZ,CM 96G')

Here is the code. I’m using the STRING_SPLIT() function, introduced in SQL Server 2016, but you can use one of the older list-splitting techniques instead:

CREATE FUNCTION [dbo].[IsValidPostcodeList]
/**
Summary: >
  this routine takes a comma-delimited list of postcodes and returns 0 if any 
  of them is invalid. It returns 1 if the entire list of postcodes is valid
Author: phil factor
Date: 14/12/2018
Examples:
   - Select dbo.IsValidPostcodelist ('SK1 3AU,BN27 3D,BT7 3GQ,NR1 3SR,SE9 5LB,PO4 0PX,BT78 5LU,W1F 7JR,B46 2QP,BL8 4AA,NR2 4AZ,CM 96G')
   - Select dbo.IsValidPostcodelist ('SK1 3AU,BT7 3GQ,NR1 3SR,SE9 5LB,PO4 0PX,BT78 5LU,W1F 7JR,B46 2QP,BL8 4AA,NR2 4AZ')
Returns: >
  an integer: 1 if the entire list of postcodes is valid, otherwise 0
**/
(
    @postcodes Varchar(2000)
)
RETURNS INT
AS
BEGIN
IF EXISTS( SELECT * FROM STRING_SPLIT(@postcodes, ',')  
WHERE CASE WHEN value LIKE '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
             OR value  LIKE '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
             OR value  LIKE '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
               THEN 1 ELSE 0 END=0)
   RETURN 0
RETURN 1
END

With this function, we can now add a constraint to our table that checks the postcodes. If you insert a driver’s route that contains an invalid postcode, you get an error. Let’s try it out.

/* create a test table that uses this constraint */
IF Object_Id('dbo.deliveryRoutes') IS NOT NULL
        DROP TABLE deliveryRoutes
CREATE TABLE deliveryRoutes (
  routeid INT IDENTITY,
  Driverid int,
  DriverRoute varchar(2000) 
  CONSTRAINT for_Valid_Postcodes_In_List CHECK (dbo.IsValidPostcodelist(DriverRoute)=1),
  TheDate Date DEFAULT getdate());

-- this won't go in because a couple of these postcodes are invalid
INSERT INTO DeliveryRoutes (Driverid, driverroute) 
 SELECT 354,'SK1 3AU,BN27 3D,BT7 3GQ,NR1 3SR,SE9 5LB,PO4 0PX,BT78 5LU,W1F 7JR,B46 2QP,BL8 4AA,NR2 4AZ,CM 96G'
/*
Msg 547, Level 16, State 0, Line 78
The INSERT statement conflicted with the CHECK constraint "for_Valid_Postcodes_In_List". 
The conflict occurred in database "business", table "dbo.deliveryRoutes", column 'DriverRoute'.
The statement has been terminated.
*/
INSERT INTO DeliveryRoutes (Driverid, driverroute) 
 SELECT 354,'SK1 3AU,BT7 3GQ,NR1 3SR,SE9 5LB,PO4 0PX,BT78 5LU,W1F 7JR,B46 2QP,BL8 4AA,NR2 4AZ'
--that second one worked well

Try these out, altering the postcodes slightly, and you’ll see when one or more becomes invalid.
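If you want to experiment without the table, you can, of course, call the function directly. In this sketch, the second list has had the space removed from its first postcode, so the whole list is rejected:

```sql
-- Assuming dbo.IsValidPostcodeList has been created as above
SELECT dbo.IsValidPostcodeList('SK1 3AU,BT7 3GQ,NR1 3SR') AS AllValid,    -- 1
       dbo.IsValidPostcodeList('SK13AU,BT7 3GQ,NR1 3SR')  AS OneInvalid;  -- 0
```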

With this function in place, every member of the list gets checked to make sure that it is a valid postcode.

What if your developers are storing such a thing as a JSON string rather than as a comma-delimited list? That is probably an improvement, because a JSON array can store null values and is quicker to parse.

We can check it just as easily. First, we create the function to do it:

CREATE OR ALTER FUNCTION [dbo].[IsValidPostcodeJSONList]
/**
Summary: >
  this routine takes a JSON list of postcodes and returns 0 if any 
  of them is invalid. It returns 1 if the entire list of postcodes is valid
Author: phil factor
Date: 14/12/2018
Examples:
   - Select dbo.IsValidPostcodeJSONlist ('
      ["SK1 3AU","BN27 3D","BT7 3GQ","NR1 3SR","SE9 5LB","PO4 0PX",
      "BT78 5LU","W1F 7JR","B46 2QP","BL8 4AA","NR2 4AZ","CM 96G"]')
   - Select dbo.IsValidPostcodeJSONlist ('
      ["SK1 3AU","BT7 3GQ","NR1 3SR","SE9 5LB","PO4 0PX","BT78 5LU",
      "W1F 7JR","B46 2QP","BL8 4AA","NR2 4AZ"]')
Returns: >
  an integer: 1 if the entire list of postcodes is valid, 0 if not, and -1 if the string is not valid JSON
**/
(
    @postcodes Varchar(2000)
)
RETURNS INT
AS
BEGIN

IF IsJson(@postcodes)=0 RETURN -1 --not even valid JSON
IF EXISTS (SELECT * FROM OPENJSON(@postcodes)
      WHERE CASE WHEN value LIKE '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
             OR value LIKE '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
             OR value LIKE '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
               THEN 1 ELSE 0 END = 0) RETURN 0
RETURN 1
END

Now that we have the function, we can use it in an almost identical table:

/* create a test table that uses this constraint */
IF Object_Id('dbo.deliveryRoutesJSON') IS NOT NULL
        DROP TABLE deliveryRoutesJSON

CREATE TABLE deliveryRoutesJSON (
  routeid INT IDENTITY,
  Driverid int,
  DriverRoute NVARCHAR(MAX) 
  CONSTRAINT for_Valid_JSON_Postcodes_In_List CHECK (dbo.IsValidPostcodeJSONList(DriverRoute)=1),
  TheDate Date DEFAULT getdate());

INSERT INTO DeliveryRoutesJSON (Driverid, driverroute) 
 SELECT 354,'["SK1 3AU","BT7 3GQ","NR1 3SR","SE9 5LB","PO4 0PX","BT78 5LU",
      "W1F 7JR","B46 2QP","BL8 4AA","NR2 4AZ"]'
 --this first one worked well
 
 -- this won't go in because a couple of these are wrong   
INSERT INTO DeliveryRoutesJSON (Driverid, driverroute) 
 SELECT 354,'["SK1 3AU","BN27 3D","BT7 3GQ","NR1 3SR","SE9 5LB","PO4 0PX",
      "BT78 5LU","W1F 7JR","B46 2QP","BL8 4AA","NR2 4AZ","CM 96G"]'
/*
Msg 547, Level 16, State 0, Line 108
The INSERT statement conflicted with the CHECK constraint "for_Valid_JSON_Postcodes_In_List".
The conflict occurred in database "business", table "dbo.deliveryRoutesJSON", column 'DriverRoute'.
The statement has been terminated.
*/

All we are doing is iterating through the values, checking each in turn.
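When a list is rejected, it can be handy to see which values are at fault. A diagnostic query along the same lines (not part of the constraint itself, and reusing the same LIKE patterns) might look like this:

```sql
DECLARE @postcodes NVARCHAR(2000) = N'["SK1 3AU","BN27 3D","CM 96G"]';
SELECT value AS BadPostcode
  FROM OPENJSON(@postcodes)
  WHERE value NOT LIKE '[A-Z][A-Z0-9] [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
    AND value NOT LIKE '[A-Z][A-Z0-9]_ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]'
    AND value NOT LIKE '[A-Z][A-Z0-9]__ [0-9][ABD-HJLNP-UW-Z][ABD-HJLNP-UW-Z]';
-- lists 'BN27 3D' and 'CM 96G'
```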

We’re still coping as the examples become more complex than a simple list. The JSON could, however, be anything. If it represents tabular data, then it is easy.

Imagine that we have to validate the JSON before we insert it, and that the JSON consists of a name, a birthdate, a gender, a GUID and a modification date:

[{"SalesPerson":"Suroor R Fatima","birthdate":"1978-02-25","Gender":"M","rowguid":"14010B0E-C101-4E41-B788-21923399E512","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"John T Campbell","birthdate":"1956-08-07","Gender":"M","rowguid":"D4ED1F78-7C28-479B-BFEF-A73228BA2AAA","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Lori A Kane","birthdate":"1980-07-18","Gender":"F","rowguid":"23D436FC-08F7-4988-8B4D-490AA4E8B7E7","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Karen A Berg","birthdate":"1978-05-19","Gender":"F","rowguid":"45C3D0F5-3332-419D-AD40-A98996BB5531","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Alice O Ciccu","birthdate":"1978-01-26","Gender":"F","rowguid":"7E632B21-0D11-4BBA-8A68-8CAE14C20AE6","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Sandeep P Kaliyath","birthdate":"1970-12-03","Gender":"M","rowguid":"606C21E2-3EC0-48A6-A9FE-6BC8123AC786","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"David M Bradley","birthdate":"1975-03-19","Gender":"M","rowguid":"E87029AA-2CBA-4C03-B948-D83AF0313E28","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Don L Hall","birthdate":"1971-06-13","Gender":"M","rowguid":"E720053D-922E-4C91-B81A-A1CA4EF8BB0E","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Tom M Vande Velde","birthdate":"1986-10-01","Gender":"M","rowguid":"B3BF7FC5-2014-48CE-B7BB-76124FA8446C","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Tengiz N Kharatishvili","birthdate":"1990-04-28","Gender":"M","rowguid":"C609B3B2-7969-410C-934C-62C34B63C4EE","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Mikael Q Sandberg","birthdate":"1984-08-17","Gender":"M","rowguid":"D0FD55FF-42FA-491E-8B3B-AB3316018909","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Ben T Miller","birthdate":"1973-06-03","Gender":"M","rowguid":"B9641CAE-765C-4662-B760-C167A1F2B8B5","ModifiedDate":"2014-06-30T00:00:00"}]

We can check this lot out easily enough, though the code might raise a few eyebrows. First, we create a function. It starts by checking, just as the previous one did, that it has been given a valid JSON document. Then it makes sure that there is the correct number of key/value pairs in each object of the array, corresponding to the number of columns in the relational version of the table. It then creates a table variable carrying all the SQL Server constraints, including NULL/NOT NULL, that you would apply if the JSON table were a relational table, along with the correct SQL Server datatypes, and inserts the data into it. If a constraint fires, or if a value can’t be coerced into its datatype, the function ‘errors out’, and the error surfaces at the point at which the JSON was being inserted into the real table, aborting the insertion.

In effect, you’ve done a dress rehearsal of inserting that data into a relational table, with all the constraints that are available to you.

CREATE OR ALTER FUNCTION [dbo].[IsSalesPeopleJSONValid]
/**
Summary: >
  this routine takes a JSON document of sales people, and returns 0 if any 
  of them is invalid. It returns 1 if the entire document is valid
Author: phil factor
Date: 14/12/2018
Examples:
   - Select dbo.IsSalesPeopleJSONValid('
     [{"SalesPerson":"Suroor R Fatima","birthdate":"1978-02-25",
         "Gender":"M","rowguid":"14010B0E-C101-4E41-B788-21923399E512",
         "ModifiedDate":"2014-06-30T00:00:00"}]')
   - Select dbo.IsSalesPeopleJSONValid('
     [{"SalesPerson":"Suroor R Fatima","birthdate":"1878-02-25",
         "Gender":"M","rowguid":"14010B0E-C101-4E41-B788-21923399E512",
         "ModifiedDate":"2014-06-30T00:00:00"}]')
Returns: >
  an integer: 1 if the JSON document of sales people is valid
**/
(
    @SalesPeopleJSON NVARCHAR(MAX)
)
RETURNS INT
AS
BEGIN
IF IsJson(@SalesPeopleJSON)=0 RETURN -1 --is this valid JSON?
--Does it have extra key/value pairs squirrelled away anywhere?
IF EXISTS (SELECT Count(*), doc.[key] FROM OpenJson(@SalesPeopleJSON) doc
CROSS APPLY OpenJson([value]) row
GROUP BY doc.[key]
HAVING Count(*)<>5) RETURN 0 --there must be exactly five key/value pairs in each object

DECLARE @SalesPeople TABLE ( --create a table variable with check constraints
        SalesPerson NVARCHAR(40) NOT NULL,
        BirthDate DATE NOT NULL  CHECK (Year(BirthDate) BETWEEN 1900 AND YEAR(GETDATE())-16),
        Gender  CHAR(1) NOT null CHECK  ((Gender LIKE '[MFZmfz]')),
        RowGUID  UNIQUEIDENTIFIER NOT null,
        ModifiedDate DATETIME2 NOT null)
-- now insert a tabular version of the JSON document into the table variable
INSERT INTO @SalesPeople(SalesPerson,BirthDate, Gender, Rowguid, ModifiedDate)
SELECT SalesPerson,BirthDate, Gender,RowGUID,ModifiedDate from OPENJSON(@SalesPeopleJSON)
WITH( SalesPerson NVARCHAR(40),birthdate date, Gender CHAR(1),rowguid UNIQUEIDENTIFIER, ModifiedDate Datetime2)
--if you got here without an error you've succeeded
RETURN 1
END

With this function, we can now insert JSON data into the table without cringing, because it will all be checked. It is now up to us to get those constraint definitions right! First, we create the test table, leaving out everything except what is relevant to the demonstration.

/* create a test table that uses this constraint */
IF Object_Id('Notes') IS NOT NULL
        DROP TABLE Notes
CREATE TABLE  Notes (
  Note_id INT IDENTITY,
  Customer INT NOT null,
  Note NVARCHAR(MAX),
  SalesPeopleInvolved NVARCHAR(MAX)
  CONSTRAINT for_Valid_SalesPeopleList CHECK (dbo.IsSalesPeopleJSONValid(SalesPeopleInvolved)=1),
  TheDate Date NULL DEFAULT getdate());

Now we insert a row:

INSERT INTO Notes(customer,Note,SalesPeopleInvolved)
SELECT 354, 'Did not reply to our second payment request',
'[{"SalesPerson":"John T Campbell","birthdate":"1956-08-07",
"Gender":"M","rowguid":"D4ED1F78-7C28-479B-BFEF-A73228BA2AAA",
"ModifiedDate":"2014-06-30T00:00:00"}]'

OK. That went well.

INSERT INTO Notes(customer,Note,SalesPeopleInvolved)
SELECT 3234, 'Now happy with our product',
'[{"SalesPerson":"Ben T Miller","birthdate":"1973-06-03",
"Gender":"B","rowguid":"B9641CAE-765C-4662-B760-C167A1F2B8B5",
"ModifiedDate":"2014-06-30T00:00:00"}]'

Oooh! An error! It has spotted that Ben Miller has a Gender of ‘B’ (the column name appears, truncated to ‘Gende’, in the auto-generated constraint name):

/*
Msg 547, Level 16, State 0, Line 229
The INSERT statement conflicted with the CHECK constraint "CK__#A23D3581__Gende__A4257DF3". 
The conflict occurred in database "tempdb", table "@SalesPeople".
*/

So let’s try again, this time getting the gender right but the GUID (uniqueidentifier) wrong:

INSERT INTO Notes(customer,Note,SalesPeopleInvolved)
SELECT 3234, 'Now happy with our product',
'[{"SalesPerson":"Ben T Miller","birthdate":"1973-06-03",
"Gender":"M","rowguid":"B964-765C-4662-B760-C167A1F2B8B5",
"ModifiedDate":"2014-06-30T00:00:00"}]'

/* Msg 8169, Level 16, State 2, Line 235
Conversion failed when converting from a character string to uniqueidentifier.*/

Right. Now we insert a larger JSON document into the table:

INSERT INTO Notes(customer,Note,SalesPeopleInvolved)
SELECT 334, 'This customer has phoned the entire office up and reduced them to tears',
'[{"SalesPerson":"Suroor R Fatima","birthdate":"1976-02-25","Gender":"Z","rowguid":"14010B0E-C101-4E41-B788-21923399E512","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"John T Campbell","birthdate":"1956-08-07","Gender":"M","rowguid":"D4ED1F78-7C28-479B-BFEF-A73228BA2AAA","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Lori A Kane","birthdate":"1980-07-18","Gender":"F","rowguid":"23D436FC-08F7-4988-8B4D-490AA4E8B7E7","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Karen A Berg","birthdate":"1978-05-19","Gender":"F","rowguid":"45C3D0F5-3332-419D-AD40-A98996BB5531","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Alice O Ciccu","birthdate":"1978-01-26","Gender":"F","rowguid":"7E632B21-0D11-4BBA-8A68-8CAE14C20AE6","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Sandeep P Kaliyath","birthdate":"1970-12-03","Gender":"M","rowguid":"606C21E2-3EC0-48A6-A9FE-6BC8123AC786","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"David M Bradley","birthdate":"1975-03-19","Gender":"M","rowguid":"E87029AA-2CBA-4C03-B948-D83AF0313E28","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Don L Hall","birthdate":"1971-06-13","Gender":"M","rowguid":"E720053D-922E-4C91-B81A-A1CA4EF8BB0E","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Tom M Vande Velde","birthdate":"1986-10-01","Gender":"M","rowguid":"B3BF7FC5-2014-48CE-B7BB-76124FA8446C","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Tengiz N Kharatishvili","birthdate":"1990-04-28","Gender":"M","rowguid":"C609B3B2-7969-410C-934C-62C34B63C4EE","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Mikael Q Sandberg","birthdate":"1984-08-17","Gender":"M","rowguid":"D0FD55FF-42FA-491E-8B3B-AB3316018909","ModifiedDate":"2014-06-30T00:00:00"},{"SalesPerson":"Ben T Miller","birthdate":"1973-06-03","Gender":"M","rowguid":"B9641CAE-765C-4662-B760-C167A1F2B8B5","ModifiedDate":"2014-06-30T00:00:00"}]'

Good. All went well.

Without that test in the function to make sure there were just five key/value pairs in each object, someone could slip extra data into the database, perhaps data that should never go in there, such as customer personal data. We have prevented this from happening:

INSERT INTO Notes(customer,Note,SalesPeopleInvolved)
SELECT 3234, 'Now happy with our product',
'[{"SalesPerson":"Ben T Miller","birthdate":"1973-06-03",
"Gender":"M","rowguid":"B9641CAE-765C-4662-B760-C167A1F2B8B5",
"credit card":"4657-6758-4538-0987", "ModifiedDate":"2014-06-30T00:00:00"}]'

…which gives the result …

/*
Msg 547, Level 16, State 0, Line 298
The INSERT statement conflicted with the CHECK constraint "for_Valid_SalesPeopleList". The conflict occurred in database "business", table "dbo.Notes", column 'SalesPeopleInvolved'.
The statement has been terminated.
*/

Conclusion

As long as the JSON document being stored in a table is tabular in nature rather than hierarchical, we can deal with it in SQL Server. An array-in-array JSON document needs just light editing of the OPENJSON function call to access array elements rather than specifying the key. I’ve explained how to do this in another article.
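To give the flavour of that light editing, here is a hedged sketch: with array-in-array JSON, the WITH clause of OPENJSON addresses each value by its position in the inner array, using a JSON path such as '$[0]', rather than by its key. (The data here is invented for illustration.)

```sql
DECLARE @json NVARCHAR(MAX) =
  N'[["John T Campbell","1956-08-07"],["Lori A Kane","1980-07-18"]]';
SELECT SalesPerson, BirthDate
  FROM OPENJSON(@json)
  WITH (SalesPerson NVARCHAR(40) '$[0]', -- first element of each inner array
        BirthDate DATE '$[1]');          -- second element
```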

There are a lot of advantages in using the weapons you know, and have to hand, when dealing with bad data. Constraints and coercion are good simple ways of ensuring that the data is correct.

If the JSON is hierarchical, then we are generally forced to check it against a JSON Schema. I do this via PowerShell, so it can’t be done at the point of insertion. It also requires the developers to be organized enough to provide you with an up-to-date JSON Schema. I’ve explained in another article how one can open up a hierarchical JSON document and investigate the values. That method can be used if you need to keep the checks ‘in-house’, but it is slow to debug and will need to be maintained if the JSON Schema changes as development proceeds.

The post Constraining and checking JSON Data in SQL Server Tables appeared first on Simple Talk.


