Friday, November 16, 2018

How to validate JSON Data before you import it into a database.

If you are, as you should be, checking JSON data in a whole lot of files before you import them into your database, you would do well to use JSON Schema, because you can run a number of checks such as regex checks that can’t be done any other way, and it is usually possible to detect bad data

Obviously, the most immediate value that the relational database person gets from doing this chore is to check that all the required (not null) columns are there for every row, and that they have the right sort of data. This can be done in very general terms if you use the ‘type’ field, but far more precisely if you are handy with PCRE regex.

In a nutshell, JSON Schema defines the way that you’ve structured your JSON document, the data types and constraints upon that data. It is extensible in that you can add your own special-purpose fields. It is ideal for tabular data because it can define how you represent columns, and what their names are. It can also enforce the structure so that tables and grids are safe within a JSON document.

If you have a very small amount of data, you can do the validation in an online JSON Validator such as the NewtonSoft JSON Schema Validator at https://www.jsonschemavalidator.net/.

This is currently the best way of building up a JSON Schema, and trying out the features of the validator, because you can build and test the rules you create against a subset of the table data.

Validation runs are a different problem Obviously, doing this in an online app will soon become tiresome, even with the smallest database. Imagine having to do this with a regular data feed or REST webservice! It is, however, possible to automate this.

There are many different ways of automating the validation process, using a range of platforms and frameworks. As I’m mostly using PowerShell for database scripting, I do it that way. I use NewtonSoft’s JSON.NET validator.

Imagine that you need to validate the aging classic database ‘Pubs’.

You have a data directory

You have a directory with the JSON data, one file per table, named after the table.

You also have a sibling directory with the schemas. (I’ve shown elsewhere how to generate the basic schemas from a SQL Server database, ready for your refinements.

I’m not suggesting that this is ideal, merely the way that I’ve set up the demo.

Now we can validate the two.

This is done here as a separate script, but it is likely to be part of an ingestion process. Your schemas will probably stay there, saved in source control for each data feed.

$ErrorActionPreference = "Stop"
# enter the base directory 
$Path = 'S:\work\programs\SQL\ScriptsDirectory\PentlowMillServ\pubs\Data'
# ...and the names of the subdirectories
$SchemaDirectory = 'JSONSchema\'
$DataDirectory = 'JSONData\'
# all this following section thanks to James Newton-King
$NewtonsoftJsonPath = Resolve-Path -Path "lib\Newtonsoft.Json.dll"
$NewtonsoftJsonSchemaPath = Resolve-Path -Path "lib\Newtonsoft.Json.Schema.dll"

Add-Type -Path $NewtonsoftJsonPath
Add-Type -Path $NewtonsoftJsonSchemaPath


# define the validator type
$source = @'
    public class Validator
    {
        public static System.Collections.Generic.IList<string> Validate(Newtonsoft.Json.Linq.JToken token, Newtonsoft.Json.Schema.JSchema schema)
        {
            System.Collections.Generic.IList<string> messages;
            Newtonsoft.Json.Schema.SchemaExtensions.IsValid(token, schema, out messages);
            return messages;
        }
    }
'@
Add-Type -TypeDefinition $source -ReferencedAssemblies $NewtonsoftJsonPath, $NewtonsoftJsonSchemaPath
# end of James Newton-King's code. Thanks, James.

Get-ChildItem "$($Path)\$($DataDirectory)" -Filter *.json | select Name | Foreach{
# do every file in the directory        
        $JSON = [IO.File]::ReadAllText("$($Path)\$($DataDirectory)$($_.Name)")
        $Schema = [IO.File]::ReadAllText("$($Path)\$($SchemaDirectory)$($_.Name)")
# parse the JSON files documents into a tokenised form
        $Token = [Newtonsoft.Json.Linq.JToken]::Parse($JSON)
        $Schema = [Newtonsoft.Json.Schema.JSchema]::Parse($Schema)
# do the validation, using the parsed documents 
        $ErrorMessages = [Validator]::Validate($Token, $Schema)
        if ($ErrorMessages.Count -eq 0)
        { write-host "Schema is valid" } #just for the test. I don't approve of write-host!
        else #I've selected just the first five because usually a whole column is wrong!
        { $ErrorMessages | Select-Object -First 5 | foreach{ write-warning $_ } }

This isn’t the only way of doing this. There are plenty of validators, but the JSON.NET validator is a good place to start if you are already heavily into PowerShell. For more details of JSON Schema, see http://json-schema.org/

The post How to validate JSON Data before you import it into a database. appeared first on Simple Talk.



from Simple Talk https://ift.tt/2QI78QU
via

No comments:

Post a Comment