Tuesday, April 9, 2019

Using JSON for matrices in SQL Server.

From SQL Server 2017, it becomes more practical to use JSON arrays for representing and processing a matrix. SQL Server can read them, and update values in them but can’t create them. To do this, you have to create the JSON as a string. The great advantage of them is that you can pass them between procedures and functions as easily as any other string, and quickly turn them into tables.

For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems, using Dynamic Programming. Once you get into the mindset of this sort of technique, a number of seemingly-intractable problems become easier.  Here are fifty common data structure problems that can be solved using Dynamic programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming.  Memoization, one of the principles behind the technique is easy to do in SQL but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables as I did  with Edit Distance and the Levenshtein algorithmthe longest common subsequence, and  the Longest Common Substring. The problem with doing this is that the code to fetch the array values can be very difficult to decypher or debug. JSON can do it very easily with path array references.

Would it be possible to use JSON arrays to solve one of these problems? If so, is it much slower? I thought it might be interesting to convert the Lowest Common Subsequence problem into a json-based form and run over a few tests back-to-back. The conclusion was, for those with the TLDR habit, was that it took twice to three times as long to run, but produced code that was easier to write, understand and debug.  I suspect there are ways and means to make it faster.

IF Object_Id(N'LCS') IS NOT NULL DROP FUNCTION LCS;
GO
CREATE FUNCTION LCS
  /**
summary:   >
 The longest common subsequence (LCS) problem is the problem of finding the
 longest subsequence common to all sequences in two sequences. It differs
 from problems of finding common substrings: unlike substrings, subsequences
 are not required to occupy consecutive positions within the original
 sequences. For example, the sequences "1234" and "1224533324" have an LCS
 of "1234":
Author: Phil Factor
Revision: 1.0
date: 05 April 2019
example:
 code: |
     Select dbo.lcs ('1234', '1224533324')
     Select dbo.lcs ('thisisatest', 'testing123testing')
     Select dbo.lcs ( 'XMJYAUZ', 'MZJAWXU') 
     Select dbo.lcs ( 'beginning-middle-ending',
       'beginning-diddle-dum-ending')
returns:   >
  the longest common subsequence as a string
**/
  (@xString VARCHAR(MAX), @yString VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
  BEGIN

    DECLARE @ii INT = 1; --inner index
    DECLARE @jj INT = 1; --next loop index
    DECLARE @West INT; --array reference number to left
    DECLARE @NorthWest INT; --array reference previous left
    DECLARE @North INT; --array reference previous
    DECLARE @Max INT; --holds the maximum of two values
    DECLARE @Current INT; --current number of matches
    DECLARE @Matrix NVARCHAR(MAX);
    DECLARE @PreviousRow NVARCHAR(2000); -- the previous matrix row
    DECLARE @JSON NVARCHAR(4000); --json work variable
    DECLARE @Numbers TABLE (jj INT);
-- SQL Prompt formatting off
INSERT INTO @numbers(jj) --this is designed for words of max 40 characters
VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),
      (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),
          (29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40)
-- SQL Prompt formatting on
--the to start with, the first row is all zeros.
    SELECT @PreviousRow =
      N'[' + Replicate('0,', Len(@xString) + 1) + N'"'
      + Substring(@yString, 1, 1) + N'"]';
    SELECT @Matrix = @PreviousRow;--add this to the matrix
        /* we now build the matrix in bottom up fashion.  */
    WHILE (@ii <= Len(@yString))
      BEGIN
        SELECT @West = 0, @JSON = NULL;
                --now create a row in just one query
        SELECT @NorthWest =
          Json_Value(@PreviousRow, '$[' + Cast(jj - 1 AS VARCHAR(5)) + ']'),
          @North =
            Json_Value(@PreviousRow, '$[' + Cast(jj AS VARCHAR(5)) + ']'),
          @Max = CASE WHEN @West > @North THEN @West ELSE @North END,
          @Current =
            CASE WHEN Substring(@xString, jj, 1) = Substring(@yString, @ii, 1) THEN
                   @NorthWest + 1 ELSE @Max END,
          @JSON =
            Coalesce(@JSON + ',', '[0,')
            + Coalesce(Cast(@Current AS VARCHAR(5)), 'null'), @West = @Current
          FROM @Numbers AS f
          WHERE f.jj <= Len(@xString);
                  --and store the result as the previous row
        SELECT @PreviousRow =
               @JSON + N',"' + Substring(@yString, @ii, 1) + N'"]';
          --and add the reow to the matrix
        SELECT @Matrix = Coalesce(@Matrix + ',
                       ', '') + @PreviousRow, @ii = @ii + 1;
      END;
    --we add the boundong brackets.
    SELECT @Matrix = N'[' + @Matrix + N']';
    SELECT @ii = Len(@yString), @jj = Len(@xString);
    DECLARE @previousColScore INT, @PreviousRowScore INT, @Ychar NCHAR;
    DECLARE @Subsequence NVARCHAR(4000) = '';
    WHILE (@Current > 0)
      BEGIN
        SELECT @Ychar = Substring(@yString, @ii, 1);
        IF (@Ychar = Substring(@xString, @jj, 1))
-- If current character in X[] and Y[] are same, then it is part of LCS
          SELECT @ii = @ii - 1, @jj = @jj - 1,
            @Subsequence = @Ychar + @Subsequence, @Current = @Current - 1;
        ELSE
--If not same, then find the larger of two and traverse in that direction 
          BEGIN
                    --find out the two scores, one to the north and one to the west
            SELECT @PreviousRowScore =
              Json_Value(
                          @Matrix,
                          'strict $[' + Convert(VARCHAR(5), @ii - 1) + ']['
                          + Convert(VARCHAR(5), @jj) + ']'
                        ),
              @previousColScore =
                Json_Value(
                            @Matrix,
                            'strict $[' + Convert(VARCHAR(5), @ii) + ']['
                            + Convert(VARCHAR(5), @jj - 1) + ']'
                          );
           --either go north or west
            IF @PreviousRowScore < @previousColScore SELECT @jj = @jj - 1;
            ELSE SELECT @ii = @ii - 1;
          END;
      END;
    RETURN @Subsequence;
  END;
GO
-- Now we do a quick test and timing with the old version
DECLARE @timing DATETIME;
SELECT @timing = GetDate();

IF dbo.LongestCommonSubsequence('1234', '1224533324') <> '1234'
  RAISERROR('test 1 failed', 16, 1);
IF dbo.LongestCommonSubsequence('thisisatest', 'testing123testing') <> 'tsitest'
  RAISERROR('test 2 failed', 16, 1);
IF dbo.LongestCommonSubsequence('Patient', 'Complaint') <> 'Paint'
  RAISERROR('test 3 failed', 16, 1);
IF dbo.LongestCommonSubsequence('XMJYAUZ', 'MZJAWXU') <> 'MJAU'
  RAISERROR('test 4 failed', 16, 1);
IF dbo.LongestCommonSubsequence('yab', 'xabyrbyab') <> 'yab' RAISERROR(
'test 5 failed', 16, 1
);
IF dbo.LongestCommonSubsequence(
'beginning-middle-ending', 'beginning-diddle-dum-ending'
) <> 'beginning-iddle-ending'
  RAISERROR('test 6 failed', 16, 1);

SELECT DateDiff(MILLISECOND, @timing, GetDate()) AS [ms FOR traditional way];
--now do the same test run with the current function
SELECT @timing = GetDate();

IF dbo.LCS('1234', '1224533324') <> '1234' RAISERROR('test 1 failed', 16, 1);
IF dbo.LCS('thisisatest', 'testing123testing') <> 'tsitest' RAISERROR(
'test 2 failed', 16, 1
);
IF dbo.LCS('Patient', 'Complaint') <> 'Paint'
  RAISERROR('test 3 failed', 16, 1);
IF dbo.LCS('XMJYAUZ', 'MZJAWXU') <> 'MJAU' RAISERROR('test 4 failed', 16, 1);
IF dbo.LCS('yab', 'xabyrbyab') <> 'yab' RAISERROR('test 5 failed', 16, 1);
IF dbo.LCS('beginning-middle-ending', 'beginning-diddle-dum-ending') <> 'beginning-iddle-ending'
  RAISERROR('test 6 failed', 16, 1);

SELECT DateDiff(MILLISECOND, @timing, GetDate()) AS [ms FOR JSON-based] ;

This returns …

ms FOR traditional way
----------------------
10

(1 row affected)

ms FOR JSON-based
-----------------
30

(1 row affected)

 

These tests form part of the build script for the procedure to try to make sure that as few as possible mistakes are left in! 

Obviously, I’d like a bit more speed in the JSON querying but it is acceptable unless one is doing a lot of this sort of querying. I’m happy to us JSON for doing this because it is quicker to get things up-and-running. In my article ‘Doing Fuzzy Searches in SQL Server‘, I show how it is possible to cut down on the amount of searches by filtering the likely candidates with conventional SQL Commands first.

 

 

The post Using JSON for matrices in SQL Server. appeared first on Simple Talk.



from Simple Talk http://bit.ly/2G1MIy8
via

No comments:

Post a Comment