From SQL Server 2017, it becomes more practical to use JSON arrays for representing and processing a matrix. SQL Server can read them, and update values in them but can’t create them. To do this, you have to create the JSON as a string. The great advantage of them is that you can pass them between procedures and functions as easily as any other string, and quickly turn them into tables.
For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems, using Dynamic Programming. Once you get into the mindset of this sort of technique, a number of seemingly-intractable problems become easier. Here are fifty common data structure problems that can be solved using Dynamic programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming. Memoization, one of the principles behind the technique is easy to do in SQL but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables as I did with Edit Distance and the Levenshtein algorithm, the longest common subsequence, and the Longest Common Substring. The problem with doing this is that the code to fetch the array values can be very difficult to decypher or debug. JSON can do it very easily with path array references.
Would it be possible to use JSON arrays to solve one of these problems? If so, is it much slower? I thought it might be interesting to convert the Lowest Common Subsequence problem into a json-based form and run over a few tests back-to-back. The conclusion was, for those with the TLDR habit, was that it took twice to three times as long to run, but produced code that was easier to write, understand and debug. I suspect there are ways and means to make it faster.
IF Object_Id(N'LCS') IS NOT NULL DROP FUNCTION LCS; GO CREATE FUNCTION LCS /** summary: > The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in two sequences. It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. For example, the sequences "1234" and "1224533324" have an LCS of "1234": Author: Phil Factor Revision: 1.0 date: 05 April 2019 example: code: | Select dbo.lcs ('1234', '1224533324') Select dbo.lcs ('thisisatest', 'testing123testing') Select dbo.lcs ( 'XMJYAUZ', 'MZJAWXU') Select dbo.lcs ( 'beginning-middle-ending', 'beginning-diddle-dum-ending') returns: > the longest common subsequence as a string **/ (@xString VARCHAR(MAX), @yString VARCHAR(MAX)) RETURNS VARCHAR(MAX) AS BEGIN DECLARE @ii INT = 1; --inner index DECLARE @jj INT = 1; --next loop index DECLARE @West INT; --array reference number to left DECLARE @NorthWest INT; --array reference previous left DECLARE @North INT; --array reference previous DECLARE @Max INT; --holds the maximum of two values DECLARE @Current INT; --current number of matches DECLARE @Matrix NVARCHAR(MAX); DECLARE @PreviousRow NVARCHAR(2000); -- the previous matrix row DECLARE @JSON NVARCHAR(4000); --json work variable DECLARE @Numbers TABLE (jj INT); -- SQL Prompt formatting off INSERT INTO @numbers(jj) --this is designed for words of max 40 characters VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15), (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28), (29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40) -- SQL Prompt formatting on --the to start with, the first row is all zeros. SELECT @PreviousRow = N'[' + Replicate('0,', Len(@xString) + 1) + N'"' + Substring(@yString, 1, 1) + N'"]'; SELECT @Matrix = @PreviousRow;--add this to the matrix /* we now build the matrix in bottom up fashion. */ WHILE (@ii <= Len(@yString)) BEGIN SELECT @West = 0, @JSON = NULL; --now create a row in just one query SELECT @NorthWest = Json_Value(@PreviousRow, '$[' + Cast(jj - 1 AS VARCHAR(5)) + ']'), @North = Json_Value(@PreviousRow, '$[' + Cast(jj AS VARCHAR(5)) + ']'), @Max = CASE WHEN @West > @North THEN @West ELSE @North END, @Current = CASE WHEN Substring(@xString, jj, 1) = Substring(@yString, @ii, 1) THEN @NorthWest + 1 ELSE @Max END, @JSON = Coalesce(@JSON + ',', '[0,') + Coalesce(Cast(@Current AS VARCHAR(5)), 'null'), @West = @Current FROM @Numbers AS f WHERE f.jj <= Len(@xString); --and store the result as the previous row SELECT @PreviousRow = @JSON + N',"' + Substring(@yString, @ii, 1) + N'"]'; --and add the reow to the matrix SELECT @Matrix = Coalesce(@Matrix + ', ', '') + @PreviousRow, @ii = @ii + 1; END; --we add the boundong brackets. SELECT @Matrix = N'[' + @Matrix + N']'; SELECT @ii = Len(@yString), @jj = Len(@xString); DECLARE @previousColScore INT, @PreviousRowScore INT, @Ychar NCHAR; DECLARE @Subsequence NVARCHAR(4000) = ''; WHILE (@Current > 0) BEGIN SELECT @Ychar = Substring(@yString, @ii, 1); IF (@Ychar = Substring(@xString, @jj, 1)) -- If current character in X[] and Y[] are same, then it is part of LCS SELECT @ii = @ii - 1, @jj = @jj - 1, @Subsequence = @Ychar + @Subsequence, @Current = @Current - 1; ELSE --If not same, then find the larger of two and traverse in that direction BEGIN --find out the two scores, one to the north and one to the west SELECT @PreviousRowScore = Json_Value( @Matrix, 'strict $[' + Convert(VARCHAR(5), @ii - 1) + '][' + Convert(VARCHAR(5), @jj) + ']' ), @previousColScore = Json_Value( @Matrix, 'strict $[' + Convert(VARCHAR(5), @ii) + '][' + Convert(VARCHAR(5), @jj - 1) + ']' ); --either go north or west IF @PreviousRowScore < @previousColScore SELECT @jj = @jj - 1; ELSE SELECT @ii = @ii - 1; END; END; RETURN @Subsequence; END; GO -- Now we do a quick test and timing with the old version DECLARE @timing DATETIME; SELECT @timing = GetDate(); IF dbo.LongestCommonSubsequence('1234', '1224533324') <> '1234' RAISERROR('test 1 failed', 16, 1); IF dbo.LongestCommonSubsequence('thisisatest', 'testing123testing') <> 'tsitest' RAISERROR('test 2 failed', 16, 1); IF dbo.LongestCommonSubsequence('Patient', 'Complaint') <> 'Paint' RAISERROR('test 3 failed', 16, 1); IF dbo.LongestCommonSubsequence('XMJYAUZ', 'MZJAWXU') <> 'MJAU' RAISERROR('test 4 failed', 16, 1); IF dbo.LongestCommonSubsequence('yab', 'xabyrbyab') <> 'yab' RAISERROR( 'test 5 failed', 16, 1 ); IF dbo.LongestCommonSubsequence( 'beginning-middle-ending', 'beginning-diddle-dum-ending' ) <> 'beginning-iddle-ending' RAISERROR('test 6 failed', 16, 1); SELECT DateDiff(MILLISECOND, @timing, GetDate()) AS [ms FOR traditional way]; --now do the same test run with the current function SELECT @timing = GetDate(); IF dbo.LCS('1234', '1224533324') <> '1234' RAISERROR('test 1 failed', 16, 1); IF dbo.LCS('thisisatest', 'testing123testing') <> 'tsitest' RAISERROR( 'test 2 failed', 16, 1 ); IF dbo.LCS('Patient', 'Complaint') <> 'Paint' RAISERROR('test 3 failed', 16, 1); IF dbo.LCS('XMJYAUZ', 'MZJAWXU') <> 'MJAU' RAISERROR('test 4 failed', 16, 1); IF dbo.LCS('yab', 'xabyrbyab') <> 'yab' RAISERROR('test 5 failed', 16, 1); IF dbo.LCS('beginning-middle-ending', 'beginning-diddle-dum-ending') <> 'beginning-iddle-ending' RAISERROR('test 6 failed', 16, 1); SELECT DateDiff(MILLISECOND, @timing, GetDate()) AS [ms FOR JSON-based] ;
This returns …
ms FOR traditional way ---------------------- 10 (1 row affected) ms FOR JSON-based ----------------- 30 (1 row affected)
These tests form part of the build script for the procedure to try to make sure that as few as possible mistakes are left in!
Obviously, I’d like a bit more speed in the JSON querying but it is acceptable unless one is doing a lot of this sort of querying. I’m happy to us JSON for doing this because it is quicker to get things up-and-running. In my article ‘Doing Fuzzy Searches in SQL Server‘, I show how it is possible to cut down on the amount of searches by filtering the likely candidates with conventional SQL Commands first.
The post Using JSON for matrices in SQL Server. appeared first on Simple Talk.
from Simple Talk http://bit.ly/2G1MIy8
via
No comments:
Post a Comment