Friday, June 21, 2019

A MongoDB to SQL Server Migration: Data Typing and Un-nesting

When I was recently assigned a project to migrate data from MongoDB to SQL Server, my initial concern was my lack of expertise with MongoDB itself. Having completed the project, I can say that MongoDB was the absolute least of my worries and certainly not a challenge. In fact, the ultimate solution bypassed the MongoDB service altogether and used a combination of a MongoDB dump file and the native command-line tool bsondump to convert the dump files from BSON (Binary JSON) to JSON for direct load into SQL Server tables. I was fortunate to avoid the C# coding required for that initial load step, which pulled the data into staging tables whose columns were all NVARCHAR(MAX). (I have a very smart colleague, Lou Talarico, who noodled through that chore, and there are references at the end of this article for further information.) My tasks were comparatively simple: all I had to do was generate CREATE TABLE scripts with correct data types from the data in the staging tables and transpose the delimited array data stored in many of the columns. In this article, I will provide the sample data and code that accomplish these two tasks.

The MongoDB Data

Most of us have had it drilled into our SQL brains, for good reason, that while NVARCHAR(MAX) has its place, it is not ideal for every column in every table. There is plenty of supporting evidence, which you can find with a quick search, showing that the query optimizer does not always play nice when generating an execution plan for these large data types. While it was the path of least resistance to use this data type to populate the SQL Server staging tables from the Extended JSON files MongoDB produced, it would not have been ideal for the final data types that the analysts would be writing queries against.

I decided to try to generate the code to create the final tables using a combination of metadata and a sampling of the actual data in the staging tables. I would need to do this for each column, pulling in any non-null values to determine what data type it should be, including its maximum length. Before reviewing the code that generates the final CREATE TABLE statements, take a look at some sample data so that you may understand the challenge.

Listing 1: Create and Populate Sample Table, doc_staging

CREATE TABLE [dbo].[doc_staging](
        [ID] [nvarchar](max) NULL,
        [CreateDate] [nvarchar](max) NULL,
        [mversion2] [nvarchar](max) NULL,
        [Version] [nvarchar](max) NULL,
        [doc_id] [nvarchar](max) NULL,
        [doc_type] [nvarchar](max) NULL,
        [doc_requirement] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'7E3BB454-80B1-406D-9D8B-7223A412CEF8', N'4/10/2017 8:47:17 PM'
    , N'11.72', N'17', N'63||93||28||50||38', N'PDF||PDF||PDF||PDF||PDF'
    , N'True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2],
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'6BA95F5A-9A92-41AF-AD1E-588AEA39600C', N'3/11/2017 9:54:00 PM'
    , N'7.59', N'11', N'16||33||79||85||11', N'PDF||PDF||PDF||PDF||PDF' 
    , N'True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'567FCEA7-F0E0-473B-94EA-127923FA0D36', N'3/14/2017 1:56:22 PM'
    , N'10.34', N'15', N'72||30||9||82||79||81||37'
    , N'PDF||PDF||PDF||PDF||PDF||PDF||PDF'
    , N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'236FBDFD-857D-46E1-89EC-8A6911736145', N'3/4/2016 4:15:04 AM'
    , N'19.31', N'28', N'35||42||16||45||13||91||13'
    , N'PDF||PDF||PDF||PDF||PDF||PDF||MS Word'
    , N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'EEBB9163-B88F-44F1-9F9F-ABEF0418EF78', N'3/31/2017 3:45:31 AM'
    , N'4.83', N'7', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'630644C3-6001-4B32-AF45-3C7EA6F0CA70', N'6/6/2016 1:30:11 AM'
    , N'5.52', N'8', N'63||18||92||54', N'PDF||PDF||PDF||PDF'
    , N'True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'46998AA0-CCEA-41C5-9482-E94384FB11B0', N'8/11/2016 3:58:40 AM'
    , N'13.79', N'20', N'57', N'MS Word', N'True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'695523CE-6DA4-4228-848C-3D92096AE18B', N'3/16/2017 3:27:06 PM'
    , N'13.10', N'19', N'95||22||12||81||86||94||67'
    , N'PDF||PDF||PDF||PDF||PDF||PDF||PDF'
    , N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'C7802FCC-95D0-48D6-BC02-DE0704D64F45', N'5/12/2016 2:32:26 AM'
    , N'2.76', N'4', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'80D248F3-A6DC-440E-8845-3413E4ADDB88', N'5/24/2016 2:33:25 PM'
    , N'2.07', N'3', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'8FCE0A51-770B-43AF-997E-7B0D28236372', N'3/9/2017 9:57:52 PM'
    , N'11.72', N'17', N'98||90||89||2||97||50'
    , N'PDF||PDF||PDF||MS Word||MS Word||PDF'
    , N'True||True||True||False||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], 
    [Version], [doc_id], [doc_type], [doc_requirement]) 
VALUES (N'29318D29-1FB1-4C91-B7A3-76A6A8185497', N'5/23/2016 8:46:08 PM'
    , N'8.97', N'13', N'43||79||51||36', N'PDF||PDF||PDF||MS Word'
    , N'True||True||True||True')

The above script creates the table and inserts 12 sample records. You can see that all of the columns use nvarchar(max) as the data type. The data itself contains values of common types that will need to be converted, such as int, datetime, and decimal.

Several fields are double-pipe ("||") delimited strings, each containing concatenated values. For example, doc_id in the first row has the value N'63||93||28||50||38'. In MongoDB, these were nested records in the collection, and the conversion brought them over to the staging table as one flat record. Further, the doc_id field is ordinally related to the doc_type and doc_requirement fields. The requirement is to transpose these values into individual rows for the final conversion. But first you have to build the ultimate, properly data typed "production" table.

The Solution

The following code builds the CREATE TABLE script using a simple iterative technique (thankfully not cursors): for each column, gleaned from the metadata in INFORMATION_SCHEMA.COLUMNS, it reads the non-null values and tries to ascertain their data type. It is limited to a small set of data types (datetime, int, decimal, and character-based values), but I have found it to be reliable and fast. Listing 2 shows the code in its entirety, commented to explain the process flow.

Listing 2: The code to create a CREATE TABLE script

SET nocount ON; 
--DECLARE variables and tables
DECLARE @Src NVARCHAR(max) 
DECLARE @Col NVARCHAR(max); 
DECLARE @SQL NVARCHAR(max); 
DECLARE @SQLCUR NVARCHAR(max); 
DECLARE @SQL_CASE NVARCHAR(max); 
DECLARE @maxlen INT; 
DECLARE @DB_Col TABLE 
  ( 
     rownum      INT, 
     column_name NVARCHAR(max) 
  ); 
DECLARE @Tablesrc NVARCHAR(max) 
DECLARE @MAX_Rec       INT, 
        @CURRENTRECORD INT; 
DECLARE @MAX_RecTAB       INT, 
        @CURRENTRECORDTAB INT; 
DECLARE @DT NVARCHAR(max) 
CREATE TABLE #temp 
  ( 
     #temp NVARCHAR(max) 
  ) 
CREATE TABLE #temp2 
  ( 
     #temp NVARCHAR(max) 
  ) 
--Create temp table #db_tab to hold table metadata 
--from INFORMATION_SCHEMA.TABLES
SELECT ROWNUM =Row_number() 
                 OVER ( 
                   ORDER BY table_name), 
       table_schema, 
       table_name 
INTO   #db_tab 
FROM   INFORMATION_SCHEMA.TABLES
--optional WHERE clause. If commented out, each user table in the 
--database will be returned
WHERE  table_name = 'doc_STAGING' 
--Set max record value for iteration through tables 
--(sample include only 1 table)
SET @CURRENTRECORDTAB = 1 
SET @MAX_RecTAB = (SELECT Max(rownum) 
                   FROM   #db_tab) 
--Iterate through each table                               
WHILE @CURRENTRECORDTAB <= @MAX_RecTAB 
  BEGIN 
      SELECT @Src = table_schema + '.' + table_name 
      FROM   #db_tab 
      WHERE  rownum = @CURRENTRECORDTAB 
--Remove "_staging" from the new table name
      SELECT @Tablesrc = Replace(table_name, '_STAGING', '') 
      FROM   #db_tab 
      WHERE  rownum = @CURRENTRECORDTAB 
--Populate next temp table variable "@DB_Col" with column metadata 
--from INFORMATION_SCHEMA.COLUMNS
      INSERT INTO @DB_Col 
      SELECT ROWNUM = Row_number() 
                        OVER ( 
                          ORDER BY ordinal_position), 
             column_name 
      FROM   INFORMATION_SCHEMA.COLUMNS 
      WHERE  table_schema + '.' + table_name = @Src
          
--Nested iteration for each column in the staging table
      SET @CURRENTRECORD = 1; 
--Begin building dynamic CREATE TABLE statement. DROP TABLE 
--can be optional
      --SET @SQL = N'DROP TABLE IF EXISTS ' + @Tablesrc + ';' 
      --           + Char(13) + Char(13) + '  CREATE TABLE dbo.' 
      --           + @Tablesrc + N' (' + Char(13); 
      SET @SQL =  '  CREATE TABLE dbo.' 
                 + @Tablesrc + N' (' + Char(13); 
--Get the max number of columns (ordinal_position field in 
--INFORMATION_SCHEMA.COLUMNS)
      SET @MAX_Rec = (SELECT Max(ordinal_position) 
                      FROM   INFORMATION_SCHEMA.COLUMNS 
                      WHERE  table_schema + '.' + table_name = @Src); 
--Start column iteration
      WHILE @CURRENTRECORD <= @MAX_Rec 
        BEGIN 
--Set current column name using rownum
            SELECT @Col = N'[' + column_name + N']' 
            FROM   @DB_Col 
            WHERE  rownum = @CURRENTRECORD; 
--Get the max length of all non null values for current column
            SET @SQLCUR = N'Select   max(len(' + @Col + ')) from ' 
                + @Src + ' where ' + @Col + ' is not null'; 
            INSERT #temp 
            EXEC (@SQLCUR);
            SET @maxlen = (SELECT TOP 1 * 
                           FROM   #temp); 
--Round max length to nearest 10
            SET @maxlen = Round(@maxlen, -1); 
--This does all of the data type guessing work and writing the
-- output to a second temp table #temp2
            SET @SQL_CASE = N'Select TOP 1 CASE WHEN Isdate(' + @Col 
                + N') = 1 AND ISNUMERIC(' + @Col 
                + N') <> 1 THEN ''DATETIME2'' WHEN ISNUMERIC(' + @Col 
                + N') = 1 and ' + @Col 
                + N' NOT LIKE ''%.%'' THEN ''INT'' WHEN ISNUMERIC(' + @Col 
                + N') = 1 and ' + @Col 
                + N' LIKE ''%.%'' THEN ''decimal(18,2)'' WHEN (isdate(' 
                + @Col + N') <> 1 and ISNUMERIC(' + @Col 
                + N') <> 1) AND len(' + @Col + N') > 0 AND ' 
                + Cast(@maxlen AS VARCHAR(5)) 
                + ' < 4000 THEN ''NVARCHAR(' 
                + Cast(@maxlen + 20 AS VARCHAR(5)) 
                + N')'' ELSE ''NVARCHAR(MAX)'' END From ' 
                + @Src + ' where ' + @Col + N' is not null'; 
            INSERT #temp2 
            EXEC (@SQL_CASE); 
--Get the data type (@DT) from #temp2
            SET @DT = (SELECT TOP 1 * 
                       FROM   #temp2); 
            SET @SQL = @SQL + N'      ' + @Col + N' ' 
                       + Isnull(@DT, 'NVARCHAR(255)') + N',' + Char(13); 
--We are finished with current column, let's delete these 
--and get new ones
            DELETE FROM #temp;
            DELETE FROM #temp2; 
            SET @CURRENTRECORD = @CURRENTRECORD + 1; 
        END; 
--Build final SQL statement used to print CREATE TABLE 
--statement (optional EXEC)
      SET @SQL = LEFT(@SQL, ( Len(@SQL) - 2 )) 
      SET @SQL = @SQL + Char(13) + ')' 
--Print the final CREATE TABLE script
      PRINT @SQL; 
      SET @CURRENTRECORDTAB = @CURRENTRECORDTAB + 1; 
  END 
DROP TABLE [#db_tab];
DROP TABLE [#temp]; 
DROP TABLE [#temp2];

The code itself simply iterates through one or more tables, gathering metadata for column names and ordinal positions, interrogates each column to determine its maximum length and likely data type, and then generates a CREATE TABLE script, padding the rounded maximum length of character-based columns by 20. This version of the code uses Unicode data types. For the small sample table, it immediately returns the following:

CREATE TABLE dbo.doc (
      [ID] NVARCHAR(60),
      [CreateDate] DATETIME2,
      [mversion2] decimal(18,2),
      [Version] INT,
      [doc_id] NVARCHAR(80),
      [doc_type] NVARCHAR(140),
      [doc_requirement] NVARCHAR(110)
)


Notice that the _staging suffix has been removed and the final table will be dbo.doc. This, of course, is easily modifiable.

In the real project, there were tens of thousands of rows and the code still performed surprisingly well.
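To see the type-guessing logic in isolation, here is roughly the probe query that Listing 2 builds and executes for the [Version] column. This is a simplified sketch: the character-length branch is omitted, and in the real code the length threshold comes from the separate max-length query.

SELECT TOP 1
       CASE
           WHEN ISDATE([Version]) = 1 AND ISNUMERIC([Version]) <> 1 THEN 'DATETIME2'
           WHEN ISNUMERIC([Version]) = 1 AND [Version] NOT LIKE '%.%' THEN 'INT'
           WHEN ISNUMERIC([Version]) = 1 AND [Version] LIKE '%.%' THEN 'decimal(18,2)'
           ELSE 'NVARCHAR(MAX)'
       END AS guessed_type
FROM dbo.doc_staging
WHERE [Version] IS NOT NULL;
-- Returns INT for the sample data, which is why [Version] is typed INT above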

After running the CREATE TABLE statement to create the empty base table, appropriately data typed, it is time to populate it from the same staging data. Recall the requirement to transpose the delimited values from the doc_id, doc_type, and doc_requirement columns so that each value becomes a separate row.

You will need to transpose this:

43||79||51||36

Into this:

43

79

51

36

And further, you will need to join these doc_ids to their precise doc_type and doc_requirement values.

Using the first row of the sample data as an example, the starting point is a single flat record with delimited values, and the final result should contain one row per delimited value. Notice that the doc_ids must keep their original ordinal positions rather than being sorted numerically, which is crucial because the values in doc_type and doc_requirement occupy the same positional location as the corresponding doc_id they relate to.
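For that first sample row (ID 7E3BB454-80B1-406D-9D8B-7223A412CEF8), the delimited columns start out like this:

doc_id          = 63||93||28||50||38
doc_type        = PDF||PDF||PDF||PDF||PDF
doc_requirement = True||True||True||True||True

and, after the transpose, should become these five rows:

doc_id  doc_type  doc_requirement
63      PDF       True
93      PDF       True
28      PDF       True
50      PDF       True
38      PDF       True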

Fortunately, there is a function in SQL Server 2016 and higher called STRING_SPLIT() that handles this kind of splitting of delimited values with ease and does so very efficiently. The trick to getting the values to line up is to cross apply the results of the STRING_SPLIT function against the base table and use the window function ROW_NUMBER to track each value's position.

The following listing shows the simple query to perform the work.

Listing 3: Splitting the field

SELECT *, arv.value AS doc_id_split, 
    Row_number() OVER (PARTITION BY id ORDER BY id ) AS RN 
FROM  [dbo].[doc_staging] 
CROSS apply String_split(Replace([doc_id], '||', '|'), N'|') arv

Notice that ROW_NUMBER is partitioned and ordered on the ID value of each row, which will return a running tally based on the number of delimited values that are cross applied.

You can then use the row number to join to the same cross-applied row numbers for doc_type and doc_requirement. Remember, each field to transpose has a variable number of delimited values; some rows may have five or more values, and others may have only one, but each row will contain the same number of doc_ids as there are doc_type and doc_requirement values.

It is also worth pointing out that the REPLACE function changes the double pipe to a single pipe character. Using the double pipe was a choice we made for more assurance that no actual data values would contain that combination of characters, but the downside is that the STRING_SPLIT function accepts only a single-character separator. Hence, the double pipe must be replaced with a single-character delimiter before splitting. The final code to return exactly what is needed for the sample data, all 46 new rows, is in Listing 4 below, which uses a CTE-based solution.
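As a quick stand-alone check of that REPLACE and STRING_SPLIT combination, splitting one literal doc_id value from the sample data returns one row per delimited value:

SELECT value
FROM STRING_SPLIT(REPLACE(N'43||79||51||36', '||', '|'), N'|');
-- Returns four rows: 43, 79, 51, 36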

Listing 4: CTE to transpose the data values from the delimited values

WITH split_cte
AS (SELECT *,
           arv.value AS doc_id_split,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
        CROSS APPLY STRING_SPLIT(REPLACE([doc_id], '||', '|'), N'|') 
        arv),
     split_cte2
AS (SELECT id,
           arv2.value AS doc_type,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
        CROSS APPLY STRING_SPLIT(REPLACE([doc_type], '||', '|'), N'|') 
          arv2),
     split_cte3
AS (SELECT id,
           arv3.value AS doc_requirement,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
        CROSS APPLY STRING_SPLIT(REPLACE([doc_requirement], '||', '|')
           , N'|') arv3)
--INSERT INTO dbo.doc
SELECT sc1.ID,
       sc1.CreateDate,
       sc1.mversion2,
       sc1.Version,
       --,sc1.doc_id
       sc1.doc_id_split AS doc_id,
       sc2.doc_type,
       sc3.doc_requirement
FROM split_cte sc1
    INNER JOIN
    (SELECT id, split_cte2.doc_type, split_cte2.RN FROM split_cte2) sc2
        ON sc1.id = sc2.id
           AND sc1.RN = sc2.RN
    INNER JOIN
    (
        SELECT id,
               split_cte3.doc_requirement,
               split_cte3.RN
        FROM split_cte3
    ) sc3
        ON sc1.id = sc3.id
           AND sc1.RN = sc3.RN;

Each CTE splits one of the delimited columns, and the CTEs are joined together in the outer query. Since the OVER clause for ROW_NUMBER is identical in each case, they join together perfectly on the ID and row number.

By using this code, you can easily insert the results into the newly created dbo.doc table. I have commented out that insert statement so that you can see the results first.
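Once you are satisfied with the output, uncomment the INSERT INTO dbo.doc line in Listing 4 and rerun it. A quick count then confirms that all of the transposed rows landed in the new table:

SELECT COUNT(*) AS rows_loaded
FROM dbo.doc;
-- Expected: 46 rows for the sample data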

Summary

I like the idea of having reusable code, even if it is only to pick out certain pieces of a larger solution. I believe the code I am sharing here, which certainly has room for improvement and extension, can be readily used for a variety of other purposes with slight modification. One such use would be to generate CREATE TABLE scripts for each table in an existing database; instead of deriving the data types from the data, you could simply reuse the existing ones, as sketched below. I will most likely go back to this code many times for future projects that may involve a REST API, for example, rather than a MongoDB conversion. I know, too, that I will find a need to transpose delimited data somewhere down the road. Even if you do not have a specific need for a MongoDB conversion, I hope that the techniques shown here will help you in your projects.
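As a rough sketch of that variation (my own illustrative query, not part of the original solution), the column definitions can be scripted directly from INFORMATION_SCHEMA.COLUMNS rather than guessed from the data. Numeric precision and scale are left out here for brevity:

SELECT '      [' + COLUMN_NAME + '] ' + UPPER(DATA_TYPE)
       + CASE
             WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN '(MAX)'
             WHEN CHARACTER_MAXIMUM_LENGTH IS NOT NULL
                  THEN '(' + CAST(CHARACTER_MAXIMUM_LENGTH AS VARCHAR(10)) + ')'
             ELSE ''
         END + ',' AS column_definition
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'doc_staging'
ORDER BY ORDINAL_POSITION;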

References

For further information on the JSON framework used for the C# code as well as the bsondump utility, please see the following links.

https://www.newtonsoft.com/json

https://docs.mongodb.com/manual/reference/program/bsondump/

It is worth pointing out that the documentation for bsondump says that it is a diagnostic tool, not a tool for data ingestion or other application use.

