When I was recently assigned a project to migrate data from MongoDB to SQL Server, my initial concern was my lack of expertise with MongoDB itself. Having completed the project, I can say that MongoDB turned out to be the least of my worries and certainly not a challenge. In fact, the solution that was ultimately implemented bypassed the MongoDB service altogether and used a combination of a MongoDB dump file and the native command line tool bsondump
to convert the dump files from BSON (Binary JSON) to JSON for direct load into SQL Server tables. I was fortunate to avoid the C# coding required for that initial load step, which pulled the data into staging tables whose columns were all NVARCHAR(MAX).
(I have a very smart colleague, Lou Talarico, who noodled through that chore, and I have included some references at the end of this article for further information.) My tasks were comparatively simple: all I had to do was generate CREATE TABLE
scripts with correct data types derived from the data in the staging tables, and transpose the delimited array data stored in many of the columns. In this article, I will provide the sample data and code that accomplish these two tasks.
The MongoDB Data
Most of us have had it drilled into our SQL brains, for good reason, that while NVARCHAR(MAX)
has its place, it is not ideal for every column in every table. There is plenty of supporting evidence that you can find with a quick search showing that the query optimizer does not always play nice when generating an execution plan for these large data types. While using this data type was the path of least resistance for populating the SQL Server staging tables from the Extended JSON files MongoDB produced, it would not have been ideal for the final data types that the analysts would be writing queries against.
I decided to try to generate the code to create the final tables using a combination of metadata and a sampling of the actual data in the staging tables. I would need to do this for each column, pulling in the non-null values to determine what data type it should be, including its maximum length. Before reviewing the code that generates the final CREATE TABLE
statements, take a look at some sample data so that you may understand the challenge.
CREATE TABLE [dbo].[doc_staging](
    [ID] [nvarchar](max) NULL,
    [CreateDate] [nvarchar](max) NULL,
    [mversion2] [nvarchar](max) NULL,
    [Version] [nvarchar](max) NULL,
    [doc_id] [nvarchar](max) NULL,
    [doc_type] [nvarchar](max) NULL,
    [doc_requirement] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'7E3BB454-80B1-406D-9D8B-7223A412CEF8', N'4/10/2017 8:47:17 PM', N'11.72', N'17',
        N'63||93||28||50||38', N'PDF||PDF||PDF||PDF||PDF', N'True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'6BA95F5A-9A92-41AF-AD1E-588AEA39600C', N'3/11/2017 9:54:00 PM', N'7.59', N'11',
        N'16||33||79||85||11', N'PDF||PDF||PDF||PDF||PDF', N'True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'567FCEA7-F0E0-473B-94EA-127923FA0D36', N'3/14/2017 1:56:22 PM', N'10.34', N'15',
        N'72||30||9||82||79||81||37', N'PDF||PDF||PDF||PDF||PDF||PDF||PDF', N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'236FBDFD-857D-46E1-89EC-8A6911736145', N'3/4/2016 4:15:04 AM', N'19.31', N'28',
        N'35||42||16||45||13||91||13', N'PDF||PDF||PDF||PDF||PDF||PDF||MS Word', N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'EEBB9163-B88F-44F1-9F9F-ABEF0418EF78', N'3/31/2017 3:45:31 AM', N'4.83', N'7', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'630644C3-6001-4B32-AF45-3C7EA6F0CA70', N'6/6/2016 1:30:11 AM', N'5.52', N'8',
        N'63||18||92||54', N'PDF||PDF||PDF||PDF', N'True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'46998AA0-CCEA-41C5-9482-E94384FB11B0', N'8/11/2016 3:58:40 AM', N'13.79', N'20', N'57', N'MS Word', N'True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'695523CE-6DA4-4228-848C-3D92096AE18B', N'3/16/2017 3:27:06 PM', N'13.10', N'19',
        N'95||22||12||81||86||94||67', N'PDF||PDF||PDF||PDF||PDF||PDF||PDF', N'True||True||True||True||True||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'C7802FCC-95D0-48D6-BC02-DE0704D64F45', N'5/12/2016 2:32:26 AM', N'2.76', N'4', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'80D248F3-A6DC-440E-8845-3413E4ADDB88', N'5/24/2016 2:33:25 PM', N'2.07', N'3', NULL, NULL, NULL)
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'8FCE0A51-770B-43AF-997E-7B0D28236372', N'3/9/2017 9:57:52 PM', N'11.72', N'17',
        N'98||90||89||2||97||50', N'PDF||PDF||PDF||MS Word||MS Word||PDF', N'True||True||True||False||True||True')
INSERT [dbo].[doc_staging] ([ID], [CreateDate], [mversion2], [Version], [doc_id], [doc_type], [doc_requirement])
VALUES (N'29318D29-1FB1-4C91-B7A3-76A6A8185497', N'5/23/2016 8:46:08 PM', N'8.97', N'13',
        N'43||79||51||36', N'PDF||PDF||PDF||MS Word', N'True||True||True||True')
The above script creates the staging table and inserts 12 sample records. You can see that all of the columns use nvarchar(max) as the data type. The data itself contains values of common types that will need to be converted, such as int, datetime and decimal.
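You can confirm the staging data types from the catalog before going any further. This is just a quick check (assuming the table was created in the dbo schema, as in the script above); every column should report nvarchar with a maximum length of -1, which is how the catalog represents MAX:
-- Column metadata for the staging table; ordinal_position drives the generator later on
SELECT ordinal_position, column_name, data_type, character_maximum_length
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'dbo'
      AND table_name = 'doc_staging'
ORDER BY ordinal_position;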
Several fields are double pipe ("||") delimited strings, each containing concatenated values. For example, doc_id in the first row has the value N'63||93||28||50||38'. In MongoDB, these were nested records in the collection, and this is how the conversion brought them over to the staging table: as one flat record. Further, the doc_id field is ordinally related to the doc_type and doc_requirement fields. The requirement is to transpose these values to individual rows for the final conversion. But first you have to build the ultimate, properly data typed "production" table.
The Solution
The following code builds the CREATE TABLE script using a simple iterative technique (thankfully not cursors) that reads each non-null value from every column, gleaned from the metadata in INFORMATION_SCHEMA.COLUMNS, and tries to ascertain the data type of the values. It is limited to a small set of data types, such as datetime, int, decimal and character-based values, but I have found it to be reliable and fast. Listing 2 shows the code in its entirety, commented to explain the process flow.
SET nocount ON;
--DECLARE variables and tables
DECLARE @Src NVARCHAR(max)
DECLARE @Col NVARCHAR(max);
DECLARE @SQL NVARCHAR(max);
DECLARE @SQLCUR NVARCHAR(max);
DECLARE @SQL_CASE NVARCHAR(max);
DECLARE @maxlen INT;
DECLARE @DB_Col TABLE
(
    rownum INT,
    column_name NVARCHAR(max)
);
DECLARE @Tablesrc NVARCHAR(max)
DECLARE @MAX_Rec INT, @CURRENTRECORD INT;
DECLARE @MAX_RecTAB INT, @CURRENTRECORDTAB INT;
DECLARE @DT NVARCHAR(max)
CREATE TABLE #temp ( #temp NVARCHAR(max) )
CREATE TABLE #temp2 ( #temp NVARCHAR(max) )
--Create temp table #db_tab to hold table metadata
--from INFORMATION_SCHEMA.TABLES
SELECT ROWNUM = Row_number() OVER ( ORDER BY table_name),
       table_schema,
       table_name
INTO #db_tab
FROM INFORMATION_SCHEMA.TABLES
--optional WHERE clause. If commented out, each user table in the
--database will be returned
WHERE table_name = 'doc_STAGING'
--Set max record value for iteration through tables
--(sample include only 1 table)
SET @CURRENTRECORDTAB = 1
SET @MAX_RecTAB = (SELECT Max(rownum) FROM #db_tab)
--Iterate through each table
WHILE @CURRENTRECORDTAB <= @MAX_RecTAB
BEGIN
    SELECT @Src = table_schema + '.' + table_name
    FROM #db_tab
    WHERE rownum = @CURRENTRECORDTAB
    --Remove "_staging" from the new table name
    SELECT @Tablesrc = Replace(table_name, '_STAGING', '')
    FROM #db_tab
    WHERE rownum = @CURRENTRECORDTAB
    --Populate next temp table variable "@DB_Col" with column metadata
    --from INFORMATION_SCHEMA.COLUMNS
    INSERT INTO @DB_Col
    SELECT ROWNUM = Row_number() OVER ( ORDER BY ordinal_position),
           column_name
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_schema + '.' + table_name = @Src
    --Nested iteration for each column in the staging table
    SET @CURRENTRECORD = 1;
    --Begin building dynamic CREATE TABLE statement. DROP TABLE
    --can be optional
    --SET @SQL = N'DROP TABLE IF EXISTS ' + @Tablesrc + ';'
    --    + Char(13) + Char(13) + ' CREATE TABLE dbo.'
    --    + @Tablesrc + N' (' + Char(13);
    SET @SQL = ' CREATE TABLE dbo.' + @Tablesrc + N' (' + Char(13);
    --Get the max number of columns (ordinal_position field in
    --INFORMATION_SCHEMA.COLUMNS)
    SET @MAX_Rec = (SELECT Max(ordinal_position)
                    FROM INFORMATION_SCHEMA.COLUMNS
                    WHERE table_schema + '.' + table_name = @Src);
    --Start column iteration
    WHILE @CURRENTRECORD <= @MAX_Rec
    BEGIN
        --Set current column name using rownum
        SELECT @Col = N'[' + column_name + N']'
        FROM @DB_Col
        WHERE rownum = @CURRENTRECORD;
        --Get the max length of all non null values for current column
        SET @SQLCUR = N'Select max(len(' + @Col + ')) from ' + @Src
                      + ' where ' + @Col + ' is not null';
        INSERT #temp
        EXEC (@SQLCUR);
        SET @maxlen = (SELECT TOP 1 * FROM #temp);
        --Round max length to nearest 10
        SET @maxlen = Round(@maxlen, -1);
        --This does all of the data type guessing work and writing the
        --output to a second temp table #temp2
        SET @SQL_CASE = N'Select TOP 1 CASE
            WHEN Isdate(' + @Col + N') = 1 AND ISNUMERIC(' + @Col + N') <> 1 THEN ''DATETIME2''
            WHEN ISNUMERIC(' + @Col + N') = 1 and ' + @Col + N' NOT LIKE ''%.%'' THEN ''INT''
            WHEN ISNUMERIC(' + @Col + N') = 1 and ' + @Col + N' LIKE ''%.%'' THEN ''decimal(18,2)''
            WHEN (isdate(' + @Col + N') <> 1 and ISNUMERIC(' + @Col + N') <> 1) AND len(' + @Col + N') > 0
                 AND ' + Cast(@maxlen AS VARCHAR(5)) + ' < 4000 THEN ''NVARCHAR(' + Cast(@maxlen + 20 AS VARCHAR(5)) + N')''
            ELSE ''NVARCHAR(MAX)'' END
            From ' + @Src + ' where ' + @Col + N' is not null';
        INSERT #temp2
        EXEC (@SQL_CASE);
        --Get the data type (@DT) from #temp2
        SET @DT = (SELECT TOP 1 * FROM #temp2);
        SET @SQL = @SQL + N' ' + @Col + N' ' + Isnull(@DT, 'NVARCHAR(255)') + N',' + Char(13);
        --We are finished with current column, let's delete these
        --and get new ones
        DELETE FROM #temp;
        DELETE FROM #temp2;
        SET @CURRENTRECORD = @CURRENTRECORD + 1;
    END;
    --Build final SQL statement used to print CREATE TABLE
    --statement (optional EXEC)
    SET @SQL = LEFT(@SQL, ( Len(@SQL) - 2 ))
    SET @SQL = @SQL + Char(13) + ')'
    --Print the final CREATE TABLE SCRIPT
    PRINT @SQL;
    SET @CURRENTRECORDTAB = @CURRENTRECORDTAB + 1;
END
DROP TABLE [#db_tab];
DROP TABLE [#temp];
DROP TABLE [#temp2];
The code itself simply iterates through one or more tables, gathering metadata for column names and ordinal positions, interrogates each column to determine its maximum length and assumed data type, and then generates a CREATE TABLE
script, rounding the length of character-based fields to the nearest 10 and adding a buffer of 20. For example, a column whose longest value is 37 characters rounds to 40 and becomes NVARCHAR(60). This version of the code uses Unicode data types. For the small sample table, it immediately returns the following:
CREATE TABLE dbo.doc (
    [ID] NVARCHAR(60),
    [CreateDate] DATETIME2,
    [mversion2] decimal(18,2),
    [Version] INT,
    [doc_id] NVARCHAR(80),
    [doc_type] NVARCHAR(140),
    [doc_requirement] NVARCHAR(110)
)
Notice that the _staging suffix has been removed and the final table will be dbo.doc. This, of course, is easily modifiable.
In the real project, there were tens of thousands of rows and the code still performed surprisingly well.
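If you are curious how the dynamic CASE expression classifies a particular column without running the whole generator, you can write the same test statically. The sketch below runs the core of the logic from Listing 2 directly against the CreateDate column of the sample staging table (the NVARCHAR length-buffer branch is omitted for brevity, and ISDATE is sensitive to your session's language and date format settings):
-- Static version of the type-guessing CASE from Listing 2, for one column only
SELECT TOP (1)
       CASE
            WHEN ISDATE([CreateDate]) = 1 AND ISNUMERIC([CreateDate]) <> 1 THEN 'DATETIME2'
            WHEN ISNUMERIC([CreateDate]) = 1 AND [CreateDate] NOT LIKE '%.%' THEN 'INT'
            WHEN ISNUMERIC([CreateDate]) = 1 AND [CreateDate] LIKE '%.%' THEN 'decimal(18,2)'
            ELSE 'NVARCHAR(MAX)'
       END AS guessed_type
FROM [dbo].[doc_staging]
WHERE [CreateDate] IS NOT NULL;
For the sample data this returns DATETIME2, which matches the column definition in the generated script.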
After running the CREATE TABLE statement to create the empty, appropriately data typed base table, it is time to populate it from the same staging data. Recall the requirement to transpose the delimited values from the doc_id, doc_type and doc_requirement columns so that each value becomes a separate row.
You will need to transpose this:
43||79||51||36
Into this:
43
79
51
36
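One way to produce this shape, previewed here and covered in detail below, is SQL Server's STRING_SPLIT function. This is just an illustrative sketch on a literal value; note that the double pipe is first collapsed to a single pipe, for reasons explained shortly:
-- Collapse '||' to '|' and split the literal into one row per value
SELECT value
FROM STRING_SPLIT(REPLACE('43||79||51||36', '||', '|'), '|');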
And further, you will need to join these doc_ids to their precise doc_type and doc_requirement values.
The figure below shows the first row, using the sample data.
The next figure shows what the final result should be. Notice that the doc_ids have maintained their original ordinal positions rather than being sorted numerically, which is crucial because the values in both doc_type and doc_requirement share the same positional location as the corresponding doc_id they are related to.
Fortunately, there is a function in SQL Server 2016 and higher called STRING_SPLIT() that handles this type of pivoting of delimited values with ease and does so very efficiently. The trick to getting the values to line up is to cross-apply the results of the STRING_SPLIT function against the base table and use the window function ROW_NUMBER to maintain the position.
The following listing shows the simple query to perform the work.
SELECT *,
       arv.value AS doc_id_split,
       ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
FROM [dbo].[doc_staging]
CROSS APPLY STRING_SPLIT(REPLACE([doc_id], '||', '|'), N'|') arv
Notice that ROW_NUMBER is partitioned and ordered on the ID value of each row, which returns a running tally based on the number of delimited values that are cross applied.
You can then use the row number to join to the same cross-applied row numbers for doc_type and doc_requirement. Remember that each field to transpose has a variable number of delimited values; some rows may have five or more values and others only one, but each row contains the same number of doc_ids as it does doc_type and doc_requirement values.
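If you want to verify that assumption against the staging data before joining, a quick element count per row will do it. This is only an illustrative check, not part of the original solution; each '||' occurrence is two characters long, so the element count is the number of delimiters plus one:
-- Element counts per row; all three columns should agree on every row
SELECT ID,
       (LEN(doc_id) - LEN(REPLACE(doc_id, '||', ''))) / 2 + 1 AS doc_id_count,
       (LEN(doc_type) - LEN(REPLACE(doc_type, '||', ''))) / 2 + 1 AS doc_type_count,
       (LEN(doc_requirement) - LEN(REPLACE(doc_requirement, '||', ''))) / 2 + 1 AS doc_requirement_count
FROM [dbo].[doc_staging]
WHERE doc_id IS NOT NULL;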
It is also worth pointing out that the REPLACE function changes the double pipe to a single pipe character. Using the double pipe was a choice we made for more assurance that no actual data values would contain that combination of characters, but the downside is that the STRING_SPLIT function only accepts a single-character separator. Hence, the double pipe must be replaced with a single-character delimiter before splitting. The final code to return exactly what is needed for the sample data, all 46 new rows, is in the following listing, which uses a CTE-based solution.
WITH split_cte AS
(
    SELECT *,
           arv.value AS doc_id_split,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
    CROSS APPLY STRING_SPLIT(REPLACE([doc_id], '||', '|'), N'|') arv
),
split_cte2 AS
(
    SELECT id,
           arv2.value AS doc_type,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
    CROSS APPLY STRING_SPLIT(REPLACE([doc_type], '||', '|'), N'|') arv2
),
split_cte3 AS
(
    SELECT id,
           arv3.value AS doc_requirement,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RN
    FROM [dbo].[doc_staging]
    CROSS APPLY STRING_SPLIT(REPLACE([doc_requirement], '||', '|'), N'|') arv3
)
--INSERT INTO dbo.doc
SELECT sc1.ID,
       sc1.CreateDate,
       sc1.mversion2,
       sc1.Version,
       --,sc1.doc_id
       sc1.doc_id_split AS doc_id,
       sc2.doc_type,
       sc3.doc_requirement
FROM split_cte sc1
INNER JOIN
(
    SELECT id, split_cte2.doc_type, split_cte2.RN
    FROM split_cte2
) sc2
    ON sc1.id = sc2.id AND sc1.RN = sc2.RN
INNER JOIN
(
    SELECT id, split_cte3.doc_requirement, split_cte3.RN
    FROM split_cte3
) sc3
    ON sc1.id = sc3.id AND sc1.RN = sc3.RN;
Each CTE splits one of the delimited columns, and the CTEs are joined together in the outer query. Since the OVER clause for ROW_NUMBER is identical in each case, they join together perfectly on the ID and row number.
By using this code, you can easily insert the results into the newly created dbo.doc table. I have commented out that INSERT statement so that you can see the results first.
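Once you are happy with the output, uncomment the INSERT INTO dbo.doc line in the listing and run it again. For the sample data, a quick count should confirm the 46 rows mentioned above:
-- Sanity check after enabling the insert (46 rows expected for the sample data)
SELECT COUNT(*) AS loaded_rows
FROM dbo.doc;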
Summary
I like the idea of having reusable code, even if it is just to pick out certain pieces from a larger scope. I believe the code I am sharing here, which certainly has room for improvement and extension, can be readily used for a variety of other purposes with slight modification. One such purpose would be to generate CREATE TABLE scripts for each table in an existing database; instead of deriving the data types, you could simply reuse the existing ones. I will most likely go back to it many times for future projects that may involve a REST API, for example, rather than a MongoDB conversion. I know, too, that I will find a need to transpose delimited data somewhere down the road. Even if you do not have a specific need for a MongoDB conversion, I hope that the techniques shown here will help you in your projects.
References
For further information on the JSON framework used for the C# code as well as the bsondump utility, please see the following links.
https://www.newtonsoft.com/json
https://docs.mongodb.com/manual/reference/program/bsondump/
It is worth pointing out that the documentation for bsondump says that it is a diagnostic tool and not a tool for data ingestion or other application use.
The post A MongoDB to SQL Server Migration: Data Typing and Un-nesting appeared first on Simple Talk.