Tuesday, December 25, 2018

Saving Data to Various Formats, Using SQL

I enjoy pulling the data out of AdventureWorks. It is a great test harness. What is the quickest way of doing it? Well, everyone knows it is native BCP, but how much faster is that format than tab-delimited or comma-delimited BCP? Can we quickly output data in XML? Is there a way of outputting array-in-array JSON reasonably quickly? Of course, the answer is going to vary from system to system, and across versions, but any data about this is usually welcome.

In addition to these questions, I wanted to know more about how much space these files take up, either raw or zipped. We’re about to find out. We’ll test all that, using good ol’ BCP and SQLCMD.

My motivation for doing this was to explore ways of quickly transferring data to MongoDB. to test out a way of producing array-in-array JSON at a respectable turn of speed. It turned out to be tricky. The easy and obvious ways were slow.

How? Why?

The reason I like AdventureWorks is that it has a bit of an assault course there. There are illegal column names, reserved words used as columns, CLR types, and free text search. Not as much of this as I’d like, of course, but it makes things a bit more tricky, so it is good for testing, along with old Pubs with its awkward obsolete data types.

Because the SQL Server implementation of XML and JSON are rather delicate creatures that quail at the smell of a CLR Datatype, or a column name with an embedded space, we generally need some supporting logic, rather than belt out the tables to file just by SELECT *.

As most people know, you can use BCP very simply, just specifying each table. Here we output in Tab-delimited format in utf-8, which is a format that Excel likes. It is safe as long as you don’t include tab characters in your spreadsheet cells! Note that xp_cmdshell must be enabled to try this example.

--over-ride this path if you need to.
DECLARE @ourPath sysname = 'OurPath\TabDelimited\';
DECLARE @Database sysname = Db_Name(); 
DECLARE @command NVARCHAR(4000)= '
EXECUTE xp_cmdshell ''bcp  ?  out '+@ourPath +'?.Tab -c -C 65001 -d'
    +@Database+' -T -S '+@@servername +''', NO_OUTPUT;'
EXECUTE sp_msforeachtable @command

Native and “comedy-limited” formats aren’t much harder. Proper CSV format output from the tables of a database is only possible via ODBC.

Creating the Test Harness

For this task, I set up several directories on the server for the different types of output

Then ran the script from The Code section in this article which wrote out the data from AdventureWorks into each directory in the different formats. I included within the batch all the necessary temporary procedures along with a routine that switched in the configuration that allowed me to use xp_cmdshell and then switched it off again.

Into each directory was written all the files, one for each table, like this.

One can, and should of course, open these files up and check on them. Here is the Tab-delimited format.

OK, it would be good not to have the quoted identifiers in the filename [Human Resources]. [Employee].Tab, perhaps, but they are legal. If they irritate you, it is very quick and easy to remove them.

Get-ChildItem -path 'MyDirectory’ -Filter "*.*" -Recurse |
  foreach{if ($_.name -match '(\[|\])')
      { Rename-Item -NewName {$_.name -replace '(\[|\])',''}}}

Mixed Usefulness of the Different Formats

Not all these formats are particularly useful. Tab-delimited is fine just so long as you don’t have tabs in any strings. The comma-separated format isn’t CSV. If you run …

SQLCMD -d AdventureWorks2016 -E -u  -s, -W -Q "set nocount on; SELECT * from (values ('This text has commas, like this, and so should be delimited.','comedy-limited, nice but unreliable'),(',,,,,','This text has no commas'))f(one,two)"

… you will see instantly that it hasn’t coped. It isn’t CSV.

one,two

—,—

This text has commas, like this, and so should be delimited.,comedy-limited, nice but unreliable

,,,,,,This text has no commas

The Results

The XML via BCP and the native BCP both worked very rapidly. Tab delimited and comma-delimited took a little bit longer. SQLCMD and the object-in-array JSON were slightly laggard but respectable. Bringing up the rear was the array-in-array-format. I originally tried exporting array-in-array JSON via XML. However, it turned out to be too slow. Although XML is quick to export, the overhead of converting it to JSON is slow. The method of using OpenJSON was again prohibitively slow, taking two or three minutes.

A word of caution with these results. They can vary considerably, even on the same server and general setup.

The performance of array-in-array JSON was as good as I can manage to get. CLR, or even PowerShell could easily do better. I’d prefer to wait for SQL Server to support the format. However, this is only five times as long as the fastest, BCP. It is usable for most purposes.

What about the space they take up?

The advantage of the array-in-array format of JSON is that the text takes half the space of the object -in-array format. This advantage almost vanishes when the files are zipped, though.

XML and Json Object-in-Array format both take around twice as much space as the simpler tabular formats, which is unsurprising as they are more versatile in the structure of the information they can hold. The large space used by comma-delimited files is due to the fact that I had difficulty storing the file in UTF-8 format. I opted for Unicode instead. This disappeared when the files were zipped. Comma-delimited format was only there for the sake of completion.

All the formats, when zipped, take roughly the same space. This shouldn’t be a surprise because they all held exactly the same information. It is just that some formats repeated it a lot!

The Code

So, here is the SQL to run the comparative tests. They take some time to run. Something like three and a half minutes for AdventureWorks. You’ll see that we start by hurriedly ensuring that we are allowed to use XP_CmdShell and changing the configuration if necessary. Then we run the tests and finally, if necessary, close off the configuration to prevent it being used.

To run this, change the path values to a directory on your server if required. Remember that you are writing these files to your server, not the workstation.

I describe the methodology for the test harness, and how to set it up as a SQL Prompt snippet, in an article ‘How to record T-SQL execution times using a SQL Prompt snippet’.

These routines need to access temporary stored procedures. These are all stored on GitHub, and I’ll keep those updated. They are here, at JSONSQLServerRoutines along with several others that I use for processing JSON. For this example, create the #SaveJsonDataFromTable and #ArrayInArrayJsonDataFromTable temporary stored procedures in the same query window before running the following code.

/* lets just check the config and make sure that xp_cmdshell is enabled. */
DECLARE @Settings TABLE (
    name sysname, minimum INT, Maximum INT, config_value INT, run_value INT);
INSERT INTO @Settings (name, minimum, Maximum, config_value, run_value)
  EXECUTE sp_configure @configname = 'show advanced options';
IF NOT EXISTS (
  SELECT * FROM @Settings WHERE name = 'show advanced options'
  AND run_value = 1)
  BEGIN
    EXECUTE sp_configure 'show advanced options', 1;
    RECONFIGURE;
  END;
INSERT INTO @Settings (name, minimum, Maximum, config_value, run_value)
  EXECUTE sp_configure @configname = 'xp_cmdshell';
IF NOT EXISTS (
  SELECT * FROM @Settings WHERE name = 'xp_cmdshell'
  AND run_value = 1)
  BEGIN
    EXECUTE sp_configure 'xp_cmdshell', 1;
    RECONFIGURE;
  END;
go
/* start of timed run */
DECLARE @log TABLE 
  (TheOrder INT IDENTITY(1,1), 
  WhatHappened varchar(200),
  WHENItDid  Datetime2 DEFAULT GETDATE())
USE adventureworks2016
----start of timing
INSERT INTO @log(WhatHappened) SELECT 'Starting Writing out every table'
--place at the start
DECLARE @ourPath9 sysname = 'C:\Data\RawData\AdventureWorks\XML\';
Declare @command9 NVARCHAR(4000)= 
'EXEC xp_cmdshell ''bcp "SELECT * FROM ? FOR XML AUTO"  queryout "'
     +@ourPath9+'?.xml" -S '+@@Servername+' -T -c -C 65001 -t'', NO_OUTPUT;'
EXECUTE sp_msforeachtable @command9
INSERT INTO @log(WhatHappened) 
    SELECT 'Writing out every table as XML  took '
    --log the time taken to get to this point
/* first lets try using JSON array--in-array format */
DECLARE @ourPath1 sysname = 'C:\Data\RawData\AdventureWorks\JSONArrayInArray\';
Declare @command1 NVARCHAR(4000)= '
DECLARE @Json NVARCHAR(MAX)
EXECUTE #ArrayInArrayJsonDataFromTable @TableSpec=''?'',@JSONData=@json OUTPUT
CREATE TABLE ##myTemp (Bulkcol nvarchar(MAX))
INSERT INTO ##myTemp (Bulkcol) SELECT @JSON
EXECUTE xp_cmdshell ''bcp ##myTemp out "'+@ourPath1
     +'?.JSON" -c -C 65001 -S '+@@Servername+' -T '', NO_OUTPUT;
DROP TABLE ##myTemp'
EXECUTE sp_msforeachtable @command1
INSERT INTO @log(WhatHappened) 
SELECT 'Writing out every table as Array-in-Array JSON took '
--log the time taken to get to this point
DECLARE @ourPath2 sysname = 'C:\Data\RawData\AdventureWorks\JSONObjectInArray\';
DECLARE @command2 NVARCHAR(4000)= '
DECLARE @Json NVARCHAR(MAX)
EXECUTE #SaveJsonDataFromTable @TableSpec=''?'',@JSONData=@json OUTPUT
CREATE TABLE ##myTemp (Bulkcol nvarchar(MAX))
INSERT INTO ##myTemp (Bulkcol) SELECT @JSON
EXECUTE xp_cmdshell ''bcp ##myTemp out '+@ourPath2
    +'?.JSON -c -C 65001 -S '+@@Servername+' -T '', NO_OUTPUT;
DROP TABLE ##myTemp'
EXECUTE sp_msforeachtable @command2
INSERT INTO @log(WhatHappened) 
SELECT 'Writing out every table as Object-in-Array JSON took '
--log the time taken to get to this point
DECLARE @ourPath3 sysname = 'C:\Data\RawData\AdventureWorks\TabDelimited\';
DECLARE @Database3 sysname = Db_Name(); --over-ride this if you need to.
DECLARE @command3 NVARCHAR(4000)= '
EXECUTE xp_cmdshell ''bcp  ?  out '+@ourPath3+'?.Tab -c -C 65001 -d'
    +@Database3+' -S '+@@Servername+' -T '', NO_OUTPUT;'
EXECUTE sp_msforeachtable @command3
INSERT INTO @log(WhatHappened) 
SELECT 'Writing out every table as tab-delimited BCP took '
--log the time taken to get to this point
DECLARE @ourPath4 sysname = 'C:\Data\RawData\AdventureWorks\NativeBCP\';
DECLARE @Database4 sysname = Db_Name(); --over-ride this if you need to.
DECLARE @command4 NVARCHAR(4000)= '
EXECUTE xp_cmdshell ''bcp  ?  out '+@ourPath4+'?.data -N -d'
    +@Database4+' -S '+@@Servername+' -T '', NO_OUTPUT;'
EXECUTE sp_msforeachtable @command4
INSERT INTO @log(WhatHappened) 
SELECT 'Writing out every table as native BCP took '
--log the time taken to get to this point
DECLARE @ourPath6 sysname = 'C:\Data\RawData\AdventureWorks\CSV\';
DECLARE @Database6 sysname = Db_Name(); --over-ride this if you need to.
DECLARE @command6 NVARCHAR(4000)= '
EXECUTE xp_cmdshell ''sqlcmd  -d '
    +@Database6+' -u -E -h -1 -s, -W -Q "set nocount on; SELECT * FROM ?" -S '
    +@@Servername+' -o '+@ourPath6++'?.CSV'', NO_OUTPUT;'
EXECUTE sp_msforeachtable @command6
INSERT INTO @log(WhatHappened) 
SELECT 'Writing out every table as SQLCMD comedy-limited took '
--log the time taken to get to this point
SELECT ending.whathappened, 
    DateDiff(ms, starting.whenItDid,ending.WhenItDid) AS ms 
FROM @log starting
INNER JOIN @log ending ON ending.theorder=starting.TheOrder+1
UNION all
SELECT 'Total', DateDiff(ms,Min(WhenItDid),Max(WhenItDid))  
FROM @log 
ORDER BY ms asc
--list out all the timings
go
DECLARE @Settings TABLE 
    (name sysname, minimum INT, Maximum INT, config_value INT, run_value INT);
INSERT INTO @Settings (name, minimum, Maximum, config_value, run_value)
  EXECUTE sp_configure @configname = 'show advanced options';
IF NOT EXISTS (
  SELECT * FROM @Settings WHERE name = 'show advanced options'
  AND run_value = 1)
  BEGIN
    EXECUTE sp_configure 'show advanced options', 1;
    RECONFIGURE;
  END;
INSERT INTO @Settings (name, minimum, Maximum, config_value, run_value)
  EXECUTE sp_configure @configname = 'xp_cmdshell';
IF NOT EXISTS (
  SELECT * FROM @Settings WHERE name = 'xp_cmdshell'
  AND run_value = 0)
  BEGIN
    EXECUTE sp_configure 'xp_cmdshell', 0;
    RECONFIGURE;
  END;

Conclusions

It is very useful to run tests like these, and I caught several bugs while doing so. However, I much prefer testing things rather than waving my hands and arguing.

What I’d conclude from all this is that JSON is a good way of storing data and isn’t really any bulkier as a format than any other way of doing it. When data is zipped, it seems to boil down to the same size whatever the format you use. Obviously, Native BCP is the quickest and neatest way of ‘dumping’ data but it isn’t easy to read, and you cannot prepare data from another source and import it as native BCP because it isn’t a data transport medium. If you want an unzipped tabular format with the robustness of JSON and the size of CSV, then JSON Array-in-array is the one to go for.

None of the delimited formats are safe in that it is always possible that column value could contain an unescaped character that is used as a delimiter. CSV, in contrast, would be fine if there were a reliable way of doing it in BCP. If you want a safe way of transporting data, then XML or JSON is an obvious choice but who would use XML when there was an alternative?

The post Saving Data to Various Formats, Using SQL appeared first on Simple Talk.



from Simple Talk http://bit.ly/2EPSsgn
via

No comments:

Post a Comment