Friday, September 30, 2022

Generating Repeatable Sets Of Test Rows

In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an account into a table and a set of interests. I wanted then to be able to load a set of random data into edge, related one account to another (one follows the other), and then an account to a random set of interests.

In this article, I will discuss a few techniques I used, starting with the simplest method using ordering by NEWID(), then using RAND() to allow the code to generate a fixed set of data.

The Setup

As the base data, I am going to create a table named: Demo.FetchRandomDataFrom that will serve as my stand-in for any table you want to randomly fetch data from.

USE Tempdb; 
GO
CREATE SCHEMA Demo;
GO
SET NOCOUNT ON;
DROP TABLE IF EXISTS  Demo.FetchRandomDataFrom
GO
CREATE TABLE Demo.FetchRandomDataFrom
(
    --this value should be the primary key of your table, 
    --ideally using the clustered index key for best 
    --performance, but definitelu unique.
    FetchRandomDataFromId INT IDENTITY(2,4) 
        CONSTRAINT PKFetchRandomDataFrom PRIMARY KEY,
    SomeOtherData CHAR(1000) NOT NULL
)
GO
--Make the rows of a consequential size. Small row size often
--is an issue with random sets of data you want to use because 
--it performs different than real tables.
INSERT INTO Demo.FetchRandomDataFrom
( SomeOtherData )
VALUES (REPLICATE('a',1000));
GO 100000 --create 100000 rows

Let’s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. the goal was to make good and sure that the data was obviously not a sequential number. If your keys are pretty much sequential, you may not need to do some steps I have included, but unless it is a perfect set without gaps, that may come into play when trying to make repeatable outcomes from the same data.

SELECT TOP 10 *
FROM   Demo.FetchRandomDataFrom
ORDER BY FetchRandomDataFromId

This returns:

FetchRandomDataFromId SomeOtherData
--------------------- ----------------------------------
2                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
6                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
10                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
14                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
18                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
22                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
26                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
30                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
38                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This will be my stand in for a more realistic set of data where SomeOtherData would actually have been number, dates, values, etc.

Using NEWID()

If you simply need a random set of rows from a table, you can simply do something like this:

SELECT TOP 10 FetchRandomDataFromId
FROM   Demo.FetchRandomDataFrom
ORDER BY NEWID();

This just sorts the data in a random order based on the NEWID() GUID values being generated. It is a wonderful technique you can do with almost any set of data, Executing the previous statement will return a random set of rows each time you execute it, such as:

FetchRandomDataFromId
---------------------
177818
138446
272154
171854
195266
105330
314422
203530
395666
365046

There are data sampling code types in SQL Server syntax, which you can read about here in an article by Derek Colley, but like this article also quotes, if you want a random set, most basic methods use NEWID() to get truly random outputs.

The biggest pro with that method is that it is simple and very much random. However, there are two downfalls to this method. First, no duplicated values will be found in that set. For most cases, that is great, but in other cases, you may wish to have the chance of the same value being generated multiple times (in my real case, I absolutely want the same node to end up connected to more than one node. Second, it is non repeatable. If you want to send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can’t count how many times I have generated a set, found something interesting about it and then lost the set.

Repeatable Random values

In order to make a random enough set that was repeatable, I then decided to take advantage of the RAND() function and one of its most interesting capabilities. This being that it isn’t really technical random as you can pass any number into it as a seed, and then all the next values generated are technically deterministic.

For example, execute the following code on any SQL Server:

SELECT RAND(1)
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()

The output will be:

----------------------
0.713591993212924
0.472241415009636
0.943996112912269
0.0088524383111658
0.343443392823765
0.210903950196124

I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute RAND(1) and then an hour later, the rest of the code, the “random” value that is output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept using RAND() and loading a temp table of random data:

It is a lot of code, but I included quite a bit of comments to make it clear:

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom
GO
CREATE TABLE #HoldFetchRandomDataFrom
(
    FetchRandomDataFromId INT PRIMARY KEY
)
DECLARE @FetchRandomDataFromId INT, --hold the fetched value
                @RowsLoopCounter INT = 0, --number of rows fetched
                @LoopCounter INT = 0, --number of loops made, 
                                --to stop if too many
                @RowsInTableCount INT, --Used to be able to set the 
                                 --RAND value
                @SafetyFactor NUMERIC(3,1) = 2, --this times the loop 
                                         --counter for max loops
                
                @DesiredRowCount INT = 100, --number of rows you want 
                                            --in output
                @Seed INT = 1; --set specific value if you want 
                              --repeatability
                --default the seed (doing this, I give it a seed that 
                --is repeatable and will override the one that was set 
                --in the next execution unless you reconnect.
                SET @seed = COALESCE(@seed,2147483647 * RAND())
                --sets the seed to start generating

                DECLARE @SetSeed INT = RAND(@seed)

                SELECT CONCAT('Seed used: ',@seed);

--load all of the data into this table. The PrepId column will 
--hold a sequential value to make random number easier
DROP TABLE IF EXISTS #PrepFetchRandomDataFrom 
CREATE TABLE #PrepFetchRandomDataFrom
(
     --Start at 0 because RAND can produce 0
        PrepId INT IDENTITY(0,1), 
        FetchRandomDataFromId INT NOT NULL
)
INSERT INTO #PrepFetchRandomDataFrom
(
    FetchRandomDataFromId
)
SELECT FetchRandomDataFromId
FROM  Demo.FetchRandomDataFrom;
--the random number will be this times RAND. (it has to be 
--captured in a generic bit of code because you could choose a 
--value > the number of rows
SET @RowsInTableCount = (SELECT COUNT(*) 
                         FROM #PrepFetchRandomDataFrom);
WHILE 1=1
 BEGIN
        --this tells you the number of loops done.
        SET @Loopcounter = @LoopCounter + 1;
        --this way, if some error keeps happening, you will 
     --eventually stop
        IF @LoopCounter >= @DesiredRowCount * @SafetyFactor
          BEGIN
                 RAISERROR('Loop stopped due to safety counter',10,1); 
                 BREAK;
          END;
        --fetch one random row
        SELECT @FetchRandomDataFromId = RAND() * @RowsInTableCount
        --do the insert in a try. You may have criteria in your 
     --table that needs to be met (uniqueness, for example)
        BEGIN TRY
        --join to get the real value from the table
        INSERT INTO #HoldFetchRandomDataFrom(FetchRandomDataFromId)
        SELECT FetchRandomDataFromId
        FROM    #PrepFetchRandomDataFrom
        WHERE  PrepId = @FetchRandomDataFromId
        
        --this is set inside the try, because the row has been 
     --added
        SET @RowsLoopCounter = @RowsLoopCounter + 1;
        --this is then end if the rows inserted match the desired 
     --rowcount
        IF @RowsLoopCounter = @DesiredRowCount
         BREAK;
        END TRY
        BEGIN CATCH
           --most likely error is a duplicated value
           DECLARE @msg NVARCHAR(1000);
           --display the message in case you do something such 
           --that the loop can never complete
           SET @msg = CONCAT('Some error occurred:' 
                                             ,ERROR_MESSAGE())
           RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message 
           --Note, we keep going. This is what the safety factor 
          --was for, because an error could cause infinite loop
        END CATCH;
END;

Output the rows:

SELECT *
FROM   #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId;

Now since I set the seed value to 1, you will see the same output:

-----------------------
Seed used: 1

FetchRandomDataFromId
---------------------
1670
2870
3542
4606
6982
10626
12846

Clearly this is a LOT more code than simply TOP 1 ORDER BY NEWID(). Am I saying this is always necessary? Not at all. But it is a technique that you can use to generate a repeatable random set of data (as far as you care it to be random).

For example, you could test some code with a random set of data that you find an error. Instead of having to save that data off for the next person, you could just use the script. My reason for doing this is for a book I am working on. I want to be able to send a set of edges to the reader without making this huge script. So I needed two correlated random values. This method made that pretty easy.

Multiple Related Repeatable Random values

What made me first try this was I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300000 times, it took close to forever to complete (I gave up at 150000 rows after that executed all night).

That is how this next piece of code was written:

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom
GO
CREATE TABLE #HoldFetchRandomDataFrom
(
         FetchRandomDataFromId1 INT NOT NULL,
         FetchRandomDataFromId2 INT NOT NULL,
         --no duplicate rows
         PRIMARY KEY (FetchRandomDataFromId1,
                                 FetchRandomDataFromId2),
         --no self relationship. Needs to be different values
         CHECK (FetchRandomDataFromId1 <> FetchRandomDataFromId2)
)
DECLARE @FetchRandomDataFromId1 INT, --hold the first fetched 
                                     --value
           @FetchRandomDataFromId2 INT, --the second fetched value
           @RowsLoopCounter INT = 0, --number of rows fetched
           @LoopCounter INT = 0, --number of loops made, to stop if 
                              --too many                
        @RowsInTableCount INT, --Used to be able to set the RAND 
                               --value
        @SafetyFactor NUMERIC(3,1) = 2, --this times the loop  
                               --counter for max loops
                
        @DesiredRowCount INT = 100, -- number of rows you want 
                                    --in output
        @Seed INT = 1; --set specific value if you want 
                        --repeatability
                --default the seeds (doing this, I give it a seed that 
          --is repeatable and will override the one that was set 
          --in the next execution unless you reconnect.
                SET @seed = COALESCE(@seed,2147483647 * RAND())
                --sets the seed to start generating
                DECLARE @SetSeed INT = RAND(@seed)
                SELECT CONCAT('Seed used:',@seed);
--load all of the data into this table. The PrepId column will 
--hold a sequential value to make random number easier'
DROP TABLE IF EXISTS #PrepFetchRandomDataFrom 
CREATE TABLE #PrepFetchRandomDataFrom
(
      --Start at 0 because RAND can produce 0
        PrepId INT IDENTITY(0,1), 
        FetchRandomDataFromId INT NOT NULL
)
INSERT INTO #PrepFetchRandomDataFrom
(
    FetchRandomDataFromId
)
SELECT FetchRandomDataFromId
FROM  Demo.FetchRandomDataFrom;
--the random number will be this times RAND. (it has to be 
--captured in a generic bit of code because you could choose a 
--value > the number of rows
SET @RowsInTableCount = (SELECT COUNT(*) 
                         FROM #PrepFetchRandomDataFrom);
WHILE 1=1
 BEGIN
        --this tells you the number of loops done.
        SET @Loopcounter = @LoopCounter + 1;
        --this way, if some error keeps happening, you will 
     --eventually stop
        IF @LoopCounter >= @DesiredRowCount * @SafetyFactor
          BEGIN
                 RAISERROR('Loop stopped due to safety counter',10,1); 
                 BREAK;
          END;
    --fetch one random row
    SELECT @FetchRandomDataFromId1 = RAND() * @RowsInTableCount;
    SELECT @FetchRandomDataFromId2 = RAND() * @RowsInTableCount;
        --do the insert in a try. You may have criteria in your 
     --table that needs to be met (uniqueness, for example)
        BEGIN TRY
        --join to get the real value from the table
        INSERT INTO #HoldFetchRandomDataFrom
              (FetchRandomDataFromId1,FetchRandomDataFromId2)
        VALUES((SELECT FetchRandomDataFromId
                        FROM    #PrepFetchRandomDataFrom
                        WHERE  PrepId = @FetchRandomDataFromId1),
                                        (SELECT FetchRandomDataFromId
                                        FROM    #PrepFetchRandomDataFrom
                                        WHERE  PrepId = 
                                   @FetchRandomDataFromId2));
        
        --this is set inside the try, because the row has 
     --been added
        SET @RowsLoopCounter = @RowsLoopCounter + 1;
        --this is then end if the rows inserted match the desired 
     --rowcount
        IF @RowsLoopCounter = @DesiredRowCount
         BREAK;
        END TRY
        BEGIN CATCH
                --most likely error is a duplicated value
                DECLARE @msg NVARCHAR(1000);
                --display the message in case you do something such 
          --that the loop can never complete
                SET @msg = CONCAT('Some error occurred:' 
                                             ,ERROR_MESSAGE())
                RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message 
                --Note, we keep going. This is what the safety factor 
          --was for, because an error could cause infinite loop
        END CATCH
END;
--Output the rows:
SELECT *
FROM   #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId1;

Executing this returned:

---------------------
Seed used:1
FetchRandomDataFromId1 FetchRandomDataFromId2
---------------------- ----------------------
3542                   137378
4606                   301534
9346                   31374
10626                  110734
14702                  106954
14978                  279338

If you do a little digging, you can see the values from the previous generation in here since I used the same seed:

SELECT FetchRandomDataFromId1
FROM #HoldFetchRandomDataFrom
UNION ALL
SELECT FetchRandomDataFromId2
FROM #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId1

This returns:

FetchRandomDataFromId1

----------------------
1670
2870
3542
4606
6982
9346
10626

And the first, single row query returned (copied from earlier in the document:

FetchRandomDataFromId
---------------------
1670
2870
3542
4606
6982
10626
12846

You can see a new value in the sequence (which makes sense because here are 2X the number of random values generated), but still the same values.

 

The post Generating Repeatable Sets Of Test Rows appeared first on Simple Talk.



from Simple Talk https://ift.tt/NDqvOoi
via

No comments:

Post a Comment