In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an account into a table and a set of interests. I wanted then to be able to load a set of random data into edge, related one account to another (one follows the other), and then an account to a random set of interests.
In this article, I will discuss a few techniques I used, starting with the simplest method using ordering by NEWID()
, then using RAND()
to allow the code to generate a fixed set of data.
The Setup
As the base data, I am going to create a table named: Demo.FetchRandomDataFrom
that will serve as my stand-in for any table you want to randomly fetch data from.
USE Tempdb; GO CREATE SCHEMA Demo; GO SET NOCOUNT ON; DROP TABLE IF EXISTS Demo.FetchRandomDataFrom GO CREATE TABLE Demo.FetchRandomDataFrom ( --this value should be the primary key of your table, --ideally using the clustered index key for best --performance, but definitelu unique. FetchRandomDataFromId INT IDENTITY(2,4) CONSTRAINT PKFetchRandomDataFrom PRIMARY KEY, SomeOtherData CHAR(1000) NOT NULL ) GO --Make the rows of a consequential size. Small row size often --is an issue with random sets of data you want to use because --it performs different than real tables. INSERT INTO Demo.FetchRandomDataFrom ( SomeOtherData ) VALUES (REPLICATE('a',1000)); GO 100000 --create 100000 rows
Let’s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. the goal was to make good and sure that the data was obviously not a sequential number. If your keys are pretty much sequential, you may not need to do some steps I have included, but unless it is a perfect set without gaps, that may come into play when trying to make repeatable outcomes from the same data.
SELECT TOP 10 * FROM Demo.FetchRandomDataFrom ORDER BY FetchRandomDataFromId
This returns:
FetchRandomDataFromId SomeOtherData --------------------- ---------------------------------- 2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 6 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 10 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 14 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 18 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 22 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 26 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 30 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 34 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 38 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
This will be my stand in for a more realistic set of data where SomeOtherData
would actually have been number, dates, values, etc.
Using NEWID()
If you simply need a random set of rows from a table, you can simply do something like this:
SELECT TOP 10 FetchRandomDataFromId FROM Demo.FetchRandomDataFrom ORDER BY NEWID();
This just sorts the data in a random order based on the NEWID()
GUID values being generated. It is a wonderful technique you can do with almost any set of data, Executing the previous statement will return a random set of rows each time you execute it, such as:
FetchRandomDataFromId --------------------- 177818 138446 272154 171854 195266 105330 314422 203530 395666 365046
There are data sampling code types in SQL Server syntax, which you can read about here in an article by Derek Colley, but like this article also quotes, if you want a random set, most basic methods use NEWID()
to get truly random outputs.
The biggest pro with that method is that it is simple and very much random. However, there are two downfalls to this method. First, no duplicated values will be found in that set. For most cases, that is great, but in other cases, you may wish to have the chance of the same value being generated multiple times (in my real case, I absolutely want the same node to end up connected to more than one node. Second, it is non repeatable. If you want to send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can’t count how many times I have generated a set, found something interesting about it and then lost the set.
Repeatable Random values
In order to make a random enough set that was repeatable, I then decided to take advantage of the RAND()
function and one of its most interesting capabilities. This being that it isn’t really technical random as you can pass any number into it as a seed, and then all the next values generated are technically deterministic.
For example, execute the following code on any SQL Server:
SELECT RAND(1) UNION ALL SELECT RAND() UNION ALL SELECT RAND() UNION ALL SELECT RAND() UNION ALL SELECT RAND() UNION ALL SELECT RAND()
The output will be:
---------------------- 0.713591993212924 0.472241415009636 0.943996112912269 0.0088524383111658 0.343443392823765 0.210903950196124
I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute RAND(1)
and then an hour later, the rest of the code, the “random” value that is output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept using RAND()
and loading a temp table of random data:
It is a lot of code, but I included quite a bit of comments to make it clear:
DROP TABLE IF EXISTS #HoldFetchRandomDataFrom GO CREATE TABLE #HoldFetchRandomDataFrom ( FetchRandomDataFromId INT PRIMARY KEY ) DECLARE @FetchRandomDataFromId INT, --hold the fetched value @RowsLoopCounter INT = 0, --number of rows fetched @LoopCounter INT = 0, --number of loops made, --to stop if too many @RowsInTableCount INT, --Used to be able to set the --RAND value @SafetyFactor NUMERIC(3,1) = 2, --this times the loop --counter for max loops @DesiredRowCount INT = 100, --number of rows you want --in output @Seed INT = 1; --set specific value if you want --repeatability --default the seed (doing this, I give it a seed that --is repeatable and will override the one that was set --in the next execution unless you reconnect. SET @seed = COALESCE(@seed,2147483647 * RAND()) --sets the seed to start generating DECLARE @SetSeed INT = RAND(@seed) SELECT CONCAT('Seed used: ',@seed); --load all of the data into this table. The PrepId column will --hold a sequential value to make random number easier DROP TABLE IF EXISTS #PrepFetchRandomDataFrom CREATE TABLE #PrepFetchRandomDataFrom ( --Start at 0 because RAND can produce 0 PrepId INT IDENTITY(0,1), FetchRandomDataFromId INT NOT NULL ) INSERT INTO #PrepFetchRandomDataFrom ( FetchRandomDataFromId ) SELECT FetchRandomDataFromId FROM Demo.FetchRandomDataFrom; --the random number will be this times RAND. (it has to be --captured in a generic bit of code because you could choose a --value > the number of rows SET @RowsInTableCount = (SELECT COUNT(*) FROM #PrepFetchRandomDataFrom); WHILE 1=1 BEGIN --this tells you the number of loops done. SET @Loopcounter = @LoopCounter + 1; --this way, if some error keeps happening, you will --eventually stop IF @LoopCounter >= @DesiredRowCount * @SafetyFactor BEGIN RAISERROR('Loop stopped due to safety counter',10,1); BREAK; END; --fetch one random row SELECT @FetchRandomDataFromId = RAND() * @RowsInTableCount --do the insert in a try. You may have criteria in your --table that needs to be met (uniqueness, for example) BEGIN TRY --join to get the real value from the table INSERT INTO #HoldFetchRandomDataFrom(FetchRandomDataFromId) SELECT FetchRandomDataFromId FROM #PrepFetchRandomDataFrom WHERE PrepId = @FetchRandomDataFromId --this is set inside the try, because the row has been --added SET @RowsLoopCounter = @RowsLoopCounter + 1; --this is then end if the rows inserted match the desired --rowcount IF @RowsLoopCounter = @DesiredRowCount BREAK; END TRY BEGIN CATCH --most likely error is a duplicated value DECLARE @msg NVARCHAR(1000); --display the message in case you do something such --that the loop can never complete SET @msg = CONCAT('Some error occurred:' ,ERROR_MESSAGE()) RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message --Note, we keep going. This is what the safety factor --was for, because an error could cause infinite loop END CATCH; END;
Output the rows:
SELECT * FROM #HoldFetchRandomDataFrom ORDER BY FetchRandomDataFromId;
Now since I set the seed value to 1, you will see the same output:
----------------------- Seed used: 1 FetchRandomDataFromId --------------------- 1670 2870 3542 4606 6982 10626 12846
Clearly this is a LOT more code than simply TOP 1 ORDER BY NEWID()
. Am I saying this is always necessary? Not at all. But it is a technique that you can use to generate a repeatable random set of data (as far as you care it to be random).
For example, you could test some code with a random set of data that you find an error. Instead of having to save that data off for the next person, you could just use the script. My reason for doing this is for a book I am working on. I want to be able to send a set of edges to the reader without making this huge script. So I needed two correlated random values. This method made that pretty easy.
Multiple Related Repeatable Random values
What made me first try this was I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300000 times, it took close to forever to complete (I gave up at 150000 rows after that executed all night).
That is how this next piece of code was written:
DROP TABLE IF EXISTS #HoldFetchRandomDataFrom GO CREATE TABLE #HoldFetchRandomDataFrom ( FetchRandomDataFromId1 INT NOT NULL, FetchRandomDataFromId2 INT NOT NULL, --no duplicate rows PRIMARY KEY (FetchRandomDataFromId1, FetchRandomDataFromId2), --no self relationship. Needs to be different values CHECK (FetchRandomDataFromId1 <> FetchRandomDataFromId2) ) DECLARE @FetchRandomDataFromId1 INT, --hold the first fetched --value @FetchRandomDataFromId2 INT, --the second fetched value @RowsLoopCounter INT = 0, --number of rows fetched @LoopCounter INT = 0, --number of loops made, to stop if --too many @RowsInTableCount INT, --Used to be able to set the RAND --value @SafetyFactor NUMERIC(3,1) = 2, --this times the loop --counter for max loops @DesiredRowCount INT = 100, -- number of rows you want --in output @Seed INT = 1; --set specific value if you want --repeatability --default the seeds (doing this, I give it a seed that --is repeatable and will override the one that was set --in the next execution unless you reconnect. SET @seed = COALESCE(@seed,2147483647 * RAND()) --sets the seed to start generating DECLARE @SetSeed INT = RAND(@seed) SELECT CONCAT('Seed used:',@seed); --load all of the data into this table. The PrepId column will --hold a sequential value to make random number easier' DROP TABLE IF EXISTS #PrepFetchRandomDataFrom CREATE TABLE #PrepFetchRandomDataFrom ( --Start at 0 because RAND can produce 0 PrepId INT IDENTITY(0,1), FetchRandomDataFromId INT NOT NULL ) INSERT INTO #PrepFetchRandomDataFrom ( FetchRandomDataFromId ) SELECT FetchRandomDataFromId FROM Demo.FetchRandomDataFrom; --the random number will be this times RAND. (it has to be --captured in a generic bit of code because you could choose a --value > the number of rows SET @RowsInTableCount = (SELECT COUNT(*) FROM #PrepFetchRandomDataFrom); WHILE 1=1 BEGIN --this tells you the number of loops done. SET @Loopcounter = @LoopCounter + 1; --this way, if some error keeps happening, you will --eventually stop IF @LoopCounter >= @DesiredRowCount * @SafetyFactor BEGIN RAISERROR('Loop stopped due to safety counter',10,1); BREAK; END; --fetch one random row SELECT @FetchRandomDataFromId1 = RAND() * @RowsInTableCount; SELECT @FetchRandomDataFromId2 = RAND() * @RowsInTableCount; --do the insert in a try. You may have criteria in your --table that needs to be met (uniqueness, for example) BEGIN TRY --join to get the real value from the table INSERT INTO #HoldFetchRandomDataFrom (FetchRandomDataFromId1,FetchRandomDataFromId2) VALUES((SELECT FetchRandomDataFromId FROM #PrepFetchRandomDataFrom WHERE PrepId = @FetchRandomDataFromId1), (SELECT FetchRandomDataFromId FROM #PrepFetchRandomDataFrom WHERE PrepId = @FetchRandomDataFromId2)); --this is set inside the try, because the row has --been added SET @RowsLoopCounter = @RowsLoopCounter + 1; --this is then end if the rows inserted match the desired --rowcount IF @RowsLoopCounter = @DesiredRowCount BREAK; END TRY BEGIN CATCH --most likely error is a duplicated value DECLARE @msg NVARCHAR(1000); --display the message in case you do something such --that the loop can never complete SET @msg = CONCAT('Some error occurred:' ,ERROR_MESSAGE()) RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message --Note, we keep going. This is what the safety factor --was for, because an error could cause infinite loop END CATCH END; --Output the rows: SELECT * FROM #HoldFetchRandomDataFrom ORDER BY FetchRandomDataFromId1;
Executing this returned:
--------------------- Seed used:1 FetchRandomDataFromId1 FetchRandomDataFromId2 ---------------------- ---------------------- 3542 137378 4606 301534 9346 31374 10626 110734 14702 106954 14978 279338
If you do a little digging, you can see the values from the previous generation in here since I used the same seed:
SELECT FetchRandomDataFromId1 FROM #HoldFetchRandomDataFrom UNION ALL SELECT FetchRandomDataFromId2 FROM #HoldFetchRandomDataFrom ORDER BY FetchRandomDataFromId1
This returns:
FetchRandomDataFromId1
---------------------- 1670 2870 3542 4606 6982 9346 10626
And the first, single row query returned (copied from earlier in the document:
FetchRandomDataFromId --------------------- 1670 2870 3542 4606 6982 10626 12846
You can see a new value in the sequence (which makes sense because here are 2X the number of random values generated), but still the same values.
The post Generating Repeatable Sets Of Test Rows appeared first on Simple Talk.
from Simple Talk https://ift.tt/NDqvOoi
via
No comments:
Post a Comment