Wednesday, September 2, 2020

Weighted Randomization In T-SQL

When I am not working on or writing about SQL, I am posting theme park pictures on Twitter of DisneyWorld (@DisneyPicADay) and Dollywood (@DollywoodP). The fun part of it is taking the pictures. The HARD part is deciding what to post every day. So instead of picking things from my head, I decided to harness the power of SQL Server and build a random picture tweet generator using SQL Server (I installed an Express Edition server on my machine to house my “production” data :)). I loaded all of my prepared picture files in a filetable in SQL Server, and created tables for an inventory of all of the locations and attractions at the US Disney parks (Dollywood will come later), and tagged the pictures. The entire process is something I may post about later. Mostly to show some other cool SQL Server features, but when the code is clean enough, I will put it all out on github.

Making a random number generator is pretty easy in SQL Server, just pick the top and bottom values and use the RAND() function:

DECLARE @MinValue int = 1, @MaxValue int = 10; 

SELECT FLOOR(RAND()*(@MaxValue-@MinValue+1))+@MinValue;

Run that code a few million times, it will give you a value between 1 and 10 every time, each approximately 10 percent of the time.

The thing is, while I have 77 pictures of my favorite roller coaster, Expedition Everest at Disney’s Animal Kingdom, and 70 of The Tower of Terror at Hollywood Studios, I only have 2 of the Flame Tree Restaurant at Animal Kingdom and many other things that aren’t as exciting to post about. If I randomly choose attractions using a non-weighted random number generator, it would be just as likely to get the lesser items as the same frequency as the greater items. Hence, I want my popular items to come up most frequently, but every once in a while, I want to be surprised by something different.

This is where I needed to build a weighted randomized value. (There is more than weighting to the process of choosing what to post, naturally, like picture usage, seasons, holidays, etc. This may come up in a later post.) So, for my example, I will create an object to store the list of items that I have to choose from. To keep it simple, I will create a very simple Item table, and give each item row a simple name. Just 10 items will be enough to demonstrate the concept, but this will work for a large number of items easily:

CREATE SCHEMA Assets;
GO
CREATE TABLE Assets.Item
(
   ItemId int NOT NULL CONSTRAINT PKItem PRIMARY KEY,
   Name varchar(20) NOT NULL CONSTRAINT AKItem UNIQUE
);
GO
INSERT INTO Assets.Item(ItemId, Name)
VALUES(1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),
      (5,'Five'),(6,'Six'),(7,'Seven'),(8,'Eight'),
      (9,'Nine'),(10,'Ten');

Let’s start by showing the basic process. To store test data, I will create a table to hold the values that are randomly chosen. It is always important to test out your random number generators to make sure you get a reasonable distribution, at least most of the times you execute it.

CREATE TABLE #HoldItemPicks(ItemId int, RandomNumber int);

Then I will insert 10000 items using GO <repeat count>:

SET NOCOUNT ON; --This is essential when doing GO <repeat>
GO
TRUNCATE TABLE #HoldItemPicks; --for repeated testing
GO
DECLARE @MinRand int = 1, @MaxRand int = 10; --we know the id values
--get the random number
DECLARE @RandomNumber int = FLOOR(RAND()*(@MaxRand-@MinRand+1))+@MinRand;

--insert the rows
INSERT INTO #HoldItemPicks(ItemId, RandomNumber)
VALUES(@RandomNumber,@RandomNumber);
GO 10000

--output the rows
SELECT ItemId, COUNT(*)
FROM #HoldItemPicks
GROUP BY ItemId
ORDER BY COUNT(*) DESC;

Take a look at the distribution of values, and you should see something very similar to the following. It is a RANDOM number generator, so you might actually get the same values over and over. But most of the time, the distribution will be close to what you expect, with some random value having a few more values, and others having slightly less:

ItemId
----------- -----------
4           1071
10          1062
1           1019
9           1007
8           1006
5           1003
3           1003
7           952
6           949
2           928

That method of using RAND() directly works fine, if you have no gaps in your sequence, but that is not ideal in most cases. An alternative is to use NEWID() to sort the output randomly, and insert the values in the table.

TRUNCATE TABLE #HoldItemPicks;
GO

--Alternative:
INSERT INTO #HoldItemPicks(ItemId)
SELECT TOP (1) ItemId
FROM Assets.Item
ORDER BY NEWID();
GO 10000

SELECT ItemId, COUNT(*)
FROM #HoldItemPicks
GROUP BY ItemId
ORDER BY COUNT(*) DESC

There will generally be no difference in this output from method, other than performance since instead of a quick random number, you have to generate a guid for every row, sort the values, and grab a row:

ItemId
----------- -----------
5           1066
9           1061
3           1029
8           1003
4           1002
6           990
1           976
7           969
2           955
10          949

But to the problem at hand. I will want to make sure my best pictures get chosen more often that the others. Let’s say 9 and 10 are the ones that we want to see more often, so let’s make sure that they get 10 times the attention as everyone else. I will do this by adding a column to the Item table, like this:

ALTER TABLE Assets.Item
   ADD ProbabilityFactor tinyint NOT NULL
           CONSTRAINT DFLTItem_ProbabilityFactor DEFAULT (1) --DEFAULT to lowest
           CONSTRAINT CHKItem__ProbabilityFactorDomain 
                    CHECK (ProbabilityFactor BETWEEN 1 AND 100); --can increase up to 100 X
GO
--now set 9 and 10 to 10X so they can be 10 times more likely to show up.
UPDATE Item
SET ProbabilityFactor =ProbabilityFactor * 10
FROM Assets.Item
WHERE itemId IN (9,10);

Checking out the data:

SELECT *
FROM Assets.Item;

Pretty clear, 9 and 10 are set to be 10 times more likely to be chosen.

ItemId      Name                 ProbabilityFactor
----------- -------------------- -----------------
1           One                  1
2           Two                  1
3           Three                1
4           Four                 1
5           Five                 1
6           Six                  1
7           Seven                1
8           Eight                1
9           Nine                 10
10          Ten                  10

But how to implement this? I took my inspiration from lottery tickets and a post I can’t find again that was built for a different language. After I had finished, I found this post on StackOverflow with the base workings of a similar approach (https://stackoverflow.com/questions/26971198/create-a-random-selection-weighted-on-number-of-points-sql). The idea is, if 9 and 10 need to be 10 times more likely to be chosen, then I need to give them 10 times more tickets. I will do this using the following VIEW object:

CREATE OR ALTER VIEW Assets.ItemLotterySetup
AS
   SELECT ItemId, 
   
   --calculate ticket start the RunningTotal from the previous item
           LAG(BaseRows.RunningTotal,1,0) OVER (ORDER BY ItemId) AS TicketStart,

   --then the end is the running total + current probability (or # of tickets for the lottery)
           BaseRows.EndingProbabilityFactorFactorBase + 
                LAG(BaseRows.RunningTotal,1,0) OVER (ORDER BY ItemId) -1 AS TicketEnd
FROM (
       SELECT ItemId, ProbabilityFactor AS EndingProbabilityFactorFactorBase,
              --this provides the value for the start and end seperated 
              SUM(Item.ProbabilityFactor) OVER (ORDER BY ItemId) AS RunningTotal
FROM Assets.Item ) AS BaseRows;

Query this view, and you will see the following, which we can use to do a between on the start and end values with a random value:

ItemId      BaseTicketsStart BaseTicketsEnd
----------- ---------------- --------------
1           0                0
2           1                1
3           2                2
4           3                3
5           4                4
6           5                5
7           6                6
8           7                7
9           8                17
10          18               27

So let’s truncate the temp table and run this test again (if you have a lot of values, and a non-volatile list, you could store this to a permanent table and index the values too):

TRUNCATE TABLE #HoldItemPicks;
GO

DECLARE @MinTicket int, @MaxTicket int;

--We need the start and end values. I calculate them, because you 
--could also add a filter to the process so the start and end
--change
SELECT @MinTicket = MIN(ItemLotterySetup.TicketStart),
       @MaxTicket = MAX(ItemLotterySetup.TicketEnd)
FROM Assets.ItemLotterySetup;

--grab a random number
DECLARE @RandomNumber int = FLOOR(RAND()*(@MaxTicket-@MinTicket+1))+@MinTicket;

--fetch the value
INSERT INTO #HoldItemPicks(ItemId, RandomNumber)
SELECT ItemId, @RandomNumber
FROM Assets.ItemLotterySetup
WHERE @RandomNumber BETWEEN ItemLotterySetup.TicketStart AND ItemLotterySetup.TicketEnd;
GO 10000

If you look at the data from the values you have generated:

SELECT ItemId, COUNT(*)
FROM #HoldItemPicks
GROUP BY ItemId
ORDER BY COUNT(*) DESC;

Now you can see that 9 and 10 are indeed picked ~10 times more often:

ItemId
----------- -----------
9           3614
10          3598
2           378
1           366
7           364
3           347
4           345
8           341
6           339
5           308

The post Weighted Randomization In T-SQL appeared first on Simple Talk.



from Simple Talk https://ift.tt/3hR4W78
via

No comments:

Post a Comment