Tuesday, January 2, 2024

Two-Dimensional Interval Packing Challenge

Packing intervals is a classic SQL task that involves packing groups of intersecting intervals to their respective continuous intervals. In mathematics, an interval is the subset of all values of a given type, e.g., integer numbers, between some low value and some high value. In databases, intervals can manifest as date and time intervals representing things like sessions, prescription periods, hospitalization periods, schedules, or numeric intervals representing things like ranges of mile posts on a road, temperature ranges, and so on.

An example for an interval packing task is packing intervals representing sessions for billing purposes. If you conduct a web search of the term packing intervals SQL, you’ll find scores of articles on the subject, including my own.

I’ve dealt with many interval packing tasks in the past and used different techniques to handle those. But they all had something in common that kept the complexity level at bay—they were all what you can think of as one-dimensional interval packing tasks. That is, each input represented one interval. One-dimensional intervals can be depicted spatially as segments on a line, and the packing task can be illustrated as identifying the delimiters of the groups of intersecting segments.

During a T-SQL class, while discussing packing challenges, Paul Wilcox from Johns Hopkins inquired whether I dealt with two-dimensional packing tasks, and if I had any references to resources for addressing those. He pointed to a forum post he submitted a few years earlier with a problem that came about due to a scheduling issue he came across when working at a previous school district. I remember Paul trying to use hand gestures representing two dimensional polygons to illustrate the idea. Up to that point it didn’t even occur to me that two-dimensional packing problems even existed, so obviously I had nothing to give him at the time. I promised to look at it after class, though, and address it next morning.

I sometimes refer to solutions or techniques as being elegant. What was interesting about this problem is that the challenge itself was very elegant and compelling. Moreover, I could tell that Paul was very experienced and bright, so if after years since he first stumbled into this task, he was still looking for additional solutions, it had to be a very interesting one.

In this article I present Paul’s two-dimensional packing challenge and my solution. Some of the illustrations I use are like Paul’s own depictions of the logical steps that are involved in the solution.

As a practice tool, I recommend that you attempt to solve the challenge yourself before looking at Paul’s and my solutions. I recommend reading Paul’s post to understand the task, but will also provide full details of the challenge here.

Note: the code for this article can be downloaded here.

The Two-Dimensional Interval Packing Challenge

The task at hand involves student class schedules stored in a table called Schedule, which you create and populate using the following code:

SET NOCOUNT ON;
USE tempdb;
DROP TABLE IF EXISTS dbo.Schedule;
CREATE TABLE dbo.Schedule
(
  id         INT     NOT NULL 
     CONSTRAINT PK_Schedule PRIMARY KEY,
  student    CHAR(3) NOT NULL,
  fromdate   INT     NOT NULL,
  todate     INT     NOT NULL,
  fromperiod INT     NOT NULL,
  toperiod   INT     NOT NULL
);
INSERT INTO dbo.Schedule(id, student, fromdate, todate, 
fromperiod, toperiod) VALUES
    (1, 'Amy',  1,  7,  7,  9),
    (2, 'Amy',  3,  9,  5,  8), 
    (3, 'Amy', 10, 12,  1,  3), 
    (4, 'Ted',  1,  5, 11, 14),
    (5, 'Ted',  7, 11, 13, 16);

Each row has the student in question, a begin date (fromdate) and end date (todate), and a period start time (fromperiod) and period end time (toperiod). Note that the sample data uses integers instead of date and time values for simplicity.

To help understand the current data, it can be convenient to depict it graphically. This can be achieved with the following spatial query, forming a rectangle from each row:

SELECT student,
  GEOMETRY::STGeomFromText('GEOMETRYCOLLECTION('
  + STRING_AGG(CONCAT('POLYGON((',fromdate,' ',fromperiod, ',', 
                                fromdate,' ',toperiod + 1, ',', 
                                todate + 1,' ',toperiod + 1,',', 
                                todate + 1,' ',fromperiod,',', 
                                fromdate,' ',fromperiod ,    
                  '))'), ',') 
 + ')', 0) AS shape
FROM dbo.Schedule
GROUP BY student;

Figure 1 has the graphical output of this query as can be seen in the SSMS Spatial results tab.

Figure 1: Unnormalized Schedule Depicted Graphically

The X axis represents the date ranges, and the Y axis represents the period ranges.

Notice that a student may have overlapping schedules, as is the case with Amy’s schedules 1 and 2. You can think of the current data as representing an unnormalized form of the schedule.

Sometimes you need to query the data to identify a schedule that contains a certain date and period. For example, the following query looks for Amy’s schedule that contains date 4 and period 8:

SELECT id, student, fromdate, todate, fromperiod, toperiod
FROM dbo.Schedule
WHERE student = 'Amy'
  AND 4 BETWEEN fromdate AND todate
  AND 8 BETWEEN fromperiod AND toperiod;

This query generates the following output, showing multiple matches:

id          student fromdate    todate      fromperiod  toperiod
----------- ------- ----------- ----------- ----------- --------
1           Amy     1           7           7           9
2           Amy     3           9           5           8

Multiple matches for a single (date, period) point are only possible due to the unnormalized nature of the schedule. The challenge is to create a normalized state of the schedule, with distinct, nonoverlapping date range and period range combinations. Assuming such a normalized state is stored in a table, you then have a guarantee for no more than one matching row per input (date, period) point.

The tricky part is that there may be multiple arrangements that can achieve this goal. For the sake of this challenge, you just need to pick one that produces a deterministic result. For example, form a result row for each consecutive range of dates with the same period range. Figure 2 has a graphical depiction of the desired normalized schedule.

Figure 2: Desired Normalized Schedule

Informally, you’re forming the smallest number of rectangles possible using only vertical cuts. Obviously, you could go for a strategy that uses horizontal cuts. Or even something more sophisticated that uses a combination of vertical and horizontal cuts. Anyway, assuming the vertical-cuts approach, following is the desired output of your solution:

student fromdate    todate      fromperiod  toperiod
------- ----------- ----------- ----------- -----------
Amy     1           2           7           9
Amy     3           7           5           9
Amy     8           9           5           8
Amy     10          12          1           3
Ted     1           5           11          14
Ted     7           11          13          16

Told you it’s a formidable challenge. Now, to work!

Unpack/Pack Solution for One Dimensional Packing

My approach to solving the two-dimensional packing problem is to rely on a classic technique for handling one-dimensional packing, which you can think of as the Unpack/Pack technique, in multiple steps of the solution. So first, let me describe the Unpack/Pack technique for one-dimensional packing.

Suppose that in our Schedule table you only had period ranges and needed to normalize them. This is of course a completely contrived example, but I’m using it for convenience since we already have sample data available to work with. Here’s a query showing just the periods:

SELECT id, student, fromperiod, toperiod
FROM dbo.Schedule
ORDER BY student, fromperiod, toperiod;

This query generates the following output:

id          student fromperiod  toperiod
----------- ------- ----------- -----------
3           Amy     1           3
2           Amy     5           8
1           Amy     7           9
4           Ted     11          14
5           Ted     13          16

Suppose that the task was to pack intersecting periods per student. Here’s the desired output of the solution with the packed periods:

student fromperiod  toperiod
------- ----------- -----------
Amy     1           3
Amy     5           9
Ted     11          16

As mentioned earlier, there are many solutions out there for packing intervals. The Unpack/Pack solution involves the following steps:

  1. Unpack each interval to the individual values that constitute the set.
  2. Compute a group identifier using a classic islands technique.
  3. Group the data by the group identifier to form the packed intervals.

Let’s start with Step 1. You’re supposed to unpack each interval to the individual values that constitute the interval set. Recall that an interval is the set of all values between the low and high values representing the interval delimiters. For example, given that our sample data uses an integer type for the period delimiters, Amy’s interval with fromperiod 5 and toperiod 8 should be unpacked to the set {5, 6, 7, 8}.

If you’re using Azure SQL or SQL Server 2022 or above, you can achieve this with the GENERATE_SERIES function, like so:

SELECT S.id, S.student, P.value AS p
FROM dbo.Schedule AS S
  CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
ORDER BY S.student, p, S.id;

This query generates the following output:

id          student p
----------- ------- -----------
3           Amy     1
3           Amy     2
3           Amy     3
2           Amy     5
2           Amy     6
1           Amy     7
2           Amy     7
1           Amy     8
2           Amy     8
1           Amy     9
4           Ted     11
4           Ted     12
4           Ted     13
5           Ted     13
4           Ted     14
5           Ted     14
5           Ted     15
5           Ted     16

If you don’t have access to the GENERATE_SERIES function, you can either create your own, or use a numbers table.

Step 2 involves computing a unique group identifier per packed interval. Observe that each packed interval contains a consecutive range of unpacked p values, possibly with duplicates. As an example, see Amy’s range between 5 and 9: {5, 6, 7, 7, 8, 8, 9}. The group identifier can be computed as the p value, minus the dense rank value (because there may be duplicates) based on p value ordering, within the student partition. Here’s the query that achieves this, showing also the dense rank values separately for clarity:

SELECT S.id, S.student, P.value AS p,
  DENSE_RANK() OVER(PARTITION BY S.student ORDER BY P.value) 
                                                     AS drk,
  P.value - DENSE_RANK() OVER(PARTITION BY S.student 
                              ORDER BY P.value) AS grp_p
FROM dbo.Schedule AS S
  CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
ORDER BY S.student, p, S.id;

This query generates the following output:

id          student p      drk       grp_p
----------- ------- ------ --------- --------
3           Amy     1      1         0
3           Amy     2      2         0
3           Amy     3      3         0
2           Amy     5      4         1
2           Amy     6      5         1
1           Amy     7      6         1
2           Amy     7      6         1
1           Amy     8      7         1
2           Amy     8      7         1
1           Amy     9      8         1
4           Ted     11     1         10
4           Ted     12     2         10
4           Ted     13     3         10
5           Ted     13     3         10
4           Ted     14     4         10
5           Ted     14     4         10
5           Ted     15     5         10
5           Ted     16     6         10

You get a group identifier (grp_p) that is the same per packed interval group and unique per group.

Step 3 then groups the unpacked data by the student and the group identifier, and computes the packed interval delimiters with the MIN(p) an MAX(p) aggregates, like so:

WITH C AS
(
  SELECT S.id, S.student, P.value AS p,
    DENSE_RANK() OVER(PARTITION BY S.student ORDER BY P.value) 
                                                       AS drk,
    P.value - DENSE_RANK() OVER(PARTITION BY S.student 
                                ORDER BY P.value) AS grp_p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
)
SELECT student, MIN(p) AS fromperiod, MAX(p) AS toperiod
FROM C
GROUP BY student, grp_p
ORDER BY student, fromperiod;

This query generates the following desired output:

student fromperiod  toperiod
------- ----------- -----------
Amy     1           3
Amy     5           9
Ted     11          16

This technique is obviously not very optimal when the sets representing the intervals can contain large numbers of members. But it works quite well for small sets and is essential for handling the two-dimensional packing challenge.

Solution for Two-Dimensional Packing

Armed with the Unpack/Pack technique, you are ready to tackle the two-dimensional packing task. The trick is to realize that you can use it multiple times in the solution—once per dimension.

In Step 1, you unpack the two-dimensional date range, period range values to the individual date (d), period (p) values. It’s probably easiest to think of this step in graphical terms, as pixelating (thanks Paul for the terminology!) the rectangles representing each student class schedule from Figure 1 shown earlier. Here’s the code to achieve this step:

SELECT S.id, S.student, D.value AS d, P.value AS p
FROM dbo.Schedule AS S
  CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
  CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
ORDER BY S.student, d, p, S.id;

This query generates the following output:

id          student d           p
----------- ------- ----------- -----------
1           Amy     1           7
1           Amy     1           8
1           Amy     1           9
1           Amy     2           7
1           Amy     2           8
1           Amy     2           9
2           Amy     3           5
2           Amy     3           6
1           Amy     3           7
2           Amy     3           7
1           Amy     3           8
2           Amy     3           8
1           Amy     3           9
2           Amy     4           5
2           Amy     4           6
1           Amy     4           7
2           Amy     4           7
1           Amy     4           8
2           Amy     4           8
1           Amy     4           9
2           Amy     5           5
2           Amy     5           6
1           Amy     5           7
2           Amy     5           7
1           Amy     5           8
2           Amy     5           8
1           Amy     5           9
2           Amy     6           5
2           Amy     6           6
1           Amy     6           7
2           Amy     6           7
1           Amy     6           8
2           Amy     6           8
1           Amy     6           9
2           Amy     7           5
2           Amy     7           6
1           Amy     7           7
2           Amy     7           7
1           Amy     7           8
2           Amy     7           8
1           Amy     7           9
2           Amy     8           5
2           Amy     8           6
2           Amy     8           7
2           Amy     8           8
2           Amy     9           5
2           Amy     9           6
2           Amy     9           7
2           Amy     9           8
3           Amy     10          1
3           Amy     10          2
3           Amy     10          3
3           Amy     11          1
3           Amy     11          2
3           Amy     11          3
3           Amy     12          1
3           Amy     12          2
3           Amy     12          3
...

You can use the following helper query to depict Step 1’s result graphically:

WITH Pixels AS
(
  SELECT S.id, S.student, D.value AS d, P.value AS p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
)
SELECT student,
  GEOMETRY::STGeomFromText('GEOMETRYCOLLECTION('
    + STRING_AGG(CONCAT('POLYGON((', d    , ' ', p    , ',', 
                                     d    , ' ', p + 1, ',', 
                                     d + 1, ' ', p + 1, ',', 
                                     d + 1, ' ', p    , ',', 
                                     d    , ' ', p    , '))'), ',') 
    + ')', 0) AS shape
FROM Pixels
GROUP BY student;

Figure 3 has the graphical result from the Spatial results tab in SSMS.

Figure 3: Pixelated Schedule

Now you need to decide on the normalization strategy that you want to use. Assuming you decide to use what I referred to earlier as the vertical cuts approach, you can proceed to Step 2. To apply vertical cuts, you pack period ranges per student and date. This is achieved by assigning the group identifier grp_p to each distinct range of consecutive p values per student and d group. Here’s the code that achieves this:

WITH PixelsAndPeriodGroupIDs AS
(
  SELECT S.id, S.student, D.value AS d, P.value AS p,
    P.value - DENSE_RANK() OVER(PARTITION BY S.student, D.value 
                                ORDER BY P.value) AS grp_p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
)
SELECT student, d, MIN(p) AS fromperiod, MAX(p) AS toperiod
FROM PixelsAndPeriodGroupIDs
GROUP BY student, d, grp_p
ORDER BY student, d, grp_p;

Here’s the output of the inner query defining the CTE PixelsAndPeriodGroupIDs, showing the computed grp_p value per pixel, if you will:

id          student d           p           grp_p
----------- ------- ----------- ----------- --------------------
1           Amy     1           7           6
1           Amy     1           8           6
1           Amy     1           9           6
1           Amy     2           7           6
1           Amy     2           8           6
1           Amy     2           9           6
2           Amy     3           5           4
2           Amy     3           6           4
2           Amy     3           7           4
1           Amy     3           7           4
1           Amy     3           8           4
2           Amy     3           8           4
1           Amy     3           9           4
2           Amy     4           5           4
2           Amy     4           6           4
2           Amy     4           7           4
1           Amy     4           7           4
1           Amy     4           8           4
2           Amy     4           8           4
1           Amy     4           9           4
2           Amy     5           5           4
2           Amy     5           6           4
2           Amy     5           7           4
1           Amy     5           7           4
1           Amy     5           8           4
2           Amy     5           8           4
1           Amy     5           9           4
2           Amy     6           5           4
2           Amy     6           6           4
2           Amy     6           7           4
1           Amy     6           7           4
1           Amy     6           8           4
2           Amy     6           8           4
1           Amy     6           9           4
2           Amy     7           5           4
2           Amy     7           6           4
2           Amy     7           7           4
1           Amy     7           7           4
1           Amy     7           8           4
2           Amy     7           8           4
1           Amy     7           9           4
2           Amy     8           5           4
2           Amy     8           6           4
2           Amy     8           7           4
2           Amy     8           8           4
2           Amy     9           5           4
2           Amy     9           6           4
2           Amy     9           7           4
2           Amy     9           8           4
3           Amy     10          1           0
3           Amy     10          2           0
3           Amy     10          3           0
3           Amy     11          1           0
3           Amy     11          2           0
3           Amy     11          3           0
3           Amy     12          1           0
3           Amy     12          2           0
3           Amy     12          3           0
…

And here’s the output of outer query, showing the packed period ranges after grouping:

student d           fromperiod  toperiod
------- ----------- ----------- -----------
Amy     1           7           9
Amy     2           7           9
Amy     3           5           9
Amy     4           5           9
Amy     5           5           9
Amy     6           5           9
Amy     7           5           9
Amy     8           5           8
Amy     9           5           8
Amy     10          1           3
Amy     11          1           3
Amy     12          1           3
Ted     1           11          14
Ted     2           11          14
Ted     3           11          14
Ted     4           11          14
Ted     5           11          14
Ted     7           13          16
Ted     8           13          16
Ted     9           13          16
Ted     10          13          16
Ted     11          13          16

What’s important to note here is that you created packed period ranges per student and date. This is convenient to depict graphically using the following spatial query:

WITH PixelsAndPeriodGroupIDs AS
(
  SELECT S.id, S.student, D.value AS d, P.value AS p,
    P.value - DENSE_RANK() OVER(PARTITION BY S.student, D.value 
                                ORDER BY P.value) AS grp_p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
),
DailyPeriodRanges AS
(
  SELECT student, d, MIN(p) AS fromperiod, MAX(p) AS toperiod
  FROM PixelsAndPeriodGroupIDs
  GROUP BY student, d, grp_p
)
SELECT student,
  GEOMETRY::STGeomFromText('GEOMETRYCOLLECTION('
    + STRING_AGG(CONCAT('POLYGON((',d,' ',fromperiod  , ',', 
                                 d,' ',toperiod + 1,',', 
                                 d + 1,' ',toperiod + 1, ',', 
                                 d + 1,' ',fromperiod ,',',
                                 d,' ',fromperiod,'))'),',') 
    + ')', 0) AS shape
FROM DailyPeriodRanges
GROUP BY student;

The output from the Spatial results tab in SSMS shown in Figure 4.

Figure 4: Daily Period Ranges

What you achieved here is sort of daily binning (again, thanks Paul for the terminology) of the packed period ranges. What’s left then in Step 3, is to pack consecutive date ranges per student and period range. Here’s the code to achieve this:

WITH PixelsAndPeriodGroupIDs AS
(
  SELECT S.id, S.student, D.value AS d, P.value AS p,
    P.value - DENSE_RANK() OVER(PARTITION BY S.student, D.value 
                                ORDER BY P.value) AS grp_p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
),
PeriodRangesAndDateGroupIDs AS
(
  SELECT student, d, MIN(p) AS fromperiod, MAX(p) AS toperiod,
    D - DENSE_RANK() OVER(PARTITION BY student, MIN(p), MAX(p) 
                          ORDER BY d) as grp_d
  FROM PixelsAndPeriodGroupIDs
  GROUP BY student, d, grp_p
)
SELECT student, MIN(d) AS fromdate, MAX(d) AS todate, 
       fromperiod, toperiod
FROM PeriodRangesAndDateGroupIDs
GROUP BY student, fromperiod, toperiod, grp_d
ORDER BY student, fromdate, fromperiod;

The main trick here is that in the second CTE (PeriodRangesAndDateGroupIDs), where you apply the classic islands technique a second time, when computing the dense rank value, you partition it by the student and the period range (student, MIN(p), MAX(p)), and order it by d.

Here’s the output of the second CTE’s inner query:

student d           fromperiod  toperiod    grp_d
------- ----------- ----------- ----------- ----------
Amy     10          1           3           9
Amy     11          1           3           9
Amy     12          1           3           9
Amy     8           5           8           7
Amy     9           5           8           7
Amy     3           5           9           2
Amy     4           5           9           2
Amy     5           5           9           2
Amy     6           5           9           2
Amy     7           5           9           2
Amy     1           7           9           0
Amy     2           7           9           0
Ted     1           11          14          0
Ted     2           11          14          0
Ted     3           11          14          0
Ted     4           11          14          0
Ted     5           11          14          0
Ted     7           13          16          6
Ted     8           13          16          6
Ted     9           13          16          6
Ted     10          13          16          6
Ted     11          13          16          6

And here’s the output of the outer query, showing the final desired result with the normalized schedule:

student fromdate    todate      fromperiod  toperiod
------- ----------- ----------- ----------- -----------
Amy     1           2           7           9
Amy     3           7           5           9
Amy     8           9           5           8
Amy     10          12          1           3
Ted     1           5           11          14
Ted     7           11          13          16

If you wish to illustrate the result graphically, you can use the following helper spatial query:

WITH PixelsAndPeriodGroupIDs AS
(
  SELECT S.id, S.student, D.value AS d, P.value AS p,
    P.value - DENSE_RANK() OVER(PARTITION BY S.student, D.value
                                ORDER BY P.value) AS grp_p
  FROM dbo.Schedule AS S
    CROSS APPLY GENERATE_SERIES(S.fromdate, S.todate) AS D
    CROSS APPLY GENERATE_SERIES(S.fromperiod, S.toperiod) AS P
),
PeriodRangesAndDateGroupIDs AS
(
  SELECT student, d, MIN(p) AS fromperiod, MAX(p) AS toperiod,
    d - DENSE_RANK() OVER(PARTITION BY student, MIN(p), MAX(p) 
                          ORDER BY d) as grp_d
  FROM PixelsAndPeriodGroupIDs
  GROUP BY student, d, grp_p
),
NormalizedSchedule AS
(
  SELECT student, MIN(d) AS fromdate, MAX(d) AS todate, 
        fromperiod, toperiod
  FROM PeriodRangesAndDateGroupIDs
  GROUP BY student, fromperiod, toperiod, grp_d
)
SELECT student,
  GEOMETRY::STGeomFromText('GEOMETRYCOLLECTION('
   + STRING_AGG(CONCAT('POLYGON((',fromdate,' ',fromperiod,',',
                                  fromdate,' ',toperiod+1,',',
                                  todate + 1,' ',toperiod+1,',',
                                  todate + 1,' ',fromperiod,',', 
                                  fromdate  , ' ', fromperiod  ,   
                               '))'), 
     ',') 
+ ')', 0) AS shape
FROM NormalizedSchedule
GROUP BY student;

The output in the Spatial results tab in SSMS is the same as what I’ve shown earlier in Figure 2 as the graphical depiction of the desired result.

Conclusion

Thanks again to Paul Wilcox for introducing the two-dimensional packing challenge. It’s a beautiful puzzler and I enjoyed very much working on it. I hope that’s the case for you too.

You saw how important it is to develop a toolbox of fundamental techniques, such as the classic islands technique, and based on it, the Unpack/Pack technique. Those were essential for solving this more complex challenge. You also saw how useful it can be to think of some tasks in graphical terms, and even utilize T-SQL spatial tools for this purpose.

Happy querying!

 

The post Two-Dimensional Interval Packing Challenge appeared first on Simple Talk.



from Simple Talk https://ift.tt/xyBsQte
via

No comments:

Post a Comment