The SQL NTILE()
is a window function that allows you to break a table into a specified number of approximately equal groups, or <bucket count>
. For each row in a grouping, the NTILE()
function assigns a bucket number representing the group to which the row belong starting at one.
The syntax of the NTILE()
function is:
NTILE(<bucket count>) OVER (PARTITION BY <expression list> ORDER BY <sort parameter list>)
The <bucket count>
is a literal positive integer or an expression that evaluates to a positive integer. Most of the time is an integer constant but using the option of an expression can be a handy trick.
PARTITION BY
The PARTITION BY
clause divides the result set returned from the FROM
clause into partitions to which the NTILE()
function is applied. This is the way this subclause is used inside the OVER ()
clause in other windowed functions.
ORDER BY
The ORDER BY
clause specifies the order of rows in each partition to which the NTILE()
is applied. Each parameter in the list can have an optional sort order attached to it with the optional [ASC
| DESC
] postfix. This is the way this subclause is used inside the OVER ()
In other windowed functions. This ordering is how The rows of the table are scanned to pick up groups.
If the number of rows in the results at is divisible by <bucket count>
, the rows are divided evenly among groups. But if the number of rows is not divisible by <bucket count>
, the NTILE()
function results in groups of two sizes. The larger group always come before the smaller groups in the order specified by the ORDER BY
clause.
Easier to See With an Example
Let’s create some sample data And see how this works. For example, we’re going to do a pretty small table, but it’s worth mentioning that when you’re doing an NTILE
problem in the real world, you generally want a large population. Small sample sizes lead to badly sized groups.
The following statement creates a new table named population that stores 10 integers from one to ten:
BEGIN CREATE TABLE Population (pop_id INTEGER NOT NULL PRIMARY KEY); INSERT INTO Population(pop_id) VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10); END;
As a simple example, let’s use two for the bucket count:
SELECT pop_id, (NTILE (2) OVER (ORDER BY pop_id)) AS grouped_pop FROM Population;
This returns two groups of data as you can see in the grouped_pop
column.
pop_id grouped_pop
----------- --------------------
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
What NTILE is NOT
The first thing we need to get out of the way, is to discuss what this function is not. Too often programmers who use it for the first-time attribute properties to which it doesn’t have.
Not a Histogram
The NTILE()
function attempts to make all the buckets exactly the same size, but that’s not always possible with a given population. People often mistake this function for histogram, in which we might want buckets of different sizes. We know we have only a few billionaires in our population, and we know we have a big middle class and a relatively small number of truly poor people.
Another problem is that the attribute upon which you’re basing these buckets might not be very distinct. At the extreme, imagine that every employee has exactly the same salary amount. None of the NTILE()
groups based on salary can be expected to be better than any other group.
Not a Median
SQL does not have a built-in median aggregate function. Decades ago there were articles in trade magazines, Database Programming And Design and DBMS on how to write them in standard SQL (and back in 2009, I wrote this article on the subject here on Simple-Talk). We had various clever solutions, and it was a fun problem that bounced back and forth between the two magazines.
It’s very tempting to use the NTILE()
function to compute the median by finding the highest value in the first partition, the lowest value in the second partition, adding them and dividing by two. With the first 10 row example, we get (5 +6) / 2.0 = 5.5
for the median. Using this method. I’ll leave it to the reader to write this algorithm in one SQL statement.
Unfortunately, this does just does not work. There is no guarantee the first and second partitions will be the same size as extra values will be thrown into the first partition. If population wasn’t evenly divisible by two. It gets even worse. if you try dividing the population into three groups, the tops and bottoms are not the same size. The first group has four rows while the other groups have three rows. Ten does not divide evenly by three without a remainder, so NTILE(3)
would return the following in our previous query.
pop_id grouped_pop
----------- --------------------
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
Not Exactly a Partition
A set is partitioned when all of the subsets in the partitioning union back into the original set, and the intersection of all of those subsets is empty. The first thing we need to do is to guarantee the value were putting into these groups is unique. That’s easy enough to do with a SELECT DISTINCT
operation. Then add a group number to those unique rows. Finally, find the minimum and maximum value within each group. This will give you a partitioning of the original data but remember that you’ve taken the samples based on the bucket size. There are a lot of assumptions being made here.
SELECT salary_grp, MIN(salary_amt) AS range_start, MAX(salary_amt) AS range_finish FROM (SELECT X.salary_amt, NTILE(5) OVER (ORDER BY X.salary_amt) AS salary_grp FROM (SELECT DISTINCT salary_amt FROM Personnel)) AS X(salary_amt) GROUP BY X. grp_nbr;
PARTITION BY Clause
The Windowing functions in SQL pick a particular column or expression to reset their calculations. In effect, they repeat the actions of the rest of the function of the subsets formed by the partition expression. This is one of those things that, again, Is easier to show than to say.
Let’s imagine you got a table of candidates for Mensa, the high IQ society. To get membership, you have to have an IQ in the top 2% of the population. There is a list of acceptable IQ test that you can submit. The most common IQ tests are:
- Stanford-Binet Intelligence Scale.
- Universal Nonverbal Intelligence.
- Differential Ability Scales.
- Peabody Individual Achievement Test.
- Wechsler Individual Achievement Test.
- Wechsler Adult Intelligence Scale.
- Woodcock Johnson III Tests of Cognitive Disabilities.
Candidates just must score high on one test to get membership But to play safe, candidates will submit more than one test score. Let’s say that our table looks like this (it references a table named IQ_Tests
that we will not create):
CREATE TABLE Mensa_Candidates ( candidate_name CHAR(25) NOT NULL, test_name CHAR(25) NOT NULL -- not including related tables in this example -- REFERENCES IQ_Tests (test_name), PRIMARY KEY (candidate_name, test_name), test_score INTEGER NOT NULL CHECK (test_score >= 0) );
Then I will create a bit of data to demo with:
INSERT INTO dbo.Mensa_Candidates ( candidate_name, test_name,test_score) VALUES ('Person 1','Test A', 100),('Person 9','Test A', 300), ('Person 1','Test B', 100),('Person 9','Test B', 130), ('Person 2','Test A', 130),('Person 10','Test A', 600), ('Person 2','Test B', 200),('Person 10','Test B', 200), ('Person 3','Test A', 300),('Person 11','Test A', 300), ('Person 3','Test B', 120),('Person 12','Test A', 440), ('Person 4','Test B', 133),('Person 13','Test B', 150), ('Person 5','Test A', 400),('Person 14','Test A', 320), ('Person 5','Test B', 100),('Person 14','Test B', 400), ('Person 6','Test A', 300),('Person 15','Test A', 300), ('Person 7','Test A', 130),('Person 15','Test B', 500), ('Person 8','Test B', 104),('Person 16','Test A', 600)
You could analyze candidates test scores by using NTILE()
in the following manner:
SELECT candidate_name, test_name, test_score, NTILE(10) OVER (PARTITION BY test_name ORDER BY test_score) AS test_ranking FROM Mensa_Candidates WHERE test_ranking = 10;
You can see from the output that
candidate_name test_name test_score test_ranking
--------------- ----------- ----------- --------------
Person 16 Test A 600 10
Person 2 Test B 800 10
This says were going to partition our data by the test names, in increasing order by test score, so we will get the best 10 percent of candidates for each test.
If a candidate was in more than one upper decile, then their name will appear more than once in the result set, once with each of the tests they took. We could then group the stable and find out how many people are qualified on more than one test. You can see this in the test data for Person 1 who falls in the lower 10 percent for each test:
Change the test_ranking
criteria to 1, and you will see the bottom 10 percent.
candidate_name test_name test_score test_ranking
--------------- ----------- ----------- --------------
Person 1 Test A 100 1
Person 2 Test A 130 1
Person 5 Test B 100 1
Person 8 Test B 104 1
Conclusions
While the NTILE()
function is not a complete statistical package in itself, you can quickly use it to explore your data without leaving SQL. The other window functions can also be quite useful. And it’s definitely worth taking a few days with some simple sample data to play with them and learn how to use them.
The post The NTILE Function appeared first on Simple Talk.
from Simple Talk https://ift.tt/Em35Ubs
via
No comments:
Post a Comment