百分位数:如果将一组数据从小到大排序,并计算相应的累计百分位,则某一百分位所对应数据的值就称为这一百分位的百分位数。可表示为:一组n个观测值按数值大小排列。如,处于p%位置的值称第p百分位数。
下面给出3种计算方式:
1. PERCENT_RANK() OVER(ORDER BY .....)
返回某列或某列组合后每行的百分比排序,返回值在0~1之间,使用此函数可以直接得出百分位数
2. RANK() OVER(ORDER BY .....)
使用rank()函数可以统计出当前行的排名,配合总数即可算出百分位数,总数使用COUNT(1) OVER() 即可得出
3. COUNT(1) OVER(ORDER BY ..... RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
手动调整窗口范围,确认当前行的排名,配合总数即可算出百分位数,总数使用COUNT(1) OVER() 即可得出
下面将举例给出具体使用方法
举例场景:计算学生成绩的百分位数
注:本次测试在oracle环境下完成,不过使用到的函数绝大部分数据库都支持,大家有兴趣的话可以尝试一下其他数据库
创建学生成绩表:
CREATE TABLE TEST.STUDENT_SCORE(
name varchar(20), --学生姓名
course varchar(20), --科目
score NUMBER(5,2) --成绩
);
写入测试数据:
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','政治',90.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','政治',79.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','政治',85.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','政治',93.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('小明','政治',92.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('小红','政治',88.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('小吕','政治',76.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('小高','政治',93.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','外语',87.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','外语',92.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','外语',69.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','外语',76.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('小高','外语',76.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','高数',95.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','高数',70.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','高数',65.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','高数',88.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','算法',59.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('张三','数据结构',99.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('李四','数据结构',89.0);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('王五','数据结构',69.5);
INSERT INTO TEST.STUDENT_SCORE (name,course,score) VALUES ('赵六','数据结构',90.5);
1.使用 PERCENT_RANK() OVER(ORDER BY .....) 计算各个科目的百分位数:
--用法非常简单,此处将百分位数乘100,使百分位数在0~100之间
SELECt
name 姓名,
course 科目,
score 成绩,
ROUND(PERCENT_RANK() OVER(PARTITION BY COURSE ORDER BY SCORE DESC)*100, 2) 百分位数
FROM TEST.STUDENT_SCORE ;
结果:
2.使用 RANK() OVER(ORDER BY .....) 计算各个科目的百分位数:
--这种写法使用总人数和排名来计算百分位,复杂一些,但是算法可以自己修改
SELECt
name 姓名,
course 科目,
score 成绩,
score_rank 排名,
students 总人数,
CASE WHEN students > 1
THEN ROUND(score_rank * 100 / (students - 1), 2)
ELSE 0
END 百分位数
FROM (
SELECt
name,
course,
score,
RANK() over(PARTITION BY course ORDER BY score DESC)-1 score_rank, --当前行的排名
count(1) over(PARTITION BY course) students --当前科目的总人数
FROM TEST.STUDENT_SCORE
);
结果(排名从0开始):
3.使用COUNT(1) OVER(ORDER BY ..... RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) 计算各个科目的百分位数:
--这种写法更复杂一些,结果与前面的是一样的,可以修改的地方更多一些
SELECt
name 姓名,
course 科目,
score 成绩,
score_rank 排名,
students 总人数,
CASE WHEN students > 1
THEN ROUND(score_rank * 100 / (students - 1), 2)
ELSE 0
END 百分位数
FROM (
SELECt
name,
course,
score,
count(1) over(PARTITION BY course ORDER BY score RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) score_rank, --当前行的排名
count(1) over(PARTITION BY course) students --当前科目的总人数
FROM TEST.STUDENT_SCORE
);
结果(排名从0开始):
扩展知识:
--OVER()函数的子句,可以选择窗口的范围,需要配合ORDER BY子句使用
ROWS|RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW --从第一行到当前行
ROWS|RANGE BETWEEN CURRENT ROW AND n FOLLOWING --从当前行到随后的n行
ROWS|RANGE BETWEEN n PRECEDING AND CURRENT ROW --从前n行到当前行
ROWS|RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING --从当前行到结尾行
ROWS|RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING --从下一行到结尾行
ROWS、RANGE的区别:
ROWS是以行来划分窗口范围
RANGE会以实际值来划分窗口范围
举例:
--从下一行到结尾行
SELECt
name,
course,
score,
count(1) over(PARTITION BY course ORDER BY score RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) RANK1,
count(1) over(PARTITION BY course ORDER BY score ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) RANK2
FROM TEST.STUDENT_SCORE ;
结果:
由上述结果就可以看出,取下一行时:
ROWS关键字是按照ORDER BY的排序取的下一行到最后一行
RANGE关键字会跳过相同的值,从下一个值开始取,类似于rank排序的算法
注:计算百分位数时,相同值的百分位数也一样,所以第三种方式使用RANGE关键字实现