MySQL 8.0的新特性2——窗口函数

窗口函数是MySQL 8.0引入的一项重要功能,它允许用户在查询结果集的特定窗口内执行计算,而不需要使用GROUP BY语句对结果进行分组。窗口函数为数据分析和报表生成提供了强大的支持,可以实现排名、聚合、偏移等复杂计算。本章将详细介绍MySQL 8.0中的窗口函数及其用法。

窗口函数概述

窗口函数是一种特殊类型的函数,它在一组与当前行相关的行(称为"窗口")上执行计算。与聚合函数不同,窗口函数不会将多行数据合并为一行,而是为每一行返回一个结果。

窗口函数的基本语法如下:

window_function_name(expression) OVER (
    [PARTITION BY partition_expression, ...]
    [ORDER BY sort_expression [ASC|DESC], ...]
    [frame_clause]
)
  • window_function_name:窗口函数的名称,如ROW_NUMBER、RANK、SUM等
  • expression:函数参数(某些窗口函数可能不需要参数)
  • PARTITION BY:可选,用于将结果集划分为不同的分区,窗口函数将在每个分区内独立计算
  • ORDER BY:可选,用于在分区内对行进行排序,决定窗口函数的计算顺序
  • frame_clause:可选,用于定义当前分区内的窗口大小(即与当前行相关的行的集合)

排名窗口函数

排名窗口函数用于为结果集中的行分配排名。MySQL 8.0提供了以下排名窗口函数:

函数 描述 特点
ROW_NUMBER() 为每一行分配一个唯一的序号,从1开始 无论值是否相同,序号都连续
RANK() 为每一行分配排名,相同值的行获得相同的排名 相同值的行排名相同,下一组的排名会跳跃
DENSE_RANK() 为每一行分配排名,相同值的行获得相同的排名 相同值的行排名相同,下一组的排名不会跳跃
NTILE(n) 将结果集划分为n个大致相等的组,并为每行分配所属组的编号 用于将数据均匀分组

排名窗口函数示例

假设有一个学生成绩表:

CREATE TABLE student_scores (
    student_id INT,
    student_name VARCHAR(50),
    subject VARCHAR(50),
    score INT
);

INSERT INTO student_scores VALUES
(1, '张三', '数学', 85),
(1, '张三', '语文', 92),
(1, '张三', '英语', 78),
(2, '李四', '数学', 92),
(2, '李四', '语文', 88),
(2, '李四', '英语', 90),
(3, '王五', '数学', 78),
(3, '王五', '语文', 80),
(3, '王五', '英语', 85),
(4, '赵六', '数学', 92),
(4, '赵六', '语文', 85),
(4, '赵六', '英语', 88);

使用排名窗口函数:

-- 按科目对学生成绩进行排名
SELECT 
    student_name, 
    subject, 
    score,
    ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC) AS row_num,
    RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS rank_num,
    DENSE_RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS dense_rank_num,
    NTILE(3) OVER (PARTITION BY subject ORDER BY score DESC) AS ntile_num
FROM student_scores;

-- 计算每个学生的总分,并按总分排名
SELECT 
    student_id,
    student_name,
    SUM(score) AS total_score,
    RANK() OVER (ORDER BY SUM(score) DESC) AS rank_by_total
FROM student_scores
GROUP BY student_id, student_name
ORDER BY rank_by_total;

-- 在每个科目中,将学生成绩分为三个等级
SELECT 
    student_name,
    subject,
    score,
    NTILE(3) OVER (PARTITION BY subject ORDER BY score DESC) AS grade_level,
    CASE NTILE(3) OVER (PARTITION BY subject ORDER BY score DESC)
        WHEN 1 THEN '优秀'
        WHEN 2 THEN '良好'
        WHEN 3 THEN '及格'
    END AS grade_desc
FROM student_scores;

聚合窗口函数

聚合窗口函数是将聚合函数(如SUM、AVG、MIN、MAX等)用作窗口函数。与普通聚合函数不同,聚合窗口函数不会合并行,而是为每一行返回一个基于窗口的聚合结果。

聚合窗口函数示例

-- 计算每个学生的总分,以及该学生在所有学生中的总分占比
SELECT 
    student_id,
    student_name,
    SUM(score) AS total_score,
    SUM(score) OVER () AS overall_total,
    CONCAT(ROUND(SUM(score) / SUM(score) OVER () * 100, 2), '%') AS percentage
FROM student_scores
GROUP BY student_id, student_name;

-- 按科目计算每个学生的成绩,以及该科目的平均分、最高分和最低分
SELECT 
    student_name,
    subject,
    score,
    AVG(score) OVER (PARTITION BY subject) AS avg_score,
    MAX(score) OVER (PARTITION BY subject) AS max_score,
    MIN(score) OVER (PARTITION BY subject) AS min_score
FROM student_scores;

-- 计算每个学生的累计成绩
SELECT 
    student_id,
    student_name,
    subject,
    score,
    SUM(score) OVER (PARTITION BY student_id ORDER BY subject) AS cumulative_score
FROM student_scores;

-- 计算移动平均值(每个学生前一次和当前成绩的平均值)
SELECT 
    student_id,
    student_name,
    subject,
    score,
    AVG(score) OVER (
        PARTITION BY student_id 
        ORDER BY subject 
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS moving_avg
FROM student_scores;

偏移窗口函数

偏移窗口函数用于访问结果集中相对于当前行的其他行的数据。MySQL 8.0提供了以下偏移窗口函数:

函数 描述 示例
LAG(expr [, offset [, default]]) 返回当前行之前offset行的expr值 LAG(score, 1, 0) OVER (ORDER BY score) 返回前一行的score值
LEAD(expr [, offset [, default]]) 返回当前行之后offset行的expr值 LEAD(score, 1, 0) OVER (ORDER BY score) 返回后一行的score值
FIRST_VALUE(expr) 返回窗口中的第一行的expr值 FIRST_VALUE(score) OVER (ORDER BY score) 返回窗口中的第一个score值
LAST_VALUE(expr) 返回窗口中的最后一行的expr值 LAST_VALUE(score) OVER (ORDER BY score) 返回窗口中的最后一个score值

偏移窗口函数示例

-- 计算每个学生相邻两次考试的成绩变化
-- 首先,我们需要有一个有序的考试记录表
CREATE TABLE exam_records (
    student_id INT,
    exam_date DATE,
    subject VARCHAR(50),
    score INT
);

INSERT INTO exam_records VALUES
(1, '2023-09-01', '数学', 85),
(1, '2023-10-01', '数学', 88),
(1, '2023-11-01', '数学', 92),
(2, '2023-09-01', '数学', 90),
(2, '2023-10-01', '数学', 87),
(2, '2023-11-01', '数学', 94);

-- 使用LAG函数计算成绩变化
SELECT 
    student_id,
    exam_date,
    score,
    LAG(score) OVER (PARTITION BY student_id ORDER BY exam_date) AS previous_score,
    score - LAG(score) OVER (PARTITION BY student_id ORDER BY exam_date) AS score_change
FROM exam_records;

-- 使用LEAD函数预测下一次考试的可能成绩
SELECT 
    student_id,
    exam_date,
    score,
    LEAD(score) OVER (PARTITION BY student_id ORDER BY exam_date) AS next_score
FROM exam_records;

-- 找出每个学生的最高分和最低分
SELECT 
    student_id,
    exam_date,
    score,
    FIRST_VALUE(score) OVER (PARTITION BY student_id ORDER BY score DESC) AS max_score,
    FIRST_VALUE(score) OVER (PARTITION BY student_id ORDER BY score ASC) AS min_score
FROM exam_records;

-- 计算与最高分的差距
SELECT 
    student_id,
    exam_date,
    score,
    MAX(score) OVER (PARTITION BY student_id) AS highest_score,
    MAX(score) OVER (PARTITION BY student_id) - score AS score_gap
FROM exam_records;

窗口框架(Frame)

窗口框架是窗口函数中的一个重要概念,它定义了在分区内计算窗口函数的行集合。窗口框架可以通过frame_clause来指定,frame_clause的语法如下:

[ROWS | RANGE] frame_start [frame_between]

其中,frame_start可以是以下值之一:

  • UNBOUNDED PRECEDING:分区的第一行
  • N PRECEDING:当前行之前的N行
  • CURRENT ROW:当前行

frame_between可以是以下值之一:

  • AND UNBOUNDED FOLLOWING:分区的最后一行
  • AND N FOLLOWING:当前行之后的N行
  • AND CURRENT ROW:当前行

ROWS和RANGE的区别:

  • ROWS:基于行的物理位置来确定窗口
  • RANGE:基于行的值来确定窗口(对于相同值的行,会被包含在同一个窗口中)

窗口框架示例

-- 使用不同的窗口框架计算移动平均值
SELECT 
    student_id,
    exam_date,
    score,
    -- 从分区开始到当前行的平均值
    AVG(score) OVER (
        PARTITION BY student_id 
        ORDER BY exam_date 
        ROWS UNBOUNDED PRECEDING
    ) AS cumulative_avg,
    -- 当前行和前一行的平均值(移动平均)
    AVG(score) OVER (
        PARTITION BY student_id 
        ORDER BY exam_date 
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
    ) AS moving_avg_2rows,
    -- 当前行和前后各一行的平均值(3行移动平均)
    AVG(score) OVER (
        PARTITION BY student_id 
        ORDER BY exam_date 
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS moving_avg_3rows
FROM exam_records;

-- 计算累积和,使用RANGE框架(对于相同值的行,会被一起计算)
SELECT 
    score,
    COUNT(*) OVER (
        ORDER BY score 
        RANGE UNBOUNDED PRECEDING
    ) AS cumulative_count,
    SUM(score) OVER (
        ORDER BY score 
        RANGE UNBOUNDED PRECEDING
    ) AS cumulative_sum
FROM student_scores;

-- 使用窗口框架计算市场份额变化
-- 假设我们有一个销售数据表
CREATE TABLE sales_data (
    month DATE,
    product_id INT,
    sales_amount DECIMAL(10,2)
);

INSERT INTO sales_data VALUES
('2023-01-01', 1, 1000),
('2023-01-01', 2, 1500),
('2023-01-01', 3, 500),
('2023-02-01', 1, 1200),
('2023-02-01', 2, 1300),
('2023-02-01', 3, 800),
('2023-03-01', 1, 1400),
('2023-03-01', 2, 1100),
('2023-03-01', 3, 900);

-- 计算每个产品的月度销售额及其在月度总销售额中的占比
SELECT 
    month,
    product_id,
    sales_amount,
    SUM(sales_amount) OVER (PARTITION BY month) AS monthly_total,
    CONCAT(ROUND(sales_amount / SUM(sales_amount) OVER (PARTITION BY month) * 100, 2), '%') AS market_share
FROM sales_data;

-- 计算每个产品的销售额累计占比(帕累托分析)
SELECT 
    product_id,
    SUM(sales_amount) AS total_sales,
    SUM(SUM(sales_amount)) OVER (ORDER BY SUM(sales_amount) DESC) AS cumulative_sales,
    SUM(SUM(sales_amount)) OVER () AS grand_total,
    CONCAT(
        ROUND(
            SUM(SUM(sales_amount)) OVER (ORDER BY SUM(sales_amount) DESC) / 
            SUM(SUM(sales_amount)) OVER () * 100, 
            2
        ), 
        '%'
    ) AS cumulative_percentage
FROM sales_data
GROUP BY product_id
ORDER BY total_sales DESC;

窗口函数的高级应用

窗口函数可以与其他SQL功能结合使用,实现更复杂的数据分析需求:

高级应用示例

-- 计算连续登录的天数
-- 假设我们有一个用户登录记录表
CREATE TABLE user_logins (
    user_id INT,
    login_date DATE
);

INSERT INTO user_logins VALUES
(1, '2023-10-01'),
(1, '2023-10-02'),
(1, '2023-10-03'),
(1, '2023-10-05'),
(1, '2023-10-06'),
(2, '2023-10-01'),
(2, '2023-10-03'),
(2, '2023-10-04');

-- 计算每个用户的连续登录天数
WITH consecutive_logins AS (
    SELECT 
        user_id,
        login_date,
        login_date - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM user_logins
)
SELECT 
    user_id,
    MIN(login_date) AS start_date,
    MAX(login_date) AS end_date,
    COUNT(*) AS consecutive_days
FROM consecutive_logins
GROUP BY user_id, grp
ORDER BY user_id, start_date;

-- 计算每个学生的成绩排名变化
-- 假设我们有两次考试的成绩数据
CREATE TABLE exam_comparison (
    student_id INT,
    student_name VARCHAR(50),
    exam_type VARCHAR(20), -- 'midterm' 或 'final'
    score INT
);

INSERT INTO exam_comparison VALUES
(1, '张三', 'midterm', 85),
(1, '张三', 'final', 92),
(2, '李四', 'midterm', 90),
(2, '李四', 'final', 87),
(3, '王五', 'midterm', 78),
(3, '王五', 'final', 85),
(4, '赵六', 'midterm', 88),
(4, '赵六', 'final', 94);

-- 计算期中考试和期末考试的排名,以及排名变化
WITH ranked_exams AS (
    SELECT 
        student_id,
        student_name,
        exam_type,
        score,
        RANK() OVER (PARTITION BY exam_type ORDER BY score DESC) AS exam_rank
    FROM exam_comparison
)
SELECT 
    r1.student_id,
    r1.student_name,
    r1.score AS midterm_score,
    r1.exam_rank AS midterm_rank,
    r2.score AS final_score,
    r2.exam_rank AS final_rank,
    r1.exam_rank - r2.exam_rank AS rank_improvement
FROM ranked_exams r1
JOIN ranked_exams r2 
    ON r1.student_id = r2.student_id 
    AND r1.exam_type = 'midterm' 
    AND r2.exam_type = 'final'
ORDER BY rank_improvement DESC;

-- 使用窗口函数进行数据去重
-- 假设我们有一个包含重复记录的表,我们想要保留每个用户的最新记录
CREATE TABLE user_activity (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    activity_time DATETIME,
    activity_type VARCHAR(50),
    details VARCHAR(100)
);

INSERT INTO user_activity (user_id, activity_time, activity_type, details) VALUES
(1, '2023-10-15 10:00:00', 'login', '用户登录'),
(1, '2023-10-15 10:30:00', 'update_profile', '更新个人资料'),
(1, '2023-10-15 11:00:00', 'logout', '用户登出'),
(2, '2023-10-15 09:00:00', 'login', '用户登录'),
(2, '2023-10-15 10:00:00', 'purchase', '购买商品'),
(2, '2023-10-15 11:30:00', 'logout', '用户登出');

-- 获取每个用户的最新活动记录
SELECT *
FROM (
    SELECT 
        *, 
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_time DESC) AS row_num
    FROM user_activity
) AS latest_activities
WHERE row_num = 1;

性能优化建议:使用窗口函数时,可以考虑以下性能优化策略:

  1. 合理使用PARTITION BY子句,将数据划分为适当的分区,减少每个窗口的大小
  2. 仅在需要时使用ORDER BY子句,因为排序操作会增加查询的开销
  3. 对于大型数据集,可以考虑创建适当的索引来加速窗口函数的计算
  4. 注意窗口框架的选择,不同的窗口框架会影响查询的性能和结果
  5. 如果可能,尽量使用ROWS而不是RANGE,因为ROWS的计算通常更快