SQL164 2021年11月每天新用户的次日留存率
思路
-
找出新用户:确定每个用户首次活跃的日期(即新用户)
- 例如101用户在11月1日首次出现
-
处理跨天活跃:考虑用户可能跨天活跃的情况(in_time和out_time不在同一天)
- 例如用户可能在11月1日23:50进入,11月2日00:10离开,则算作两天都活跃
-
计算次日留存:
- 对每个新用户,检查他们首次活跃后的第二天是否仍然活跃
- 使用LEAD窗口函数高效获取用户下一次活跃日期
-
计算留存率:
- 每天的新用户数作为分母
- 第二天仍然活跃的新用户数作为分子
- 两者相除得到留存率,保留2位小数
最终输出2021年11月每天新用户的次日留存率,按日期排序。
代码
WITH
-- 获取每个用户的最早活跃日期作为其首次出现日期
first_occurrence AS (SELECT uid,DATE(MIN(in_time)) AS first_dtFROM tb_user_logGROUP BY uid
),-- 获取每个用户每天活跃的记录(处理跨天情况)
user_active_dates AS (SELECT DISTINCT uid,DATE(in_time) AS active_dateFROM tb_user_logUNIONSELECT DISTINCT uid,DATE(out_time) AS active_dateFROM tb_user_log
),-- 为每个用户按日期排序,并使用LEAD获取下一天的活跃状态
user_activity_sequence AS (SELECT uid,active_date,LEAD(active_date) OVER (PARTITION BY uid ORDER BY active_date) AS next_active_dateFROM user_active_dates
),-- 计算每天的新用户数及其次日留存情况
daily_stats AS (SELECT fo.first_dt AS dt,COUNT(DISTINCT fo.uid) AS new_users,COUNT(DISTINCT CASE WHEN DATEDIFF(uas.next_active_date, fo.first_dt) = 1 THEN fo.uid END) AS retained_usersFROM first_occurrence foLEFT JOIN user_activity_sequence uas ON fo.uid = uas.uid AND fo.first_dt = uas.active_dateWHERE fo.first_dt BETWEEN '2021-11-01' AND '2021-11-30'GROUP BY fo.first_dt
)-- 计算并格式化留存率
SELECT dt,ROUND(IF(new_users = 0, 0, retained_users / new_users), 2) AS uv_left_rate
FROM daily_stats
ORDER BY dt;
逐步展示如何计算2021年11月每天新用户的次日留存率
原始数据表 tb_user_log
id | uid | artical_id | in_time | out_time | sign_in |
---|---|---|---|---|---|
1 | 101 | 0 | 2021-11-01 10:00:00 | 2021-11-01 10:00:42 | 1 |
2 | 102 | 9001 | 2021-11-01 10:00:00 | 2021-11-01 10:00:09 | 0 |
3 | 103 | 9001 | 2021-11-01 10:00:01 | 2021-11-01 10:01:50 | 0 |
4 | 101 | 9002 | 2021-11-02 10:00:09 | 2021-11-02 10:00:28 | 0 |
5 | 103 | 9002 | 2021-11-02 10:00:51 | 2021-11-02 10:00:59 | 0 |
6 | 104 | 9001 | 2021-11-02 10:00:28 | 2021-11-02 10:00:50 | 0 |
7 | 101 | 9003 | 2021-11-03 11:00:55 | 2021-11-03 11:01:24 | 0 |
8 | 104 | 9003 | 2021-11-03 11:00:45 | 2021-11-03 11:00:55 | 0 |
9 | 105 | 9003 | 2021-11-03 11:00:53 | 2021-11-03 11:00:59 | 0 |
10 | 101 | 9002 | 2021-11-04 11:00:55 | 2021-11-04 11:00:59 | 0 |
步骤1:确定每个用户的首次活跃日期
SELECT uid,DATE(MIN(in_time)) AS first_dt
FROM tb_user_log
GROUP BY uid;
结果:
uid | first_dt |
---|---|
101 | 2021-11-01 |
102 | 2021-11-01 |
103 | 2021-11-01 |
104 | 2021-11-02 |
105 | 2021-11-03 |
步骤2:处理跨天情况,获取用户活跃日期
SELECT DISTINCT uid,DATE(in_time) AS active_date
FROM tb_user_log
UNION
SELECT DISTINCT uid,DATE(out_time) AS active_date
FROM tb_user_log;
结果:
uid | active_date |
---|---|
101 | 2021-11-01 |
101 | 2021-11-02 |
101 | 2021-11-03 |
101 | 2021-11-04 |
102 | 2021-11-01 |
103 | 2021-11-01 |
103 | 2021-11-02 |
104 | 2021-11-02 |
104 | 2021-11-03 |
105 | 2021-11-03 |
步骤3:使用LEAD函数获取用户的下一次活跃日期
SELECT uid,active_date,LEAD(active_date) OVER (PARTITION BY uid ORDER BY active_date) AS next_active_date
FROM user_active_dates;
结果:
uid | active_date | next_active_date |
---|---|---|
101 | 2021-11-01 | 2021-11-02 |
101 | 2021-11-02 | 2021-11-03 |
101 | 2021-11-03 | 2021-11-04 |
101 | 2021-11-04 | NULL |
102 | 2021-11-01 | NULL |
103 | 2021-11-01 | 2021-11-02 |
103 | 2021-11-02 | NULL |
104 | 2021-11-02 | 2021-11-03 |
104 | 2021-11-03 | NULL |
105 | 2021-11-03 | NULL |
步骤4:计算每天的新用户次日留存情况
SELECT fo.first_dt AS dt,COUNT(DISTINCT fo.uid) AS new_users,COUNT(DISTINCT CASE WHEN DATEDIFF(uas.next_active_date, fo.first_dt) = 1 THEN fo.uid END) AS retained_users
FROM first_occurrence fo
LEFT JOIN user_activity_sequence uas ON fo.uid = uas.uid AND fo.first_dt = uas.active_date
WHERE fo.first_dt BETWEEN '2021-11-01' AND '2021-11-30'
GROUP BY fo.first_dt;
结果:
dt | new_users | retained_users |
---|---|---|
2021-11-01 | 3 | 2 |
2021-11-02 | 1 | 1 |
2021-11-03 | 1 | 0 |
详细解释一下
这个CTE是计算每日新用户及其次日留存情况的核心部分,详细拆解逻辑:
-
数据来源:
first_occurrence
:包含每个用户的首次活跃日期user_activity_sequence
:包含用户每次活跃日期及下一次活跃日期(使用LEAD计算)
-
连接条件:
LEFT JOIN user_activity_sequence uas ON fo.uid = uas.uid AND fo.first_dt = uas.active_date
- 按用户ID连接
- 只连接用户首次活跃当天的记录(因为我们要计算的是新用户的次日留存)
-
计算字段:
new_users
:每天首次出现的用户数(COUNT DISTINCT)retained_users
:这些新用户中第二天仍然活跃的数量
-
留存判断逻辑:
CASE WHEN DATEDIFF(uas.next_active_date, fo.first_dt) = 1 THEN fo.uid END
- 计算用户首次活跃日期与下一次活跃日期的差值
- 如果差值为1天,则表示用户次日活跃
-
为什么用LEFT JOIN:
- 确保即使新用户第二天不活跃,也会被计入分母(新用户数)
- 不活跃的用户在CASE WHEN中会返回NULL,不会被COUNT计算
示例数据推演
以2021-11-01为例:
- 新用户:101、102、103
- 检查他们的次日活跃情况:
- 101:11-02活跃(符合)
- 102:11-02不活跃
- 103:11-02活跃(符合)
- 结果:3个新用户,2个次日活跃 → 留存率2/3=0.67
这种设计确保了:
- 准确识别新用户
- 正确处理跨天活跃情况
- 精确计算次日留存率
最终结果:计算留存率
SELECT dt,ROUND(IF(new_users = 0, 0, retained_users / new_users), 2) AS uv_left_rate
FROM daily_stats
ORDER BY dt;
最终输出:
dt | uv_left_rate |
---|---|
2021-11-01 | 0.67 |
2021-11-02 | 1.00 |
2021-11-03 | 0.00 |