【pandas小技巧】--日期相关处理
日期处理相关内容之前pandas基础系列中有一篇专门介绍过,本篇补充两个常用的技巧。1. 多列合并为日期
当收集来的数据中,年月日等信息分散在多个列时,往往需要先合并成日期类型,然后才能做分析处理。
合并多列转换为日期类型,可以直接用 to_datetime函数来处理:
import pandas as pd
df = pd.DataFrame(
{
"year": ["2021", "2021",
"2022", "2022", "2022"],
"month": ["1", "3", "4", "4", "6"],
"day": ["10", "20", "4", "4", "1"],
"value": ,
}
)
df["date"] = pd.to_datetime(
df[["year", "month", "day"]]
)
df = df.drop(
columns=["year", "month", "day"]
)
dfhttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684210780653-55ecc842-e05a-454f-8e82-1c442880dfdc.png#averageHue=%23eeeae8&clientId=ue64a40dd-4a49-4&from=paste&height=231&id=u6cedc0f3&originHeight=231&originWidth=556&originalType=binary&ratio=1&rotation=0&showTitle=false&size=40086&status=done&style=stroke&taskId=u708e141a-4f27-42aa-b9b3-507387d5c85&title=&width=556
2. 基于日期的聚合统计
之所以要把列类型转换为日期类型,是因为pandas提供了针对日期类型的非常便利的聚合统计方法。
比如如下连续的日期数据:
df = pd.DataFrame(
{
"year": ["2022", "2022", "2023",
"2023", "2023"],
"month": ["12", "12", "1", "1", "1"],
"day": ["30", "31", "1", "1", "2"],
"value": ,
}
)
df["date"] = pd.to_datetime(
df[["year", "month", "day"]]
)
df = df.drop(
columns=["year", "month", "day"]
)
df = df.loc[:, ::-1]
dfhttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684212067329-f8be2ccf-32d4-4197-8bad-4dc708a98778.png#averageHue=%23f0eae9&clientId=u57afd2c2-ec55-4&from=paste&height=258&id=uec64e839&originHeight=258&originWidth=551&originalType=binary&ratio=1&rotation=0&showTitle=false&size=39466&status=done&style=stroke&taskId=ue52522e6-e59c-4fdd-bac6-4c713283064&title=&width=551
这里用了之前介绍过的一个小技巧 df.loc[:, ::-1],把date列放在value列之前,对数据处理没有什么影响,只是为了看数据的习惯。
得到转换好的数据之后,可以通过resample函数来聚合统计。
resample是pandas提供的专门用于时间序列数据的聚合统计的。
2.1. 按年统计
ysum = df.resample("Y", on="date").value.sum()
ymean = df.resample("Y", on="date").value.mean()
stat = pd.DataFrame({
"sum": ysum,
"mean": ymean,
})
stathttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684212989019-154ab4c5-381a-4131-b99f-4f60734388d0.png#averageHue=%23edebeb&clientId=u57afd2c2-ec55-4&from=paste&height=214&id=ud410050d&originHeight=214&originWidth=499&originalType=binary&ratio=1&rotation=0&showTitle=false&size=31050&status=done&style=stroke&taskId=u83f32fd2-7539-4b88-8ec4-0a83e9fd2b7&title=&width=499
示例数据只有两年的,统计后显示的是日期是年末最后一天。
这里为了演示只统计了合计值和平均值,实际可以根据情况统计需要的值。
2.2. 按月统计
msum = df.resample("M", on="date").value.sum()
mmean = df.resample("M", on="date").value.mean()
stat = pd.DataFrame({
"sum": msum,
"mean": mmean,
})
stathttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684213018919-0930b01c-1d82-4b1c-b1fb-d444ad8ee561.png#averageHue=%23edebeb&clientId=u57afd2c2-ec55-4&from=paste&height=213&id=u129afffc&originHeight=213&originWidth=509&originalType=binary&ratio=1&rotation=0&showTitle=false&size=31122&status=done&style=stroke&taskId=u5068ce3d-d9c7-4380-bcb2-ff5698ede6a&title=&width=509
统计后显示的日期是每个月月末的日期。
2.3. 按日统计
dsum = df.resample("D", on="date").value.sum()
dmean = df.resample("D", on="date").value.mean()
stat = pd.DataFrame({
"sum": dsum,
"mean": dmean,
})
stathttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684214311651-100ef9fe-771d-4cf1-b64c-ddfaae8ecf46.png#averageHue=%23edeaea&clientId=u57afd2c2-ec55-4&from=paste&height=239&id=u48363317&originHeight=239&originWidth=507&originalType=binary&ratio=1&rotation=0&showTitle=false&size=37989&status=done&style=stroke&taskId=u8fd161ef-3545-4bd8-a366-b3f2c96d3ea&title=&width=507
根据每天的日期统计。
2.4. 按季度统计
qsum = df.resample("Q", on="date").value.sum()
qmean = df.resample("Q", on="date").value.mean()
stat = pd.DataFrame({
"sum": qsum,
"mean": qmean,
})
stathttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684216362412-9112b82b-f47d-4448-9d77-80797a7eba66.png#averageHue=%23eeeceb&clientId=u57afd2c2-ec55-4&from=paste&height=219&id=u51b0dc36&originHeight=219&originWidth=501&originalType=binary&ratio=1&rotation=0&showTitle=false&size=31554&status=done&style=stroke&taskId=u2c5cd89a-5785-4ae7-aa66-9169e691876&title=&width=501
统计后显示的日期是每个季度的最后一天。
2.5. 按周统计
wsum = df.resample("W", on="date").value.sum()
wmean = df.resample("W", on="date").value.mean()
stat = pd.DataFrame({
"sum": wsum,
"mean": wmean,
})
stathttps://cdn.nlark.com/yuque/0/2023/png/2235414/1684216449759-77744c64-3d5f-4714-bbf2-9e725ba387fb.png#averageHue=%23edebeb&clientId=u57afd2c2-ec55-4&from=paste&height=213&id=uddf1150c&originHeight=213&originWidth=501&originalType=binary&ratio=1&rotation=0&showTitle=false&size=31614&status=done&style=stroke&taskId=u2fad8c5b-4029-427e-b7fa-1e46a0b3cc4&title=&width=501
统计后显示的日期是每个周的周日。
2.6. 补充
resample函数支持的统计期间除了上面介绍的常用的年,月,日,周,季度等等,还有很多其他的期间,
具体参考:https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
https://cdn.nlark.com/yuque/0/2023/png/2235414/1684216633984-49edf30b-9e0b-4f0e-ac56-80e775f0b9d9.png#averageHue=%23fdfcfc&clientId=u57afd2c2-ec55-4&from=paste&height=1385&id=u4fb0e5fb&originHeight=1385&originWidth=581&originalType=binary&ratio=1&rotation=0&showTitle=false&size=53989&status=done&style=stroke&taskId=ue9cd29aa-c533-4b63-a6e8-41ae87cc19b&title=&width=581
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!
页:
[1]