[Big Data] US Traffic Accident Analysis (February 2016 to December 2020)

欢乐狗 · 2024-7-11 15:25:00

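Introduction

In today's fast-moving digital era, big data has become an important tool for understanding the world and making decisions. In traffic safety in particular, big data analysis can reveal accident patterns, identify risk factors, and help shape preventive measures that save lives. This article examines a large dataset of US traffic accidents covering February 2016 to December 2020, with the aim of uncovering the underlying patterns and trends through data analysis.

Background

This is a nationwide US car accident dataset covering 49 states. The accident data was collected from February 2016 to December 2020 using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road network. The dataset currently contains about 7.73 million accident records.

Purpose

Since Excel 2007, the worksheet row limit has been 1,048,576 rows; rows beyond that cannot be opened. The data I meet at work is only somewhere between 10,000 and 20,000 rows. Today I use the dataset shown below to test how much data a home computer can handle. At 3.06 GB it is quite large, and I basically never run into data this big at work.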
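To make the comparison with the Excel row limit concrete, here is a minimal sketch that counts the rows of a CSV by streaming it in chunks, so the whole file never has to sit in memory at once. The helper name `count_rows_in_chunks` and the tiny in-memory sample are my own assumptions for illustration; for the real file you would pass its path instead.

```python
import io
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit since Excel 2007


def count_rows_in_chunks(source, chunksize=100_000):
    """Count data rows by streaming the CSV in fixed-size chunks."""
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += len(chunk)
    return total


# Tiny in-memory stand-in for the real file (hypothetical sample values).
sample_csv = io.StringIO("ID,Severity\nA-1,2\nA-2,3\nA-3,2\n")
n_rows = count_rows_in_chunks(sample_csv, chunksize=2)
print(n_rows, "rows; exceeds the Excel limit:", n_rows > EXCEL_MAX_ROWS)
```

A smaller `chunksize` lowers peak memory at the cost of more read calls.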

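Dataset information

ID: unique identifier of the accident record.
Severity: severity of the accident, a number from 1 to 4, where 1 indicates the least impact on traffic and 4 the greatest.
Start_Time and End_Time: start and end time of the accident, in local time.
Start_Lat and Start_Lng: latitude and longitude of the start point of the accident.
End_Lat and End_Lng: latitude and longitude of the end point of the accident's impact; may be empty.
Distance(mi): length of road affected by the accident.
Description: natural-language description of the accident.
Number, Street, Side, City, County, State, Zipcode, Country: address information for the accident location.
Timezone: time zone of the accident location.
Airport_Code: code of the airport weather station closest to the accident location.
Weather_Timestamp, Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in): weather information at the time of the accident.
Weather_Condition: weather condition at the time of the accident, such as rain, snow, thunderstorm, or fog.
Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop: flags indicating the presence of various points of interest (POI) near the accident location.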

Exploratory data analysis (EDA):

Reading in the data:

# import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib  # needed for matplotlib.colors.rgb2hex in the plots below
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
import calendar
import plotly as pt
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from matplotlib import cm  # explicit import instead of `from pylab import *`
import matplotlib.patheffects as PathEffects
import descartes
import geopandas as gpd
from Levenshtein import distance
from itertools import product
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from scipy.spatial.distance import pdist, squareform
from shapely.geometry import Point, Polygon
import geoplot
from geopy.geocoders import Nominatim
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.family'] = "Microsoft JhengHei UI Light"
plt.rcParams['font.serif'] = ["Microsoft JhengHei UI Light"]


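Above we timed reading 7,728,394 rows and 46 columns into memory; it took 35.2 seconds in total:

CPU time:

User time: 25.1 seconds. This is the time the program spent running in user mode, that is, executing its own code (excluding operating system calls). It mainly reflects the program's own workload.
System time (sys time): 7.32 seconds. This is the time the program spent in kernel mode, that is, executing operating system calls (such as file I/O and memory management). It reflects how often and how heavily the program interacts with the operating system.
Total CPU time: 32.4 seconds. This is the sum of user time and system time, the total time the program spent on the CPU.

Wall time:
35.2 seconds in total. This is the real elapsed time from the start of the run to its end, including CPU time, waiting time (such as waiting for I/O to complete), and time when the program was not running (such as waiting for other processes to release resources).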
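The cell that actually read the file is not shown in the post, and the figures resemble what the IPython `%%time` magic prints, though that is a guess. As a rough sketch of how the same three numbers could be collected by hand, the hypothetical helper `timed_read_csv` below wraps `pd.read_csv` with `os.times()` (user and system CPU time) and `time.perf_counter()` (wall time). The in-memory sample stands in for the real 3.06 GB file.

```python
import io
import os
import time
import pandas as pd


def timed_read_csv(source, **kwargs):
    """Read a CSV and report user, system, total CPU and wall-clock seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    frame = pd.read_csv(source, **kwargs)
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    timings = {
        "user": cpu_after.user - cpu_before.user,
        "system": cpu_after.system - cpu_before.system,
        "wall": wall,
    }
    timings["cpu_total"] = timings["user"] + timings["system"]
    return frame, timings


# Hypothetical two-row sample; pass the real file path here instead.
sample = io.StringIO("ID,Severity\nA-1,2\nA-2,3\n")
frame, timings = timed_read_csv(sample)
print(frame.shape, timings)
```

A wall time noticeably larger than the total CPU time, as in the 35.2 versus 32.4 seconds above, suggests the process spent part of the run waiting, for example on disk I/O.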

Data visualization

# count accidents per city (this form works on both pandas 1.x and 2.x)
city_df = df['City'].value_counts().rename_axis('City').reset_index(name='Cases')
top_10_cities = pd.DataFrame(city_df.head(10))
fig, ax = plt.subplots(figsize = (12,7), dpi = 80)
cmap = plt.get_cmap('rainbow', 10)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=top_10_cities['Cases'], x=top_10_cities['City'], palette='rainbow')
total = sum(city_df['Cases'])
# label each bar with its share of all accidents
for i in ax.patches:
    ax.text(i.get_x()+.03, i.get_height()-2500,
            str(round((i.get_height()/total)*100, 2))+'%', fontsize=15, weight='bold',
            color='white')
plt.title('\nTop 10 Cities in US with most no. of \nRoad Accident Cases (2016-2020)\n', size=20, color='grey')
plt.rcParams['font.family'] = "Microsoft JhengHei UI Light"
plt.rcParams['font.serif'] = ["Microsoft JhengHei UI Light"]
plt.ylim(1000, 50000)
plt.xticks(rotation=10, fontsize=12)
plt.yticks(fontsize=12)
ax.set_xlabel('\nCities\n', fontsize=15, color='grey')
ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
for i in ['bottom', 'left']:
    ax.spines[i].set_color('white')
    ax.spines[i].set_linewidth(1.5)
right_side = ax.spines["right"]
right_side.set_visible(False)
top_side = ax.spines["top"]
top_side.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
MA = mpatches.Patch(color=clrs[0], label='City with Maximum\n no. of Road Accidents')
ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=clrs[0], edgecolor='white');
plt.show()


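As the chart above shows, Los Angeles is the US city with the most road accidents (2016-2020), accounting for 2.64% of all accidents; Miami ranks second with 2.39%. About 14% of all accidents in the past five years came from just these 10 cities out of the 10,657 US cities in the data.

Over the past five years (2016-2020), Los Angeles averaged 7,997 road accidents per year.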
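The per-city percentages and the combined share of the top 10 can be computed directly instead of being read off the chart. The sketch below is one way to do it; the helper `top_n_share` and the toy city labels are assumptions for illustration, not values from the dataset.

```python
import pandas as pd


def top_n_share(values, n=10):
    """Return each top-n category's share (%) and their combined share (%)."""
    counts = values.value_counts()
    top = counts.head(n)
    share_each = (top / counts.sum() * 100).round(2)
    share_total = round(top.sum() / counts.sum() * 100, 2)
    return share_each, share_total


# Hypothetical city column: A appears 5 times, B 3 times, C 2 times.
cities = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2)
share_each, share_total = top_n_share(cities, n=2)
print(share_each.to_dict(), share_total)
```

On the real data the call would be something like `top_n_share(df['City'], n=10)`.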

states = gpd.read_file('../input/us-states-map')
# geocode a city name to its latitude (one Nominatim request per call)
def lat(city):
    address=city
    geolocator = Nominatim(user_agent="Your_Name")
    location = geolocator.geocode(address)
    return (location.latitude)
# geocode a city name to its longitude (one Nominatim request per call)
def lng(city):
    address=city
    geolocator = Nominatim(user_agent="Your_Name")
    location = geolocator.geocode(address)
    return (location.longitude)
# list of top 10 cities
top_ten_city_list = list(city_df.City.head(10))
top_ten_city_lat_dict = {}
top_ten_city_lng_dict = {}
for i in top_ten_city_list:
    top_ten_city_lat_dict[i] = lat(i)
    top_ten_city_lng_dict[i] = lng(i)
# .copy() so the new columns are added to an independent frame, not a view of df
top_10_cities_df = df[df['City'].isin(list(top_10_cities.City))].copy()
top_10_cities_df['New_Start_Lat'] = top_10_cities_df['City'].map(top_ten_city_lat_dict)
top_10_cities_df['New_Start_Lng'] = top_10_cities_df['City'].map(top_ten_city_lng_dict)


geometry_cities = [Point(xy) for xy in zip(top_10_cities_df['New_Start_Lng'], top_10_cities_df['New_Start_Lat'])]
geo_df_cities = gpd.GeoDataFrame(top_10_cities_df, geometry=geometry_cities)


fig, ax = plt.subplots(figsize=(15,15))
ax.set_xlim([-125,-65])
ax.set_ylim([22,55])
states.boundary.plot(ax=ax, color='grey');
colors = ['#e6194B','#f58231','#ffe119','#bfef45','#3cb44b', '#aaffc3','#42d4f4','#4363d8','#911eb4','#f032e6']
# larger markers for cities with more accidents
markersizes = [50+(i*20) for i in range(10)][::-1]
for i in range(10):
    geo_df_cities[geo_df_cities['City'] == top_ten_city_list[i]].plot(ax=ax, markersize=markersizes[i],
                                                                     color=colors[i], marker='o',
                                                                     label=top_ten_city_list[i], alpha=0.7);
plt.legend(prop={'size': 13}, loc='best', bbox_to_anchor=(0.5, 0., 0.5, 0.5), edgecolor='white', title="Cities", title_fontsize=15);
for i in ['bottom', 'top', 'left', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False,
                labelleft=False, labelbottom=False)
plt.title('\nVisualization of Top 10 Accident Prone Cities in US (2016-2020)', size=20, color='grey');


Three of the top 10 cities by accident count are not in California.

# print how many cities have a case count below, above, or equal to val
def city_cases_percentage(val, operator):
    if operator == '<':
        res = city_df[city_df['Cases']<val].shape[0]
    elif operator == '>':
        res = city_df[city_df['Cases']>val].shape[0]
    elif operator == '=':
        res = city_df[city_df['Cases']==val].shape[0]
    print(f'{res} Cities, {round(res*100/city_df.shape[0], 2)}%')
city_cases_percentage(1, '=')
city_cases_percentage(100, '<')
city_cases_percentage(1000, '<')
city_cases_percentage(1000, '>')
city_cases_percentage(5000, '>')
city_cases_percentage(10000, '>')


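This dataset contains records for 10,657 cities in total:
Over the past five years, 1,167 cities had only one accident, 11% of all cities in the data.
Over the past five years, 8,682 cities had fewer than 100 accidents, 81% of all cities in the data.
Over the past five years, 10,406 cities had fewer than 1,000 accidents.
Over the past five years, 251 cities had more than 1,000 accidents.
Over the past five years, 40 cities had more than 5,000 accidents.
Over the past five years, only 13 cities had more than 10,000 accidents.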
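The thresholds above overlap (a city with fewer than 100 accidents also has fewer than 1,000). If a non-overlapping breakdown is wanted, `pd.cut` can assign each city to exactly one bucket. This is an alternative sketch, not the method used in the post; the helper `bucket_table`, the bucket edges, and the sample counts are all assumptions for illustration.

```python
import pandas as pd


def bucket_table(cases, edges, labels):
    """Count how many cities fall into each accident-count bucket."""
    # Intervals are right-closed: (0, 1], (1, 99], (99, 999], (999, inf]
    buckets = pd.cut(cases, bins=edges, labels=labels)
    counts = buckets.astype(str).value_counts().reindex(labels, fill_value=0)
    percent = (counts / len(cases) * 100).round(2)
    return pd.DataFrame({"Cities": counts, "Percent": percent})


# Hypothetical per-city accident counts, not taken from the real data.
cases = pd.Series([1, 1, 50, 500, 2000])
table = bucket_table(
    cases,
    edges=[0, 1, 99, 999, float("inf")],
    labels=["1", "2-99", "100-999", "1000+"],
)
print(table)
```

Because every city lands in one bucket only, the Percent column sums to 100.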

# create a dictionary using US State code and their corresponding Name
us_states = {'AK': 'Alaska',
'AL': 'Alabama',
'AR': 'Arkansas',
'AS': 'American Samoa',
'AZ': 'Arizona',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DC': 'District of Columbia',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'GU': 'Guam',
'HI': 'Hawaii',
'IA': 'Iowa',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'MA': 'Massachusetts',
'MD': 'Maryland',
'ME': 'Maine',
'MI': 'Michigan',
'MN': 'Minnesota',
'MO': 'Missouri',
'MP': 'Northern Mariana Islands',
'MS': 'Mississippi',
'MT': 'Montana',
'NC': 'North Carolina',
'ND': 'North Dakota',
'NE': 'Nebraska',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NV': 'Nevada',
'NY': 'New York',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'PR': 'Puerto Rico',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VA': 'Virginia',
'VI': 'Virgin Islands',
'VT': 'Vermont',
'WA': 'Washington',
'WI': 'Wisconsin',
'WV': 'West Virginia',
'WY': 'Wyoming'}
# create a dataframe of State and their corresponding accident cases
# (this form works on both pandas 1.x and 2.x)
state_df = df['State'].value_counts().rename_axis('State').reset_index(name='Cases')
# Function to convert the State Code to the actual corresponding Name
def convert(x): return us_states[x]
state_df['State'] = state_df['State'].apply(convert)
top_ten_states_name = list(state_df['State'].head(10))


fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
cmap = plt.get_cmap('winter', 10)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=state_df['Cases'].head(10), x=state_df['State'].head(10), palette='winter')
# second y-axis carries the white trend line drawn over the bars
ax1 = ax.twinx()
sns.lineplot(data = state_df[:10], marker='o', x='State', y='Cases', color = 'white', alpha = .8)
total = df.shape[0]
# label each bar with its count and its share of all accidents
for i in ax.patches:
    ax.text(i.get_x()-0.2, i.get_height()+10000,
            ' {:,d}\n ({}%) '.format(int(i.get_height()), round(100*i.get_height()/total, 1)), fontsize=15,
            color='black')
ax.set(ylim =(-10000, 600000))
ax1.set(ylim =(-100000, 1700000))
plt.title('\nTop 10 States with most no. of \nAccident cases in US (2016-2020)\n', size=20, color='grey')
ax1.axes.yaxis.set_visible(False)
ax.set_xlabel('\nStates\n', fontsize=15, color='grey')
ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
for i in ['top','right']:
    side1 = ax.spines[i]
    side1.set_visible(False)
    side2 = ax1.spines[i]
    side2.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
ax.spines['bottom'].set_bounds(0.005, 9)
ax.spines['left'].set_bounds(0, 600000)
ax1.spines['bottom'].set_bounds(0.005, 9)
ax1.spines['left'].set_bounds(0, 600000)
ax.tick_params(axis='y', which='major', labelsize=10.6)
ax.tick_params(axis='x', which='major', labelsize=10.6, rotation=10)
MA = mpatches.Patch(color=clrs[0], label='State with Maximum\n no. of Road Accidents')
ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=clrs[0], edgecolor='white');


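Over the past five years, California is the state with the most accidents among all US states, accounting for about 30% of all accidents. It averaged 246 accidents per day, or roughly 10 per hour.
Florida ranks second, with about 10% of all accidents.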
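The per-day and per-hour figures are simple rate arithmetic: total cases divided by the number of days covered, then by 24. A small sketch, with the helper name `daily_and_hourly_rate` and the round-number example being my own assumptions:

```python
from datetime import date


def daily_and_hourly_rate(total_cases, first_day, last_day):
    """Average accidents per day and per hour over an inclusive date range."""
    n_days = (last_day - first_day).days + 1
    per_day = total_cases / n_days
    return per_day, per_day / 24


# Hypothetical total over a 100-day window, chosen for round numbers.
per_day, per_hour = daily_and_hourly_rate(24_000, date(2020, 1, 1), date(2020, 4, 9))
print(per_day, per_hour)

# The article's 246 accidents per day converted to an hourly rate.
print(246 / 24)
```

The last line gives 10.25, which is where the "roughly 10 per hour" comes from.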

geometry = [Point(xy) for xy in zip(df['Start_Lng'], df['Start_Lat'])]
geo_df = gpd.GeoDataFrame(df, geometry=geometry)
geo_df['year'] = geo_df.Start_Time.dt.year
geo_df['State'] = geo_df['State'].apply(convert)


fig, ax = plt.subplots(figsize=(15,15))
ax.set_xlim([-125,-65])
ax.set_ylim([22,55])
states.boundary.plot(ax=ax, color='grey');
# annotate only the top 10 states; the label is passed positionally because
# newer matplotlib versions no longer accept the `s=` keyword
states.apply(lambda x: None if (x.NAME not in top_ten_states_name) else ax.annotate(x.NAME, xy=x.geometry.centroid.coords[0], ha='center', color='black', weight='bold', fontsize=12.5), axis=1);
# CFOTNYMVNPI
colors = ['#FF5252','#9575CD','#FF8A80','#FF4081','#FFEE58','#7C4DFF','#00E5FF','#81D4FA','#64FFDA','#8C9EFF']
count = 0
for i in list(state_df['State'].head(10)):
    geo_df[geo_df['State'] == i].plot(ax=ax, markersize=1, color=colors[count], marker='o');
    count += 1
for i in ['bottom', 'top', 'left', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False,
                labelleft=False, labelbottom=False)
plt.title('\nVisualization of Top 10 Accident Prone States in US (2016-2020)', size=20, color='grey');


fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
cmap = plt.get_cmap('cool', 10)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=state_df['Cases'].tail(10), x=state_df['State'].tail(10), palette='cool')
ax1 = ax.twinx()
sns.lineplot(data = state_df[-10:], marker='o', x='State', y='Cases', color = 'white', alpha = .8)
total = df.shape[0]
# label each bar with its count and its share of all accidents
for i in ax.patches:
    ax.text(i.get_x()-0.1, i.get_height()+100,
            ' {:,d}\n({}%) '.format(int(i.get_height()), round(100*i.get_height()/total, 2)), fontsize=15,
            color='black')
ax.set(ylim =(-50, 5000))
ax1.set(ylim =(-50, 6000))
plt.title('\nTop 10 States with least no. of \nAccident cases in US (2016-2020)\n', size=20, color='grey')
ax1.axes.yaxis.set_visible(False)
ax.set_xlabel('\nStates\n', fontsize=15, color='grey')
ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
for i in ['top', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
    side1 = ax1.spines[i]
    side1.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
ax.spines['bottom'].set_bounds(0.005, 9)
ax.spines['left'].set_bounds(0, 5000)
ax1.spines['bottom'].set_bounds(0.005, 9)
ax1.spines['left'].set_bounds(0, 5000)
ax.tick_params(axis='y', which='major', labelsize=11)
ax.tick_params(axis='x', which='major', labelsize=11, rotation=15)
MI = mpatches.Patch(color=clrs[-1], label='State with Minimum\n no. of Road Accidents')
ax.legend(handles=[MI], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=clrs[-1], edgecolor='white');


Over the past five years, South Dakota is the state with the fewest accidents among all US states in the data: only 213, or about 42 per year on average.

# count accidents per time zone (this form works on both pandas 1.x and 2.x)
timezone_df = df['Timezone'].value_counts().rename_axis('Timezone').reset_index(name='Cases')


fig, ax = plt.subplots(figsize = (10,6), dpi = 80)
cmap = plt.get_cmap('spring', 4)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=timezone_df['Cases'], x=timezone_df['Timezone'], palette='spring')
total = df.shape[0]
# label each bar with its share of all accidents
for i in ax.patches:
    ax.text(i.get_x()+0.3, i.get_height()-50000,
            '{}%'.format(round(i.get_height()*100/total)), fontsize=15,weight='bold',
            color='white')
plt.ylim(-20000, 700000)
plt.title('\nPercentage of Accident Cases for \ndifferent Timezone in US (2016-2020)\n', size=20, color='grey')
plt.ylabel('\nAccident Cases\n', fontsize=15, color='grey')
plt.xlabel('\nTimezones\n', fontsize=15, color='grey')
plt.xticks(fontsize=13)
plt.yticks(fontsize=12)
for i in ['top', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
ax.spines['bottom'].set_bounds(0.005, 3)
ax.spines['left'].set_bounds(0, 700000)
MA = mpatches.Patch(color=clrs[0], label='Timezone with Maximum\n no. of Road Accidents')
MI = mpatches.Patch(color=clrs[-1], label='Timezone with Minimum\n no. of Road Accidents')
ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=[clrs[0], 'grey'], edgecolor='white');


By time zone, the US Eastern time zone has the most accident cases, 39% of the total, while the Mountain time zone has the fewest, only 6%.

fig, ax = plt.subplots(figsize=(15,15))
ax.set_xlim([-125,-65])
ax.set_ylim([22,55])
states.boundary.plot(ax=ax, color='black');
colors = ['#00db49', '#ff5e29', '#88ff33', '#fffb29']
#4132
count = 0
# one scatter layer per time zone, each in its own color
for i in list(timezone_df.Timezone):
    geo_df[geo_df['Timezone'] == i].plot(ax=ax, markersize=1, color=colors[count], marker='o', label=i);
    count += 1
plt.legend(markerscale=10., prop={'size': 15}, edgecolor='white', title="Timezones", title_fontsize=15, loc='lower right');
for i in ['bottom', 'top', 'left', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False,
                labelleft=False, labelbottom=False)
plt.title('\nVisualization of Road Accidents \nfor different Timezones in US (2016-2020)', size=20, color='grey');


# count accidents per street (this form works on both pandas 1.x and 2.x)
street_df = df['Street'].value_counts().rename_axis('Street No.').reset_index(name='Cases')
top_ten_streets_df = pd.DataFrame(street_df.head(10))


fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
cmap = plt.get_cmap('gnuplot2', 10)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=top_ten_streets_df['Cases'], x=top_ten_streets_df['Street No.'], palette='gnuplot2')
ax1 = ax.twinx()
sns.lineplot(data = top_ten_streets_df, marker='o', x='Street No.', y='Cases', color = 'white', alpha = .8)
total = df.shape[0]
# label each bar with its accident count
for i in ax.patches:
    ax.text(i.get_x()+0.04, i.get_height()-2000,
            '{:,d}'.format(int(i.get_height())), fontsize=12.5,weight='bold',
            color='white')
ax.axes.set_ylim(-1000, 30000)
ax1.axes.set_ylim(-1000, 40000)
plt.title('\nTop 10 Accident Prone Streets in US (2016-2020)\n', size=20, color='grey')
ax1.axes.yaxis.set_visible(False)
ax.set_xlabel('\nStreet No.\n', fontsize=15, color='grey')
ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
for i in ['top','right']:
    side1 = ax.spines[i]
    side1.set_visible(False)
    side2 = ax1.spines[i]
    side2.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
ax.spines['bottom'].set_bounds(0.005, 9)
ax.spines['left'].set_bounds(0, 30000)
ax1.spines['bottom'].set_bounds(0.005, 9)
ax1.spines['left'].set_bounds(0, 30000)
ax.tick_params(axis='both', which='major', labelsize=12)
MA = mpatches.Patch(color=clrs[1], label='Street with Maximum\n no. of Road Accidents')
MI = mpatches.Patch(color=clrs[-2], label='Street with Minimum\n no. of Road Accidents')
ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=[clrs[1], 'grey'], edgecolor='white');


Over the past five years, I-5 N has the highest accident count of any street in the US, averaging 14 accidents per day.

# print how many streets have a case count equal to, above, or below val
def street_cases_percentage(val, operator):
    if operator == '=':
        val = street_df[street_df['Cases']==val].shape[0]
    elif operator == '>':
        val = street_df[street_df['Cases']>val].shape[0]
    elif operator == '<':
        val = street_df[street_df['Cases']<val].shape[0]
    print('{:,d} Streets, {}%'.format(val, round(val*100/street_df.shape[0], 2)))
street_cases_percentage(1, '=')
street_cases_percentage(100, '<')
street_cases_percentage(1000, '<')
street_cases_percentage(1000, '>')
street_cases_percentage(5000, '>')


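Over the past five years, accidents occurred on 93,048 streets in the US. Of these, 36,441 streets (39%) had only one accident in the five years; 98% of streets had fewer than 100 accidents; streets with more than 1,000 accidents make up only 0.2%. 24 streets had more than 5,000 accidents.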

# count accidents per severity level (this form works on both pandas 1.x and 2.x)
severity_df = df['Severity'].value_counts().rename_axis('Severity').to_frame('Cases')
fig = go.Figure(go.Funnelarea(
    # build the labels from the index so they always match the counts
    text = ["Severity - {}".format(s) for s in severity_df.index],
    values = severity_df.Cases,
    title = {"position": "top center",
             "text": "<b>Impact on the Traffic due to the Accidents</b>",
             'font':dict(size=18,color="#7f7f7f")},
    marker = {"colors": ['#14a3ee', '#b4e6ee', '#fdf4b8', '#ff4f4e'],
              "line": {"color": ["#e8e8e8", "wheat", "wheat", "wheat"], "width": [7, 0, 0, 2]}}
))
fig.show()


Over the past five years, 80% of accidents had a moderate impact on traffic, while only 7.5% had a severe impact.

fig, ax = plt.subplots(figsize=(15,15))
ax.set_xlim([-125,-65])
ax.set_ylim([22,55])
states.boundary.plot(ax=ax, color='black');
# one scatter layer per severity level, drawn in this order
geo_df[geo_df['Severity'] == 1].plot(ax=ax, markersize=50, color='#5cff4a', marker='o', label='Severity 1');
geo_df[geo_df['Severity'] == 3].plot(ax=ax, markersize=10, color='#ff1c1c', marker='x', label='Severity 3');
geo_df[geo_df['Severity'] == 4].plot(ax=ax, markersize=1, color='#6459ff', marker='v', label='Severity 4');
geo_df[geo_df['Severity'] == 2].plot(ax=ax, markersize=5, color='#ffb340', marker='+', label='Severity 2');
for i in ['bottom', 'top', 'left', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
plt.tick_params(top=False, bottom=False, left=False, right=False,
                labelleft=False, labelbottom=False)
plt.title('\nDifferent level of Severity visualization in US map', size=20, color='grey');
One = mpatches.Patch(color='#5cff4a', label='Severity 1')
Two = mpatches.Patch(color='#ffb340', label='Severity 2')
Three = mpatches.Patch(color='#ff1c1c', label='Severity 3')
Four = mpatches.Patch(color='#6459ff', label='Severity 4')
ax.legend(handles=[One, Two, Three, Four], prop={'size': 15}, loc='lower right', borderpad=1,
          labelcolor=['#5cff4a', '#ffb340', '#ff1c1c', '#6459ff'], edgecolor='white');


accident_duration_df = pd.DataFrame(df['End_Time'] - df['Start_Time']).reset_index().rename(columns={'index':'Id', 0:'Duration'})
# ten most common durations, shuffled for plotting (works on both pandas 1.x and 2.x)
top_10_accident_duration_df = accident_duration_df['Duration'].value_counts().head(10).sample(frac = 1).rename_axis('Duration').reset_index(name='Cases')
Duration = [str(i).split('days')[-1].strip() for i in top_10_accident_duration_df.Duration]
top_10_accident_duration_df['Duration'] = Duration


fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
ax.set_facecolor('#e6f2ed')
fig.patch.set_facecolor('#e6f2ed')
cmap = plt.get_cmap('bwr', 10)
clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
ax=sns.barplot(y=top_10_accident_duration_df['Cases'], x=top_10_accident_duration_df['Duration'], palette='bwr')
ax1 = ax.twinx()
sns.lineplot(data = top_10_accident_duration_df, marker='o', x='Duration', y='Cases', color = 'white', alpha = 1)
total = df.shape[0]
# label each bar with its share of all accidents
for i in ax.patches:
    ax.text(i.get_x(), i.get_height()+5000,
            str(round((i.get_height()/total)*100, 2))+'%', fontsize=15,
            color='black')
ax.set(ylim =(1000, 400000))
ax1.set(ylim =(1000, 500000))
plt.title('\nMost Impacted Durations on the \nTraffic flow due to the Accidents \n', size=20, color='grey')
ax1.axes.yaxis.set_visible(False)
ax.set_xlabel('\nDuration of Accident (HH:MM:SS)\n', fontsize=15, color='grey')
ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
for i in ['bottom', 'top', 'left', 'right']:
    ax.spines[i].set_color('white')
    ax.spines[i].set_linewidth(1.5)
    ax1.spines[i].set_color('white')
    ax1.spines[i].set_linewidth(1.5)
ax.set_axisbelow(True)
ax.grid(color='white', linewidth=1.5)
ax.tick_params(axis='both', which='major', labelsize=12)
MA = mpatches.Patch(color=clrs[-3], label='Duration with Maximum\n no. of Road Accidents')
ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=clrs[-3], facecolor='#e6f2ed', edgecolor='#e6f2ed');


From the chart above, the most common duration of impact on traffic flow was 6 hours, which accounts for 24.25% of road accidents.
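The most common duration and its share can also be read straight from the data. A minimal sketch, assuming `Start_Time` and `End_Time` are already parsed as datetimes; the three sample rows are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical accidents: two lasting 6 hours and one lasting 30 minutes.
events = pd.DataFrame({
    "Start_Time": pd.to_datetime(
        ["2020-01-01 08:00:00", "2020-01-01 09:00:00", "2020-01-02 10:00:00"]
    ),
    "End_Time": pd.to_datetime(
        ["2020-01-01 14:00:00", "2020-01-01 15:00:00", "2020-01-02 10:30:00"]
    ),
})

duration = events["End_Time"] - events["Start_Time"]
# Share (%) of accidents for each distinct duration, most frequent first.
share = duration.value_counts(normalize=True).mul(100).round(2)
most_common = share.index[0]
print(most_common, share.iloc[0])
```

Exact durations are compared here, so values that differ by a second count as different durations; rounding with `duration.dt.round('h')` first would group them by the hour.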

# count accidents per year, smallest first (works on both pandas 1.x and 2.x)
year_df = df.Start_Time.dt.year.value_counts().rename_axis('Year').reset_index(name='Cases').sort_values(by='Cases', ascending=True)


fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
ax=sns.barplot(y=year_df['Cases'], x=year_df['Year'], palette=['#9a90e8', '#5d82de', '#3ee6e0', '#40ff53','#2ee88e'])
total = df.shape[0]
# label each bar with its share of all accidents
for i in ax.patches:
    ax.text(i.get_x()+0.2, i.get_height()-50000,
            str(round((i.get_height()/total)*100, 2))+'%', fontsize=15,weight='bold',
            color='white')
plt.ylim(10000, 900000)
plt.title('\nRoad Accident Percentage \nover past 5 Years in US (2016-2020)\n', size=20, color='grey')
plt.ylabel('\nAccident Cases\n', fontsize=15, color='grey')
plt.xlabel('\nYears\n', fontsize=15, color='grey')
plt.xticks(fontsize=13)
plt.yticks(fontsize=12)
for i in ['bottom', 'top', 'left', 'right']:
    ax.spines[i].set_color('white')
    ax.spines[i].set_linewidth(1.5)
for k in ['top', 'right', "bottom", 'left']:
    side = ax.spines[k]
    side.set_visible(False)
ax.set_axisbelow(True)
ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=0.3)
MA = mpatches.Patch(color='#2ee88e', label='Year with Maximum\n no. of Road Accidents')
MI = mpatches.Patch(color='#9a90e8', label='Year with Minimum\n no. of Road Accidents')
ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
          labelcolor=['#2ee88e', '#9a90e8'], edgecolor='white');
plt.show()


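As the chart above shows, the share of accidents in the US rose markedly over the past five years (2016-2020): 70% of them occurred in the last two years alone (2019 and 2020).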
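The yearly shares, and the combined share of the last two years, can be computed from the start times with `value_counts(normalize=True)`. A small sketch on hypothetical timestamps (the four sample rows are invented for illustration):

```python
import pandas as pd

# Hypothetical start times: one accident in 2018, one in 2019, two in 2020.
start_times = pd.Series(pd.to_datetime([
    "2018-03-01 08:00:00",
    "2019-06-15 17:30:00",
    "2020-01-10 07:45:00",
    "2020-11-20 18:10:00",
]))

# Share (%) of accidents per calendar year, in chronological order.
yearly_share = (
    start_times.dt.year
    .value_counts(normalize=True)
    .sort_index()
    .mul(100)
    .round(2)
)
last_two_years = yearly_share.loc[[2019, 2020]].sum()
print(yearly_share.to_dict(), last_two_years)
```

On the real data the same chain would start from `df.Start_Time.dt.year`, assuming `Start_Time` has been parsed as a datetime column.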

fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(nrows=3, ncols=2, figsize=(15, 10))
fig.suptitle('Accident Cases over the past 5 years in US', fontsize=20,fontweight ="bold", color='grey')
count = 0
years = ['2016', '2017', '2018', '2019', '2020']
colors = ['#77fa5a', '#ffff4d', '#ffab36', '#ff894a', '#ff513b']
# look up each year's count by year rather than by its position in the sorted year_df
cases_by_year = year_df.set_index('Year')['Cases']
for i in [ax1, ax2, ax3, ax4, ax5]:
    i.set_xlim([-125,-65])
    i.set_ylim([22,55])
    states.boundary.plot(ax=i, color='black');
    geo_df[geo_df['year']==int(years[count])].plot(ax=i, markersize=1, color=colors[count], marker='+', alpha=0.5)
    for j in ['bottom', 'top', 'left', 'right']:
        side = i.spines[j]
        side.set_visible(False)
    i.set_title(years[count] + '\n({:,} Road Accident Cases)'.format(int(cases_by_year[int(years[count])])), fontsize=12, color='grey', weight='bold')
    i.axis('off')
    count += 1
# the sixth panel shows the yearly trend as a line
sns.lineplot(data = year_df, marker='o', x='Year', y='Cases', color = '#734dff', ax=ax6, label="Yearly Road Accidents");
for k in ['bottom', 'top', 'left', 'right']:
    side = ax6.spines[k]
    side.set_visible(False)
ax6.xaxis.set_ticks(year_df.Year);
ax6.legend(prop={'size': 12}, loc='best', edgecolor='white');


accident_severity_df = geo_df.groupby(['year', 'Severity']).size().unstack()


ax = accident_severity_df.plot(kind='barh', stacked=True, figsize=(12, 6),
                               color=['#fcfa5d', '#ffe066', '#fab666', '#f68f6a'],
                               rot=0);
ax.set_title('\nSeverity and Corresponding Accident \nPercentage for past 5 years in US\n', fontsize=20, color='grey');
for i in ['top', 'left', 'right']:
    side = ax.spines[i]
    side.set_visible(False)
ax.spines['bottom'].set_bounds(0, 800000);
ax.set_ylabel('\nYears\n', fontsize=15, color='grey');
ax.set_xlabel('\nAccident Cases\n', fontsize=15, color='grey');
ax.legend(prop={'size': 12.5}, loc='best', fancybox = True, title="Severity", title_fontsize=15, edgecolor='white');
ax.tick_params(axis='both', which='major', labelsize=12.5)
#ax.set_facecolor('#e6f2ed')
# label each bar segment with its share of all accidents; the x offset
# depends on the segment size so that labels of narrow segments do not overlap
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    var = width*100/df.shape[0]
    if var > 0:
        if var > 4:
            ax.text(x+width/2,
                    y+height/2-0.05,
                    '{:.2f}%'.format(width*100/df.shape[0]),
                    fontsize=12, color='black', alpha= 0.8)
        elif var > 1.8 and var < 3.5:
            ax.text(x+width/2-17000,
                    y+height/2-0.05,
                    '{:.2f}%'.format(width*100/df.shape[0]),
                    fontsize=12, color='black', alpha= 0.8)
        elif var>1.5 and var<1.8:
            ax.text(x+width/2+7000,
                    y+height/2-0.05,
                    ' {:.2f}%'.format(width*100/df.shape[0]),
                    fontsize=12, color='black', alpha= 0.8)
        elif var>1:
            ax.text(x+width/2-20000,
                    y+height/2-0.05,
                    ' {:.2f}%'.format(width*100/df.shape[0]),
                    fontsize=12, color='black', alpha= 0.8)
        else:
            ax.text(x+width/2+10000,
                    y+height/2-0.05,
                    ' {:.2f}%'.format(width*100/df.shape[0]),
                    fontsize=12, color='black', alpha= 0.8)


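Over the four years from 2017 to 2020, high-severity accidents in each year stayed between 1.55% and 1.8% of all accidents. Moderate-severity accidents in 2020 alone account for 45% of all road accidents in the five-year period.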
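The percentages in the stacked chart are each year-and-severity cell divided by the grand total. `pd.crosstab` with `normalize='all'` produces that table in one step. This is a sketch on hypothetical rows, not the post's own code; the column names follow the `year` and `Severity` columns used above.

```python
import pandas as pd

# Hypothetical records: two accidents in 2019 and three in 2020.
records = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "Severity": [2, 4, 2, 2, 3],
})

# Each cell is that year/severity combination as a percentage of ALL rows.
share = (
    pd.crosstab(records["year"], records["Severity"], normalize="all")
    .mul(100)
    .round(2)
)
print(share)
```

With `normalize='index'` instead, each row would sum to 100, giving the severity mix within a single year rather than its share of the whole period.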
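
Summary

This article set out to find the upper limit on the size of dataset a home computer can read into memory, but the data volume was too small to reach that limit. Along the way I looked into the modules for drawing maps of the US, which was very time-consuming. Displaying Chinese text was also a problem, so in the end I had to use translation software to convert the chart labels to English. Interested readers can take it further; as for me, I am giving up on this approach.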
