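Introduction
In today's fast-moving digital era, big data has become an important tool for understanding the world and making decisions. In traffic safety in particular, large-scale data analysis can reveal accident patterns, identify risk factors, and help shape preventive measures, ultimately saving lives. This article examines a large dataset of US traffic accidents recorded between February 2016 and December 2020, with the aim of uncovering the underlying patterns and trends.
Background
This is a countrywide car accident dataset covering 49 US states. The accident data was collected from February 2016 to December 2020 using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains about 7.73 million accident records.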
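Purpose
Since Excel 2007, a worksheet can hold at most 1,048,576 rows, and any rows beyond that limit cannot be opened. The datasets I currently see at work run to somewhere between 10,000 and 20,000 records. Today I am using the dataset described below to test how much data a home computer can handle. At 3.06 GB it is fairly large; data of this size essentially never comes up in my day-to-day work.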
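To make the Excel limit concrete, here is a minimal sketch that counts the data rows of a CSV file by streaming it, then compares the count with the 1,048,576-row worksheet limit. It uses only the standard library and a tiny temporary file created on the spot; the helper names `count_data_rows` and `fits_in_excel` are hypothetical and not part of the original analysis.

```python
import csv
import os
import tempfile

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit since Excel 2007


def count_data_rows(path):
    """Count CSV records (excluding the header) without loading the file into memory."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)


def fits_in_excel(n_data_rows):
    """True if the header plus n_data_rows fit in a single worksheet."""
    return n_data_rows + 1 <= EXCEL_MAX_ROWS


# Build a tiny illustrative file; the real dataset would be passed by path instead.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["ID", "Severity"])
    writer.writerows([["A-1", 2], ["A-2", 3], ["A-3", 2]])
    sample_path = tmp.name

n_rows = count_data_rows(sample_path)
os.remove(sample_path)

print(n_rows, fits_in_excel(n_rows))
print(fits_in_excel(7_728_394))  # the row count reported for this dataset
```

Streaming the file keeps memory use flat, so the check itself works even when the file is far too large for Excel.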
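Dataset Information
- ID: Unique identifier of the accident record.
- Severity: Severity of the accident, a number from 1 to 4, where 1 indicates the least impact on traffic and 4 the greatest.
- Start_Time and End_Time: Start and end time of the accident, in local time.
- Start_Lat and Start_Lng: Latitude and longitude of the start point of the accident.
- End_Lat and End_Lng: Latitude and longitude of the end point of the accident's impact; may be empty.
- Distance(mi): Length of the road affected by the accident, in miles.
- Description: Natural-language description of the accident.
- Number, Street, Side, City, County, State, Zipcode, Country: Address information for the accident location.
- Timezone: Time zone of the accident location.
- Airport_Code: Code of the airport-based weather station closest to the accident location.
- Weather_Timestamp, Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in): Weather information at the time of the accident.
- Weather_Condition: Weather conditions at the time of the accident, such as rain, snow, thunderstorm, or fog.
- Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop: Annotations indicating the presence of various points of interest (POI) near the accident location.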
Exploratory Data Analysis (EDA):
Loading the data:
- # import all necessary libraries
- import numpy as np
- import pandas as pd
- import matplotlib.pyplot as plt
- import matplotlib.ticker as ticker
- import matplotlib.patches as mpatches
- %matplotlib inline
- import seaborn as sns
- import calendar
- import plotly as pt
- from plotly import graph_objs as go
- import plotly.express as px
- import plotly.figure_factory as ff
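- import matplotlib
- from matplotlib import cm  # explicit imports for the names previously pulled in by `from pylab import *`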
- import matplotlib.patheffects as PathEffects
- import descartes
- import geopandas as gpd
- from Levenshtein import distance
- from itertools import product
- from fuzzywuzzy import fuzz
- from fuzzywuzzy import process
- from scipy.spatial.distance import pdist, squareform
- from shapely.geometry import Point, Polygon
- import geoplot
- from geopy.geocoders import Nominatim
- import warnings
- warnings.filterwarnings('ignore')
- plt.rcParams['font.family'] = "Microsoft JhengHei UI Light"
- plt.rcParams['font.serif'] = ["Microsoft JhengHei UI Light"]
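Above we timed how long it takes to read 7,728,394 rows and 46 columns into memory; the total was 35.2 seconds:
CPU time:
- User time: 25.1 seconds. This is the time the program spent running in user mode, that is, executing its own code (excluding operating system calls). It mainly reflects the program's own workload.
- System time (sys): 7.32 seconds. This is the time spent in kernel mode, executing operating system calls on the program's behalf (such as file I/O and memory management). It reflects how often, and how heavily, the program interacts with the operating system.
- Total CPU time: 32.4 seconds. This is the sum of user time and system time, the total time the program spent on the CPU.
Wall time:
35.2 seconds in total. This is the real elapsed time from the start of execution to its end, including CPU time, waiting time (for example, waiting for I/O operations to complete), and time when the program was not running (for example, waiting for other processes to release resources).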
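The same three quantities can be measured with the standard library. The sketch below is only an illustration: `timed` and `busy_work` are hypothetical names, and the workload is a stand-in loop rather than the CSV read that was timed in the article. It also checks that the reported user and system times add up to the reported total CPU time.

```python
import os
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, timings) with user, sys, total CPU and wall seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    user = cpu_after.user - cpu_before.user
    system = cpu_after.system - cpu_before.system
    return result, {"user": user, "sys": system, "total": user + system, "wall": wall}


def busy_work(n):
    """Stand-in workload (hypothetical); the article timed a CSV read instead."""
    return sum(i * i for i in range(n))


value, timings = timed(busy_work, 200_000)
print(timings)

# Consistency check on the reported figures: user + sys = total CPU time.
reported_total = round(25.1 + 7.32, 1)
print(reported_total)
```

Wall time can exceed total CPU time, as it does in the article (35.2 s against 32.4 s), because it also counts time spent waiting, for example on disk I/O.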
Data Visualization
- # count accidents per city; rename_axis/reset_index(name=...) yields the same column names on pandas 1.x and 2.x
- city_df = df['City'].value_counts().rename_axis('City').reset_index(name='Cases')
- top_10_cities = pd.DataFrame(city_df.head(10))
- fig, ax = plt.subplots(figsize = (12,7), dpi = 80)
- cmap = plt.get_cmap('rainbow', 10)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=top_10_cities['Cases'], x=top_10_cities['City'], palette='rainbow')
- total = sum(city_df['Cases'])
- # label each bar with its share of all accidents
- for i in ax.patches:
-     ax.text(i.get_x()+.03, i.get_height()-2500,
-             str(round((i.get_height()/total)*100, 2))+'%', fontsize=15, weight='bold',
-             color='white')
- plt.title('\nTop 10 Cities in US with most no. of \nRoad Accident Cases (2016-2020)\n', size=20, color='grey')
- plt.rcParams['font.family'] = "Microsoft JhengHei UI Light"
- plt.rcParams['font.serif'] = ["Microsoft JhengHei UI Light"]
- plt.ylim(1000, 50000)
- plt.xticks(rotation=10, fontsize=12)
- plt.yticks(fontsize=12)
- ax.set_xlabel('\nCities\n', fontsize=15, color='grey')
- ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- for i in ['bottom', 'left']:
-     ax.spines[i].set_color('white')
-     ax.spines[i].set_linewidth(1.5)
-
- right_side = ax.spines["right"]
- right_side.set_visible(False)
- top_side = ax.spines["top"]
- top_side.set_visible(False)
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
- MA = mpatches.Patch(color=clrs[0], label='City with Maximum\n no. of Road Accidents')
- ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=clrs[0], edgecolor='white');
- plt.show()
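As the chart above shows, the US city with the most road accidents (2016-2020) is Los Angeles, with 2.64% of all accidents, followed by Miami with 2.39%. Over the past 5 years, roughly 14% of all accidents came from just these 10 cities out of the 10,657 US cities in the data.
Over the past 5 years (2016-2020), Los Angeles averaged 7,997 road accidents per year.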
- states = gpd.read_file('../input/us-states-map')
- def lat(city):
-     address=city
-     geolocator = Nominatim(user_agent="Your_Name")
-     location = geolocator.geocode(address)
-     return (location.latitude)
- def lng(city):
-     address=city
-     geolocator = Nominatim(user_agent="Your_Name")
-     location = geolocator.geocode(address)
-     return (location.longitude)
- # list of top 10 cities
- top_ten_city_list = list(city_df.City.head(10))
- top_ten_city_lat_dict = {}
- top_ten_city_lng_dict = {}
- for i in top_ten_city_list:
-     top_ten_city_lat_dict[i] = lat(i)
-     top_ten_city_lng_dict[i] = lng(i)
-
- # .copy() avoids pandas' SettingWithCopyWarning when the new columns are added below
- top_10_cities_df = df[df['City'].isin(list(top_10_cities.City))].copy()
- top_10_cities_df['New_Start_Lat'] = top_10_cities_df['City'].map(top_ten_city_lat_dict)
- top_10_cities_df['New_Start_Lng'] = top_10_cities_df['City'].map(top_ten_city_lng_dict)
- geometry_cities = [Point(xy) for xy in zip(top_10_cities_df['New_Start_Lng'], top_10_cities_df['New_Start_Lat'])]
- geo_df_cities = gpd.GeoDataFrame(top_10_cities_df, geometry=geometry_cities)
- fig, ax = plt.subplots(figsize=(15,15))
- ax.set_xlim([-125,-65])
- ax.set_ylim([22,55])
- states.boundary.plot(ax=ax, color='grey');
- colors = ['#e6194B','#f58231','#ffe119','#bfef45','#3cb44b', '#aaffc3','#42d4f4','#4363d8','#911eb4','#f032e6']
- markersizes = [50+(i*20) for i in range(10)][::-1]
- for i in range(10):
-     geo_df_cities[geo_df_cities['City'] == top_ten_city_list[i]].plot(ax=ax, markersize=markersizes[i],
-         color=colors[i], marker='o',
-         label=top_ten_city_list[i], alpha=0.7);
-
- plt.legend(prop={'size': 13}, loc='best', bbox_to_anchor=(0.5, 0., 0.5, 0.5), edgecolor='white', title="Cities", title_fontsize=15);
- for i in ['bottom', 'top', 'left', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- plt.tick_params(top=False, bottom=False, left=False, right=False,
- labelleft=False, labelbottom=False)
- plt.title('\nVisualization of Top 10 Accident Prone Cities in US (2016-2020)', size=20, color='grey');
Three of the top 10 cities by accident count are outside California.
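The `lat` and `lng` helpers used for this map each send their own geocoding request, so every city is looked up twice. One possible way to avoid that, sketched here under assumptions, is to cache a single lookup per city. The dictionary below stands in for the geocoding service and holds approximate city-centre coordinates; `geocode_city` and `lookup_calls` are hypothetical names, not part of the original code.

```python
from functools import lru_cache

# Approximate city-centre coordinates used as stand-in data (assumption: a real
# run would query a geocoding service instead of this dictionary).
_FAKE_GEOCODER = {
    "Los Angeles": (34.05, -118.24),
    "Miami": (25.76, -80.19),
}
lookup_calls = 0  # counts how often the (simulated) service is hit


@lru_cache(maxsize=None)
def geocode_city(city):
    """Return (latitude, longitude) for a city, querying the service once per name."""
    global lookup_calls
    lookup_calls += 1
    return _FAKE_GEOCODER[city]


cities = ["Los Angeles", "Miami"]
lat_by_city = {c: geocode_city(c)[0] for c in cities}
lng_by_city = {c: geocode_city(c)[1] for c in cities}

print(lat_by_city, lng_by_city, lookup_calls)
```

With the cache in place, the second dictionary is filled from memory, which halves the number of requests and is gentler on a rate-limited service.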
- def city_cases_percentage(val, operator):
-     if operator == '<':
-         res = city_df[city_df['Cases']<val].shape[0]
-     elif operator == '>':
-         res = city_df[city_df['Cases']>val].shape[0]
-     elif operator == '=':
-         res = city_df[city_df['Cases']==val].shape[0]
-     print(f'{res} Cities, {round(res*100/city_df.shape[0], 2)}%')
-
-
- city_cases_percentage(1, '=')
- city_cases_percentage(100, '<')
- city_cases_percentage(1000, '<')
- city_cases_percentage(1000, '>')
- city_cases_percentage(5000, '>')
- city_cases_percentage(10000, '>')
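This dataset contains records for 10,657 cities in total:
Over the past 5 years, 1,167 cities (11% of all cities) recorded only 1 accident.
8,682 cities (81% of all cities) recorded fewer than 100 accidents.
10,406 cities recorded fewer than 1,000 accidents.
251 cities recorded more than 1,000 accidents.
40 cities recorded more than 5,000 accidents.
Only 13 cities recorded more than 10,000 accidents.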
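The threshold counts above all follow one pattern: compare each city's case count with a value and report how many cities match. As a hedged sketch of that pattern, the version below maps the comparison symbols to functions from the standard `operator` module instead of using an `if`/`elif` chain. The city names and counts are toy data, and `cities_matching` is a hypothetical helper.

```python
import operator
from collections import Counter

# Map the comparison symbols used in the article's helper to functions.
COMPARATORS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def cities_matching(cases_by_city, threshold, symbol):
    """Return (count, percent) of cities whose case count compares as requested."""
    compare = COMPARATORS[symbol]  # raises KeyError on an unsupported symbol
    count = sum(1 for cases in cases_by_city.values() if compare(cases, threshold))
    return count, round(count * 100 / len(cases_by_city), 2)


# Toy accident records (made-up city names, not taken from the dataset).
records = ["A", "A", "A", "B", "B", "C", "D"]
cases_by_city = Counter(records)  # A: 3, B: 2, C: 1, D: 1

print(cities_matching(cases_by_city, 1, "="))
print(cities_matching(cases_by_city, 3, "<"))
print(cities_matching(cases_by_city, 2, ">"))
```

An unsupported symbol fails immediately with a `KeyError`, rather than leaving the result variable undefined.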
- # create a dictionary using US State code and their corresponding Name
- us_states = {'AK': 'Alaska',
- 'AL': 'Alabama',
- 'AR': 'Arkansas',
- 'AS': 'American Samoa',
- 'AZ': 'Arizona',
- 'CA': 'California',
- 'CO': 'Colorado',
- 'CT': 'Connecticut',
- 'DC': 'District of Columbia',
- 'DE': 'Delaware',
- 'FL': 'Florida',
- 'GA': 'Georgia',
- 'GU': 'Guam',
- 'HI': 'Hawaii',
- 'IA': 'Iowa',
- 'ID': 'Idaho',
- 'IL': 'Illinois',
- 'IN': 'Indiana',
- 'KS': 'Kansas',
- 'KY': 'Kentucky',
- 'LA': 'Louisiana',
- 'MA': 'Massachusetts',
- 'MD': 'Maryland',
- 'ME': 'Maine',
- 'MI': 'Michigan',
- 'MN': 'Minnesota',
- 'MO': 'Missouri',
- 'MP': 'Northern Mariana Islands',
- 'MS': 'Mississippi',
- 'MT': 'Montana',
- 'NC': 'North Carolina',
- 'ND': 'North Dakota',
- 'NE': 'Nebraska',
- 'NH': 'New Hampshire',
- 'NJ': 'New Jersey',
- 'NM': 'New Mexico',
- 'NV': 'Nevada',
- 'NY': 'New York',
- 'OH': 'Ohio',
- 'OK': 'Oklahoma',
- 'OR': 'Oregon',
- 'PA': 'Pennsylvania',
- 'PR': 'Puerto Rico',
- 'RI': 'Rhode Island',
- 'SC': 'South Carolina',
- 'SD': 'South Dakota',
- 'TN': 'Tennessee',
- 'TX': 'Texas',
- 'UT': 'Utah',
- 'VA': 'Virginia',
- 'VI': 'Virgin Islands',
- 'VT': 'Vermont',
- 'WA': 'Washington',
- 'WI': 'Wisconsin',
- 'WV': 'West Virginia',
- 'WY': 'Wyoming'}
- # create a dataframe of State and their corresponding accident cases
- state_df = df['State'].value_counts().rename_axis('State').reset_index(name='Cases')
- # Function to convert the State Code to the actual corresponding Name
- def convert(x): return us_states[x]
- state_df['State'] = state_df['State'].apply(convert)
- top_ten_states_name = list(state_df['State'].head(10))
- fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
- cmap = plt.get_cmap('winter', 10)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=state_df['Cases'].head(10), x=state_df['State'].head(10), palette='winter')
- ax1 = ax.twinx()
- sns.lineplot(data = state_df[:10], marker='o', x='State', y='Cases', color = 'white', alpha = .8)
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x()-0.2, i.get_height()+10000,
-             ' {:,d}\n ({}%) '.format(int(i.get_height()), round(100*i.get_height()/total, 1)), fontsize=15,
-             color='black')
- ax.set(ylim =(-10000, 600000))
- ax1.set(ylim =(-100000, 1700000))
- plt.title('\nTop 10 States with most no. of \nAccident cases in US (2016-2020)\n', size=20, color='grey')
- ax1.axes.yaxis.set_visible(False)
- ax.set_xlabel('\nStates\n', fontsize=15, color='grey')
- ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- for i in ['top','right']:
-     side1 = ax.spines[i]
-     side1.set_visible(False)
-     side2 = ax1.spines[i]
-     side2.set_visible(False)
-
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
- ax.spines['bottom'].set_bounds(0.005, 9)
- ax.spines['left'].set_bounds(0, 600000)
- ax1.spines['bottom'].set_bounds(0.005, 9)
- ax1.spines['left'].set_bounds(0, 600000)
- ax.tick_params(axis='y', which='major', labelsize=10.6)
- ax.tick_params(axis='x', which='major', labelsize=10.6, rotation=10)
- MA = mpatches.Patch(color=clrs[0], label='State with Maximum\n no. of Road Accidents')
- ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=clrs[0], edgecolor='white');
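Over the past 5 years, California has had the most accidents of any US state, about 30% of all road accidents, averaging 246 accidents per day, or roughly 10 per hour.
Florida ranks second, with about 10% of all road accidents.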
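The hourly figure follows directly from the daily average. The short sketch below reproduces that arithmetic; it assumes a 5-year window of 365-day years (leap days ignored), and the helper names are hypothetical.

```python
def per_hour(daily_count):
    """Convert an average daily count into an average hourly count."""
    return daily_count / 24


def per_day(total_cases, years, days_per_year=365):
    """Average daily count over a whole number of years (leap days ignored)."""
    return total_cases / (years * days_per_year)


california_per_day = 246  # figure quoted in the text
print(round(per_hour(california_per_day), 2))  # 10.25, i.e. roughly 10 per hour

# Going the other way, under the 5 x 365-day assumption: a daily average of 246
# corresponds to 246 * 5 * 365 cases over the whole period.
implied_total = california_per_day * 5 * 365
print(implied_total, round(per_day(implied_total, 5)))
```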
- geometry = [Point(xy) for xy in zip(df['Start_Lng'], df['Start_Lat'])]
- geo_df = gpd.GeoDataFrame(df, geometry=geometry)
- geo_df['year'] = geo_df.Start_Time.dt.year
- geo_df['State'] = geo_df['State'].apply(convert)
- fig, ax = plt.subplots(figsize=(15,15))
- ax.set_xlim([-125,-65])
- ax.set_ylim([22,55])
- states.boundary.plot(ax=ax, color='grey');
- # pass the label positionally: newer matplotlib releases renamed annotate's `s` parameter to `text`
- states.apply(lambda x: None if (x.NAME not in top_ten_states_name) else ax.annotate(x.NAME, xy=x.geometry.centroid.coords[0], ha='center', color='black', weight='bold', fontsize=12.5), axis=1);
- # CFOTNYMVNPI
- colors = ['#FF5252','#9575CD','#FF8A80','#FF4081','#FFEE58','#7C4DFF','#00E5FF','#81D4FA','#64FFDA','#8C9EFF']
- count = 0
- for i in list(state_df['State'].head(10)):
-     geo_df[geo_df['State'] == i].plot(ax=ax, markersize=1, color=colors[count], marker='o');
-     count += 1
- for i in ['bottom', 'top', 'left', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- plt.tick_params(top=False, bottom=False, left=False, right=False,
- labelleft=False, labelbottom=False)
- plt.title('\nVisualization of Top 10 Accident Prone States in US (2016-2020)', size=20, color='grey');
- fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
- cmap = plt.get_cmap('cool', 10)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=state_df['Cases'].tail(10), x=state_df['State'].tail(10), palette='cool')
- ax1 = ax.twinx()
- sns.lineplot(data = state_df[-10:], marker='o', x='State', y='Cases', color = 'white', alpha = .8)
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x()-0.1, i.get_height()+100,
-             ' {:,d}\n({}%) '.format(int(i.get_height()), round(100*i.get_height()/total, 2)), fontsize=15,
-             color='black')
- ax.set(ylim =(-50, 5000))
- ax1.set(ylim =(-50, 6000))
- plt.title('\nTop 10 States with least no. of \nAccident cases in US (2016-2020)\n', size=20, color='grey')
- ax1.axes.yaxis.set_visible(False)
- ax.set_xlabel('\nStates\n', fontsize=15, color='grey')
- ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- for i in ['top', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-     side1 = ax1.spines[i]
-     side1.set_visible(False)
-
-
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
- ax.spines['bottom'].set_bounds(0.005, 9)
- ax.spines['left'].set_bounds(0, 5000)
- ax1.spines['bottom'].set_bounds(0.005, 9)
- ax1.spines['left'].set_bounds(0, 5000)
- ax.tick_params(axis='y', which='major', labelsize=11)
- ax.tick_params(axis='x', which='major', labelsize=11, rotation=15)
- MI = mpatches.Patch(color=clrs[-1], label='State with Minimum\n no. of Road Accidents')
- ax.legend(handles=[MI], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=clrs[-1], edgecolor='white');
Over the past 5 years, South Dakota has had the fewest accidents of any US state: only 213 in total, an average of about 42 per year.
- timezone_df = df['Timezone'].value_counts().rename_axis('Timezone').reset_index(name='Cases')
- fig, ax = plt.subplots(figsize = (10,6), dpi = 80)
- cmap = plt.get_cmap('spring', 4)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=timezone_df['Cases'], x=timezone_df['Timezone'], palette='spring')
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x()+0.3, i.get_height()-50000,
-             '{}%'.format(round(i.get_height()*100/total)), fontsize=15,weight='bold',
-             color='white')
-
- plt.ylim(-20000, 700000)
- plt.title('\nPercentage of Accident Cases for \ndifferent Timezone in US (2016-2020)\n', size=20, color='grey')
- plt.ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- plt.xlabel('\nTimezones\n', fontsize=15, color='grey')
- plt.xticks(fontsize=13)
- plt.yticks(fontsize=12)
- for i in ['top', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
- ax.spines['bottom'].set_bounds(0.005, 3)
- ax.spines['left'].set_bounds(0, 700000)
- MA = mpatches.Patch(color=clrs[0], label='Timezone with Maximum\n no. of Road Accidents')
- MI = mpatches.Patch(color=clrs[-1], label='Timezone with Minimum\n no. of Road Accidents')
- ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=[clrs[0], 'grey'], edgecolor='white');
By time zone, the US Eastern time zone has the most accident cases, 39% of the total, while the Mountain time zone has the fewest, only 6%.
- fig, ax = plt.subplots(figsize=(15,15))
- ax.set_xlim([-125,-65])
- ax.set_ylim([22,55])
- states.boundary.plot(ax=ax, color='black');
- colors = ['#00db49', '#ff5e29', '#88ff33', '#fffb29']
- #4132
- count = 0
- for i in list(timezone_df.Timezone):
-     geo_df[geo_df['Timezone'] == i].plot(ax=ax, markersize=1, color=colors[count], marker='o', label=i);
-     count += 1
- plt.legend(markerscale=10., prop={'size': 15}, edgecolor='white', title="Timezones", title_fontsize=15, loc='lower right');
- for i in ['bottom', 'top', 'left', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- plt.tick_params(top=False, bottom=False, left=False, right=False,
- labelleft=False, labelbottom=False)
- plt.title('\nVisualization of Road Accidents \nfor different Timezones in US (2016-2020)', size=20, color='grey');
- street_df = df['Street'].value_counts().rename_axis('Street No.').reset_index(name='Cases')
- top_ten_streets_df = pd.DataFrame(street_df.head(10))
- fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
-
- cmap = plt.get_cmap('gnuplot2', 10)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=top_ten_streets_df['Cases'], x=top_ten_streets_df['Street No.'], palette='gnuplot2')
- ax1 = ax.twinx()
- sns.lineplot(data = top_ten_streets_df, marker='o', x='Street No.', y='Cases', color = 'white', alpha = .8)
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x()+0.04, i.get_height()-2000,
-             '{:,d}'.format(int(i.get_height())), fontsize=12.5,weight='bold',
-             color='white')
-
- ax.axes.set_ylim(-1000, 30000)
- ax1.axes.set_ylim(-1000, 40000)
- plt.title('\nTop 10 Accident Prone Streets in US (2016-2020)\n', size=20, color='grey')
- ax1.axes.yaxis.set_visible(False)
- ax.set_xlabel('\nStreet No.\n', fontsize=15, color='grey')
- ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- for i in ['top','right']:
-     side1 = ax.spines[i]
-     side1.set_visible(False)
-     side2 = ax1.spines[i]
-     side2.set_visible(False)
-
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=.3)
- ax.spines['bottom'].set_bounds(0.005, 9)
- ax.spines['left'].set_bounds(0, 30000)
- ax1.spines['bottom'].set_bounds(0.005, 9)
- ax1.spines['left'].set_bounds(0, 30000)
- ax.tick_params(axis='both', which='major', labelsize=12)
- MA = mpatches.Patch(color=clrs[1], label='Street with Maximum\n no. of Road Accidents')
- MI = mpatches.Patch(color=clrs[-2], label='Street with Minimum\n no. of Road Accidents')
- ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=[clrs[1], 'grey'], edgecolor='white');
Over the past 5 years, I-5 N has the highest accident count of any street in the US, averaging 14 accidents per day.
- def street_cases_percentage(val, operator):
-     if operator == '=':
-         val = street_df[street_df['Cases']==val].shape[0]
-     elif operator == '>':
-         val = street_df[street_df['Cases']>val].shape[0]
-     elif operator == '<':
-         val = street_df[street_df['Cases']<val].shape[0]
-     print('{:,d} Streets, {}%'.format(val, round(val*100/street_df.shape[0], 2)))
-
-
- street_cases_percentage(1, '=')
- street_cases_percentage(100, '<')
- street_cases_percentage(1000, '<')
- street_cases_percentage(1000, '>')
- street_cases_percentage(5000, '>')
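Over the past 5 years, accidents were recorded on 93,048 US streets. Of these, 36,441 streets (39%) had only 1 accident; 98% of streets had fewer than 100 accidents; streets with more than 1,000 accidents make up only 0.2%; and 24 streets had more than 5,000 accidents.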
- # name the count column explicitly so this works on pandas 1.x and 2.x
- severity_df = df['Severity'].value_counts().rename('Cases').to_frame()
- fig = go.Figure(go.Funnelarea(
-     # build the labels from the index so they always match the order of the values
-     text = [f"Severity - {s}" for s in severity_df.index],
-     values = severity_df.Cases,
- title = {"position": "top center",
- "text": "<b>Impact on the Traffic due to the Accidents</b>",
- 'font':dict(size=18,color="#7f7f7f")},
- marker = {"colors": ['#14a3ee', '#b4e6ee', '#fdf4b8', '#ff4f4e'],
- "line": {"color": ["#e8e8e8", "wheat", "wheat", "wheat"], "width": [7, 0, 0, 2]}}
- ))
- fig.show()
Over the past 5 years, 80% of accidents had a moderate impact on traffic, while only 7.5% had a severe impact.
- fig, ax = plt.subplots(figsize=(15,15))
- ax.set_xlim([-125,-65])
- ax.set_ylim([22,55])
- states.boundary.plot(ax=ax, color='black');
- geo_df[geo_df['Severity'] == 1].plot(ax=ax, markersize=50, color='#5cff4a', marker='o', label='Severity 1');
- geo_df[geo_df['Severity'] == 3].plot(ax=ax, markersize=10, color='#ff1c1c', marker='x', label='Severity 3');
- geo_df[geo_df['Severity'] == 4].plot(ax=ax, markersize=1, color='#6459ff', marker='v', label='Severity 4');
- geo_df[geo_df['Severity'] == 2].plot(ax=ax, markersize=5, color='#ffb340', marker='+', label='Severity 2');
- for i in ['bottom', 'top', 'left', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- plt.tick_params(top=False, bottom=False, left=False, right=False,
- labelleft=False, labelbottom=False)
- plt.title('\nDifferent level of Severity visualization in US map', size=20, color='grey');
- One = mpatches.Patch(color='#5cff4a', label='Severity 1')
- Two = mpatches.Patch(color='#ffb340', label='Severity 2')
- Three = mpatches.Patch(color='#ff1c1c', label='Severity 3')
- Four = mpatches.Patch(color='#6459ff', label='Severity 4')
- ax.legend(handles=[One, Two, Three, Four], prop={'size': 15}, loc='lower right', borderpad=1,
- labelcolor=['#5cff4a', '#ffb340', '#ff1c1c', '#6459ff'], edgecolor='white');
- accident_duration_df = pd.DataFrame(df['End_Time'] - df['Start_Time']).reset_index().rename(columns={'index':'Id', 0:'Duration'})
- top_10_accident_duration_df = accident_duration_df['Duration'].value_counts().head(10).sample(frac = 1).rename_axis('Duration').reset_index(name='Cases')
- Duration = [str(i).split('days')[-1].strip() for i in top_10_accident_duration_df.Duration]
- top_10_accident_duration_df['Duration'] = Duration
- fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
- ax.set_facecolor('#e6f2ed')
- fig.patch.set_facecolor('#e6f2ed')
- cmap = plt.get_cmap('bwr', 10)
- clrs = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
- ax=sns.barplot(y=top_10_accident_duration_df['Cases'], x=top_10_accident_duration_df['Duration'], palette='bwr')
- ax1 = ax.twinx()
- sns.lineplot(data = top_10_accident_duration_df, marker='o', x='Duration', y='Cases', color = 'white', alpha = 1)
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x(), i.get_height()+5000,
-             str(round((i.get_height()/total)*100, 2))+'%', fontsize=15,
-             color='black')
- ax.set(ylim =(1000, 400000))
- ax1.set(ylim =(1000, 500000))
- plt.title('\nMost Impacted Durations on the \nTraffic flow due to the Accidents \n', size=20, color='grey')
- ax1.axes.yaxis.set_visible(False)
- ax.set_xlabel('\nDuration of Accident (HH:MM:SS)\n', fontsize=15, color='grey')
- ax.set_ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- for i in ['bottom', 'top', 'left', 'right']:
-     ax.spines[i].set_color('white')
-     ax.spines[i].set_linewidth(1.5)
-     ax1.spines[i].set_color('white')
-     ax1.spines[i].set_linewidth(1.5)
-
- ax.set_axisbelow(True)
- ax.grid(color='white', linewidth=1.5)
- ax.tick_params(axis='both', which='major', labelsize=12)
- MA = mpatches.Patch(color=clrs[-3], label='Duration with Maximum\n no. of Road Accidents')
- ax.legend(handles=[MA], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=clrs[-3], facecolor='#e6f2ed', edgecolor='#e6f2ed');
From the chart above we can infer that the most common impact duration is 6 hours: 24.25% of road accidents affected traffic flow for that long.
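The duration here is simply the end time minus the start time of each record. As a minimal, standard-library sketch of that calculation (the timestamps below are made up for illustration, and the timestamp format is an assumption), the most common duration and its share can be found like this:

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Made-up (Start_Time, End_Time) pairs for illustration only.
rows = [
    ("2020-01-01 08:00:00", "2020-01-01 14:00:00"),
    ("2020-01-02 09:30:00", "2020-01-02 15:30:00"),
    ("2020-01-03 10:00:00", "2020-01-03 10:30:00"),
    ("2020-01-04 23:00:00", "2020-01-05 05:00:00"),
]

# Count how often each exact duration (a timedelta) occurs.
durations = Counter(
    datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for start, end in rows
)

most_common_duration, cases = durations.most_common(1)[0]
share = round(cases * 100 / len(rows), 2)
print(most_common_duration, cases, share)
```

Note that this counts exact durations, so two accidents that differ by a single second fall into different groups; rounding to the minute or hour first would give coarser bins.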
- year_df = df.Start_Time.dt.year.value_counts().rename_axis('Year').reset_index(name='Cases').sort_values(by='Cases', ascending=True)
- fig, ax = plt.subplots(figsize = (12,6), dpi = 80)
- ax=sns.barplot(y=year_df['Cases'], x=year_df['Year'], palette=['#9a90e8', '#5d82de', '#3ee6e0', '#40ff53','#2ee88e'])
- total = df.shape[0]
- for i in ax.patches:
-     ax.text(i.get_x()+0.2, i.get_height()-50000,
-             str(round((i.get_height()/total)*100, 2))+'%', fontsize=15,weight='bold',
-             color='white')
- plt.ylim(10000, 900000)
- plt.title('\nRoad Accident Percentage \nover past 5 Years in US (2016-2020)\n', size=20, color='grey')
- plt.ylabel('\nAccident Cases\n', fontsize=15, color='grey')
- plt.xlabel('\nYears\n', fontsize=15, color='grey')
- plt.xticks(fontsize=13)
- plt.yticks(fontsize=12)
- for i in ['bottom', 'top', 'left', 'right']:
-     ax.spines[i].set_color('white')
-     ax.spines[i].set_linewidth(1.5)
-
- for k in ['top', 'right', "bottom", 'left']:
-     side = ax.spines[k]
-     side.set_visible(False)
- ax.set_axisbelow(True)
- ax.grid(color='#b2d6c7', linewidth=1, axis='y', alpha=0.3)
- MA = mpatches.Patch(color='#2ee88e', label='Year with Maximum\n no. of Road Accidents')
- MI = mpatches.Patch(color='#9a90e8', label='Year with Minimum\n no. of Road Accidents')
- ax.legend(handles=[MA, MI], prop={'size': 10.5}, loc='best', borderpad=1,
- labelcolor=['#2ee88e', '#9a90e8'], edgecolor='white');
- plt.show()
As the chart above shows, the share of accidents rose markedly over the past 5 years (2016-2020), with 70% of all accidents occurring in the last 2 years alone (2019 and 2020).
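A share such as "the last 2 years" is the sum of the yearly counts for those years divided by the total. The sketch below shows the calculation on made-up data using only the standard library; the numbers are illustrative and are not the dataset's.

```python
from collections import Counter

# Made-up Start_Time years, one entry per accident record (illustration only).
years = [2016, 2017, 2017, 2018, 2018, 2019, 2019, 2019, 2020, 2020]

cases_by_year = Counter(years)
total = sum(cases_by_year.values())

# Percentage of all records that fall in each year, in chronological order.
share_by_year = {
    year: round(cases * 100 / total, 2)
    for year, cases in sorted(cases_by_year.items())
}
# Combined share of the two most recent years.
last_two_years = round(
    (cases_by_year[2019] + cases_by_year[2020]) * 100 / total, 2
)

print(share_by_year)
print(last_two_years)
```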
- fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6)) = plt.subplots(nrows=3, ncols=2, figsize=(15, 10))
- fig.suptitle('Accident Cases over the past 5 years in US', fontsize=20,fontweight ="bold", color='grey')
- count = 0
- years = ['2016', '2017', '2018', '2019', '2020']
- colors = ['#77fa5a', '#ffff4d', '#ffab36', '#ff894a', '#ff513b']
- for i in [ax1, ax2, ax3, ax4, ax5]:
-     i.set_xlim([-125,-65])
-     i.set_ylim([22,55])
-     states.boundary.plot(ax=i, color='black');
-     geo_df[geo_df['year']==int(years[count])].plot(ax=i, markersize=1, color=colors[count], marker='+', alpha=0.5)
-     for j in ['bottom', 'top', 'left', 'right']:
-         side = i.spines[j]
-         side.set_visible(False)
-     # look the count up by year instead of relying on the sort order of year_df
-     i.set_title(years[count] + '\n({:,} Road Accident Cases)'.format(int(year_df.set_index('Year')['Cases'].loc[int(years[count])])), fontsize=12, color='grey', weight='bold')
-     i.axis('off')
-     count += 1
-
- sns.lineplot(data = year_df, marker='o', x='Year', y='Cases', color = '#734dff', ax=ax6, label="Yearly Road Accidents");
- for k in ['bottom', 'top', 'left', 'right']:
-     side = ax6.spines[k]
-     side.set_visible(False)
- ax6.xaxis.set_ticks(year_df.Year);
- ax6.legend(prop={'size': 12}, loc='best', edgecolor='white');
- accident_severity_df = geo_df.groupby(['year', 'Severity']).size().unstack()
- ax = accident_severity_df.plot(kind='barh', stacked=True, figsize=(12, 6),
- color=['#fcfa5d', '#ffe066', '#fab666', '#f68f6a'],
- rot=0);
- ax.set_title('\nSeverity and Corresponding Accident \nPercentage for past 5 years in US\n', fontsize=20, color='grey');
- for i in ['top', 'left', 'right']:
-     side = ax.spines[i]
-     side.set_visible(False)
-
- ax.spines['bottom'].set_bounds(0, 800000);
- ax.set_ylabel('\nYears\n', fontsize=15, color='grey');
- ax.set_xlabel('\nAccident Cases\n', fontsize=15, color='grey');
- ax.legend(prop={'size': 12.5}, loc='best', fancybox = True, title="Severity", title_fontsize=15, edgecolor='white');
- ax.tick_params(axis='both', which='major', labelsize=12.5)
- #ax.set_facecolor('#e6f2ed')
-
- for p in ax.patches:
-     width, height = p.get_width(), p.get_height()
-     x, y = p.get_xy()
-     var = width*100/df.shape[0]
-     if var > 0:
-         if var > 4:
-             ax.text(x+width/2,
-                     y+height/2-0.05,
-                     '{:.2f}%'.format(width*100/df.shape[0]),
-                     fontsize=12, color='black', alpha= 0.8)
-         elif var > 1.8 and var < 3.5:
-             ax.text(x+width/2-17000,
-                     y+height/2-0.05,
-                     '{:.2f}%'.format(width*100/df.shape[0]),
-                     fontsize=12, color='black', alpha= 0.8)
-         elif var>1.5 and var<1.8:
-             ax.text(x+width/2+7000,
-                     y+height/2-0.05,
-                     ' {:.2f}%'.format(width*100/df.shape[0]),
-                     fontsize=12, color='black', alpha= 0.8)
-         elif var>1:
-             ax.text(x+width/2-20000,
-                     y+height/2-0.05,
-                     ' {:.2f}%'.format(width*100/df.shape[0]),
-                     fontsize=12, color='black', alpha= 0.8)
-         else:
-             ax.text(x+width/2+10000,
-                     y+height/2-0.05,
-                     ' {:.2f}%'.format(width*100/df.shape[0]),
-                     fontsize=12, color='black', alpha= 0.8)
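Over the last 4 years (2017-2020), highly severe accidents in the US stayed between 1.55% and 1.8% of all cases. Moderately severe accidents that occurred in 2020 alone account for 45% of all road accidents in the past 5 years.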
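Summary
This article set out to find the upper limit of loading a large dataset into memory on a home computer, but the dataset was not large enough to reach that limit. Along the way I experimented with modules for drawing maps of the US, which took a lot of time. Displaying Chinese text in the charts was also a problem, so in the end I used translation software to convert the labels to English. Interested readers are welcome to keep exploring; for my part, I am giving up on this one.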