An exploratory statistical analysis of the 2014 World Cup Final


As a fan of both data and soccer (from here on referred to as football), I find the football fan's attitude toward statistics and data analysis perplexing, although understandable after years and years of simple stats being the only thing the media focuses on. Football is a complex team sport with deep interactions, so counting events (goals, assists, tackles, etc.) isn't enough. This notebook shows how you can use play-by-play data to analyse a football match, with custom measures and visualizations to better understand the sport.

Disclaimer: I'm a fan, not an expert. [Germany's National Team](http://www.telegraph.co.uk/technology/news/10959864/Germanys-World-Cup-tactics-shaped-by-data.html) and [Manchester City](http://www.wired.co.uk/magazine/archive/2014/01/features/the-winning-formula) have whole teams dedicated to data analysis, and the state of the art is well beyond what is shown here. However, that analysis is rarely made public, so I hope this is useful (or at least entertaining).

I hope to keep playing with the data and share useful insights in the future. Feel free to star the GitHub repository, or drop me an email at rjtavares@gmail.com.

### A note on the data used

This play-by-play data was gathered from a public website, and I have no guarantee that it is consistent or correct. The process used to gather it will be the subject of its own post, so stay tuned. On the other hand, all calculations based on the raw data are included in this notebook, and should be questioned. I would love to get some feedback.

### Some preprocessing

Module imports and data preparation. You can ignore this section unless you want to play with the data yourself.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from footyscripts.footyviz import draw_events, draw_pitch, type_names

    # plotting settings
    %matplotlib inline
    # pd.options.display.mpl_style is no longer available in later pandas releases,
    # so a matplotlib style sheet is used instead
    plt.style.use('ggplot')

    df = pd.read_csv("../datasets/germany-vs-argentina-731830.csv", encoding='utf-8', index_col=0)

    # standard dimensions
    x_size = 105.0
    y_size = 68.0
    box_height = 16.5 * 2 + 7.32
    box_width = 16.5
    y_box_start = y_size / 2 - box_height / 2
    y_box_end = y_size / 2 + box_height / 2

    # rescale the raw coordinates to the pitch dimensions
    df['x'] = df['x'] / 100 * x_size
    df['y'] = df['y'] / 100 * y_size
    df['to_x'] = df['to_x'] / 100 * x_size
    df['to_y'] = df['to_y'] / 100 * y_size

    # creating some measures and classifiers from the original
    df['count'] = 1
    df['dx'] = df['to_x'] - df['x']
    df['dy'] = df['to_y'] - df['y']
    df['distance'] = np.sqrt(df['dx'] ** 2 + df['dy'] ** 2)
    df['fivemin'] = np.floor(df['min'] / 5) * 5
    df['type_name'] = df['type'].map(type_names.get)
    df['to_box'] = (df['to_x'] > x_size - box_width) & (y_box_start < df['to_y']) & (df['to_y'] < y_box_end)
    df['from_box'] = (df['x'] > x_size - box_width) & (y_box_start < df['y']) & (df['y'] < y_box_end)
    df['on_offense'] = df['x'] > x_size / 2
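
To make the box geometry concrete, here is a minimal sketch in plain Python (no pandas, no match data) of the same rescaling and penalty-box test. It assumes, as the division by 100 suggests, that the raw coordinates are percentages of the pitch length and width. The helper names `to_metres` and `in_attacking_box` and the three sample points are hypothetical, made up for illustration.

```python
# Pitch geometry in metres, same constants as the preprocessing cell.
X_SIZE = 105.0
Y_SIZE = 68.0
BOX_WIDTH = 16.5
BOX_HEIGHT = 16.5 * 2 + 7.32  # 16.5 m on each side of a 7.32 m goal
Y_BOX_START = Y_SIZE / 2 - BOX_HEIGHT / 2
Y_BOX_END = Y_SIZE / 2 + BOX_HEIGHT / 2


def to_metres(x_pct, y_pct):
    """Convert coordinates on an assumed 0-100 scale to metres."""
    return x_pct / 100 * X_SIZE, y_pct / 100 * Y_SIZE


def in_attacking_box(x, y):
    """True if a point (in metres) lies inside the attacked penalty box."""
    return x > X_SIZE - BOX_WIDTH and Y_BOX_START < y < Y_BOX_END


# Hypothetical points, not taken from the match data.
near_penalty_spot = to_metres(89.5, 50.0)
centre_circle = to_metres(50.0, 50.0)
wide_near_corner = to_metres(95.0, 5.0)

for name, point in [("near penalty spot", near_penalty_spot),
                    ("centre circle", centre_circle),
                    ("wide near corner", wide_near_corner)]:
    print(name, point, in_attacking_box(*point))
```

The third point is past the edge of the box along the length of the pitch but too wide to count, which is why the test needs both the `x` and the `y` condition.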

    dfPeriod1 = df[df['period'] == 1]
    dfP1Shots = dfPeriod1[dfPeriod1['type'].isin([13, 14, 15, 16])]
    dfPeriod2 = df[df['period'] == 2]
    dfP2Shots = dfPeriod2[dfPeriod2['type'].isin([13, 14, 15, 16])]
    dfExtraTime = df[df['period'] > 2]
    dfETShots = dfExtraTime[dfExtraTime['type'].isin([13, 14, 15, 16])]

## The first half

Let's get a quick profile of the first half. The chart below shows where in the field most events took place (positive numbers correspond to Germany's offensive half, negative numbers to its defensive half), with each team's shots pointed out.

    fig = plt.figure(figsize=(12, 4))

    # per-minute mean x of Germany's events minus Argentina's
    avg_x = (dfPeriod1[dfPeriod1['team_name'] == 'Germany'].groupby('min')['x'].mean()
             - dfPeriod1[dfPeriod1['team_name'] == 'Argentina'].groupby('min')['x'].mean())

    plt.stackplot(list(avg_x.index.values), list([x if x > 0 else 0 for x in avg_x]))
    plt.stackplot(list(avg_x.index.values), list([x if x < 0 else 0 for x in avg_x]))

    # annotate each shot with its type and the team's initial
    for i, shot in dfP1Shots.iterrows():
        x = shot['min']
        y = avg_x.loc[shot['min']]
        signal = 1 if shot['team_name'] == 'Germany' else -1
        plt.annotate(shot['type_name'] + ' (' + shot['team_name'][0] + ")", xy=(x, y),
                     xytext=(x - 5, y + 30 * signal), arrowprops=dict(facecolor='black'))

    plt.gca().set_xlabel('minute')
    plt.title("First Half Profile")
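
The quantity plotted in the profile chart is simply a per-minute difference of mean `x` positions. As a rough illustration of that calculation, here is a sketch in plain Python on a handful of made-up events. The function name `territorial_profile` and the sample values are hypothetical. Unlike the pandas version, which leaves a missing value for minutes where one team has no events, this sketch skips such minutes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical events: (minute, team, x in metres). Not real match data.
events = [
    (1, "Germany", 60.0), (1, "Germany", 70.0), (1, "Argentina", 40.0),
    (2, "Germany", 30.0), (2, "Argentina", 55.0), (2, "Argentina", 65.0),
    (3, "Germany", 50.0),  # Argentina has no events in minute 3
]


def territorial_profile(events, team_a, team_b):
    """Per-minute mean x of team_a minus mean x of team_b.

    Minutes in which either team has no events are skipped.
    """
    xs = defaultdict(lambda: defaultdict(list))
    for minute, team, x in events:
        xs[minute][team].append(x)
    return {
        minute: mean(by_team[team_a]) - mean(by_team[team_b])
        for minute, by_team in sorted(xs.items())
        if by_team[team_a] and by_team[team_b]
    }


profile = territorial_profile(events, "Germany", "Argentina")
print(profile)
```

A positive value means the first team's events happened, on average, further up the pitch than the second team's during that minute.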

The first 45' of the final were incredibly interesting. Germany dominated possession and pressured high, forcing Argentina to play in its own half. That is obvious once we look at Argentina's passes during the first half:

    draw_pitch()
    draw_events(dfPeriod1[(dfPeriod1['type'] == 1) & (dfPeriod1['outcome'] == 1)
                          & (dfPeriod1['team_name'] == 'Argentina')], mirror_away=True)
    plt.text(x_size / 4, -3, "Germany's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
    plt.text(x_size * 3 / 4, -3, "Argentina's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
    plt.title("Argentina's passes during the first half")

    dfPeriod1.groupby('team_name').agg({'x': 'mean', 'on_offense': 'mean'})

    dfPeriod1[dfPeriod1['type'] == 1].groupby('team_name').agg({'outcome': 'mean'})

Only 28% of Argentina's passes were made in its offensive half, versus 61% for Germany. Despite playing in the offensive half, Germany managed a much higher passing accuracy, a testament to its amazing midfield. However, that superiority didn't manifest itself in chances and shots. In fact, Germany had quite a difficult time trying to get inside Argentina's penalty box.
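
Both percentages come from averaging a true/false or 0/1 column per team: the mean of `on_offense` is the share of events past the halfway line, and the mean of `outcome` is the share of successful ones. A small sketch of that idea in plain Python, using made-up pass records (the helper `team_summary` and all the values are hypothetical):

```python
X_SIZE = 105.0  # pitch length in metres

# Hypothetical passes: (team, x in metres, outcome), outcome 1 = completed.
passes = [
    ("Germany", 70.0, 1), ("Germany", 40.0, 1),
    ("Germany", 80.0, 0), ("Germany", 60.0, 1),
    ("Argentina", 30.0, 1), ("Argentina", 45.0, 0),
    ("Argentina", 65.0, 0), ("Argentina", 20.0, 1),
]


def team_summary(passes, team):
    """Share of a team's passes made past halfway, and share completed."""
    rows = [(x, outcome) for t, x, outcome in passes if t == team]
    n = len(rows)
    return {
        "on_offense": sum(x > X_SIZE / 2 for x, _ in rows) / n,
        "accuracy": sum(outcome for _, outcome in rows) / n,
    }


germany = team_summary(passes, "Germany")
argentina = team_summary(passes, "Argentina")
print(germany, argentina)
```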

    draw_pitch()
    # successful passes into the box from outside it
    draw_events(df[(df['to_box'] == True) & (df['type'] == 1) & (df['from_box'] == False)
                   & (df['period'] == 1) & (df['outcome'] == 1)], mirror_away=True)
    # failed passes into the box, drawn fainter
    draw_events(df[(df['to_box'] == True) & (df['type'] == 1) & (df['from_box'] == False)
                   & (df['period'] == 1) & (df['outcome'] == 0)], mirror_away=True, alpha=0.2)
    draw_events(dfP1Shots, mirror_away=True, base_color='#a93e3e')
    plt.text(x_size / 4, -3, "Germany's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
    plt.text(x_size * 3 / 4, -3, "Argentina's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')

    # the pass filter is built from dfPeriod1 so that every mask shares the same index
    dfPeriod1[(dfPeriod1['to_box'] == True) & (dfPeriod1['from_box'] == False)
              & (dfPeriod1['type'] == 1)].groupby(['team_name']).agg({'outcome': 'mean', 'count': 'sum'})

Out of 16 German attempts to get into the box, only one resulted in a shot: a late corner, with Howedes hitting the post.

## The curious case of Christoph Kramer

Kramer suffered an injury in the 19th minute, but was only substituted 12 minutes later. This included Germany's worst period in the first half, as the first half profile chart above shows. [Reports say that he acted confused](http://www.theguardian.com/football/2014/jul/17/christoph-kramer-germany-concussion-world-cup-final-2014), and the data shows that Kramer was largely "absent" in the period between the injury and the substitution: his only actions were one successful reception and pass, and one loss of possession.

    # copy so that new columns can be added without touching df
    dfKramer = df[df['player_name'] == 'Christoph Kramer'].copy()
    pd.pivot_table(dfKramer, values='count', index='type_name', columns='min', aggfunc='sum', fill_value=0)

    dfKramer['action'] = dfKramer['outcome'].map(str) + '-' + dfKramer['type_name']
    dfKramer['action'].unique()

    score = {'1-LINEUP': 0,
             '1-RUN WITH BALL': 0.5,
             '1-RECEPTION': 0,
             '1-PASS': 1,
             '0-PASS': -1,
             '0-TACKLE (NO CONTROL)': 0,
             '1-CLEAR BALL (OUT OF PITCH)': 0.5,
             '0-LOST CONTROL OF BALL': -1,
             '1-SUBSTITUTION (OFF)': 0}
    dfKramer['score'] = dfKramer['action'].map(score.get)

    dfKramer.groupby('min')['score'].sum().reindex(range(32), fill_value=0).plot(kind='bar')
    plt.annotate('Injury', (19, 0.5), (14, 1.1), arrowprops=dict(facecolor='black'))
    plt.annotate('Substitution', (31, 0), (22, 1.6), arrowprops=dict(facecolor='black'))
    plt.gca().set_xlabel('minute')
    # the bars show the summed score per minute, not a raw event count
    plt.gca().set_ylabel('score')
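
The scoring scheme is easy to reproduce without pandas: build an `outcome-TYPE` key for each event, look up its weight, and add the weights up per minute. The sketch below does that in plain Python. The weights are a subset of the `score` dictionary above, while the events and the function name `score_per_minute` are hypothetical. One deliberate difference from the notebook: an action with no weight raises a `KeyError` here instead of silently becoming a missing value.

```python
# Weights copied from a subset of the notebook's `score` dictionary.
SCORE = {
    "1-RECEPTION": 0,
    "1-PASS": 1,
    "0-PASS": -1,
    "1-RUN WITH BALL": 0.5,
    "0-LOST CONTROL OF BALL": -1,
}

# Hypothetical events: (minute, outcome, type_name). Not Kramer's real actions.
events = [
    (3, 1, "PASS"),
    (3, 1, "RUN WITH BALL"),
    (5, 0, "PASS"),
    (7, 1, "RECEPTION"),
    (7, 0, "LOST CONTROL OF BALL"),
]


def score_per_minute(events, minutes):
    """Sum the action weights per minute; minutes with no events score 0."""
    totals = {minute: 0.0 for minute in minutes}
    for minute, outcome, type_name in events:
        if minute in totals:
            # An unmapped action raises KeyError so it cannot be overlooked.
            totals[minute] += SCORE[f"{outcome}-{type_name}"]
    return totals


totals = score_per_minute(events, range(3, 8))
print(totals)
```

Filling the empty minutes with zero plays the same role as the `reindex(range(32), fill_value=0)` call in the chart: quiet minutes show up as gaps rather than disappearing from the axis.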

## The second half

The second half was much more balanced. We reproduce the same charts as for the first half, which confirm this perception.

    fig = plt.figure(figsize=(12, 4))

    # per-minute mean x of Germany's events minus Argentina's
    avg_x = (dfPeriod2[dfPeriod2['team_name'] == 'Germany'].groupby('min')['x'].mean()
             - dfPeriod2[dfPeriod2['team_name'] == 'Argentina'].groupby('min')['x'].mean())

    plt.stackplot(list(avg_x.index.values), list([x if x > 0 else 0 for x in avg_x]))
    plt.stackplot(list(avg_x.index.values), list([x if x < 0 else 0 for x in avg_x]))

    # annotate each shot with its type and the team's initial
    for i, shot in dfP2Shots.iterrows():
        x = shot['min']
        y = avg_x.loc[shot['min']]
        signal = 1 if shot['team_name'] == 'Germany' else -1
        plt.annotate(shot['type_name'] + ' (' + shot['team_name'][0] + ")", xy=(x, y),
                     xytext=(x - 5, y + 30 * signal), arrowprops=dict(facecolor='black'))

    plt.gca().set_xlabel('minute')
    plt.title("Second Half Profile")

    dfPeriod2.groupby('team_name').agg({'x': 'mean', 'on_offense': 'mean'})

    dfPeriod2[dfPeriod2['type'] == 1].groupby('team_name').agg({'outcome': 'mean'})

    draw_pitch()
    # successful passes into the box from outside it
    draw_events(df[(df['to_box'] == True) & (df['type'] == 1) & (df['from_box'] == False)
                   & (df['period'] == 2) & (df['outcome'] == 1)], mirror_away=True)
    # failed passes into the box, drawn fainter
    draw_events(df[(df['to_box'] == True) & (df['type'] == 1) & (df['from_box'] == False)
                   & (df['period'] == 2) & (df['outcome'] == 0)], mirror_away=True, alpha=0.2)
    draw_events(dfP2Shots, mirror_away=True, base_color='#a93e3e')
    plt.text(x_size / 4, -3, "Germany's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
    plt.text(x_size * 3 / 4, -3, "Argentina's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')

Even though Germany had much more success getting inside the box in the second half, only one pass resulted in a German shot from inside the box.

    # the pass filter is built from dfPeriod2 so that every mask shares the same index
    dfPeriod2[(dfPeriod2['to_box'] == True) & (dfPeriod2['from_box'] == False)
              & (dfPeriod2['type'] == 1)].groupby(['team_name']).agg({'outcome': 'mean', 'count': 'sum'})

## The extra time

    fig = plt.figure(figsize=(12, 4))

    # per-minute mean x for each team; Argentina's series is reindexed to every
    # extra-time minute, with 0 where it has no events
    germany_x = dfExtraTime[dfExtraTime['team_name'] == 'Germany'].groupby('min')['x'].mean()
    argentina_x = (dfExtraTime[dfExtraTime['team_name'] == 'Argentina'].groupby('min')['x'].mean()
                   .reindex(dfExtraTime['min'].unique(), fill_value=0))
    avg_x = germany_x - argentina_x

    plt.stackplot(list(avg_x.index.values), list([x if x > 0 else 0 for x in avg_x]))
    plt.stackplot(list(avg_x.index.values), list([x if x < 0 else 0 for x in avg_x]))

    # annotate each shot with its type and the team's initial
    for i, shot in dfETShots.iterrows():
        x = shot['min']
        y = avg_x.loc[shot['min']]
        signal = 1 if shot['team_name'] == 'Germany' else -1
        plt.annotate(shot['type_name'] + ' (' + shot['team_name'][0] + ")", xy=(x, y),
                     xytext=(x - 5, y + 20 * signal), arrowprops=dict(facecolor='black'))

    plt.gca().set_xlabel('minute')
    plt.title("Extra Time Profile")

    df.groupby(['team_name', 'period']).agg({'count': 'sum', 'x': 'mean', 'on_offense': 'mean'})

We can see that Germany's 4th period was quite different from the rest of the match. Germany played much deeper and made significantly fewer passes, which suggests that it tried to control the game and slow it down by dropping the defensive line. This is even more evident if we look at the time interval after the goal:

    # index label of the first goal event (type 16); .loc slices by label
    goal_ix = df[df['type'] == 16].index[0]
    df.loc[goal_ix + 1:].groupby(['team_name', 'period']).agg({'count': 'sum', 'x': 'mean', 'on_offense': 'mean'})
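
The "after the goal" slice boils down to locating the first goal event and keeping everything that follows it. Here is a plain-Python sketch of that step on a made-up event list. Type 16 is the goal type used in the notebook; the helper names `after_first_goal` and `events_per_team` and the events themselves are hypothetical.

```python
from collections import Counter

GOAL = 16  # event type the notebook uses to find the goal

# Hypothetical event sequence in chronological order. Not real match data.
events = [
    {"team": "Germany", "type": 1},
    {"team": "Argentina", "type": 1},
    {"team": "Germany", "type": 16},   # the goal
    {"team": "Argentina", "type": 1},
    {"team": "Argentina", "type": 13},
    {"team": "Germany", "type": 1},
]


def after_first_goal(events):
    """Return the events that follow the first goal (empty if no goal)."""
    for i, event in enumerate(events):
        if event["type"] == GOAL:
            return events[i + 1:]
    return []


def events_per_team(events):
    """Count events per team."""
    return Counter(event["team"] for event in events)


tail = after_first_goal(events)
print(events_per_team(tail))
```

Slicing by position works here because the list is already in chronological order; the pandas version relies on the DataFrame index being ordered the same way.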

    # events after the goal; masks are built from the same slice so the indexes match
    dfAfterGoal = df.loc[goal_ix + 1:]
    box_pass = ((dfAfterGoal['to_box'] == True) & (dfAfterGoal['type'] == 1)
                & (dfAfterGoal['from_box'] == False))

    draw_pitch()
    draw_events(dfAfterGoal[box_pass & (dfAfterGoal['outcome'] == 1)], mirror_away=True)
    draw_events(dfAfterGoal[box_pass & (dfAfterGoal['outcome'] == 0)], mirror_away=True, alpha=0.2)
    draw_events(dfAfterGoal[dfAfterGoal['type'].isin([13, 14, 15, 16])], mirror_away=True, base_color='#a93e3e')
    plt.text(x_size / 4, -3, "Germany's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')
    plt.text(x_size * 3 / 4, -3, "Argentina's defense", color='black',
             bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='center')

    # shots taken after the goal
    after_goal = df.loc[goal_ix + 1:]
    after_goal[after_goal['type'].isin([13, 14, 15, 16])][['min', 'player_name', 'team_name', 'type_name']]

Germany completely gave up trying to score another goal, with only one attempt at a pass into the box. However, its defensive strategy was successful, with Argentina barely entering Germany's penalty box. Argentina's only two shots came from outside the box, both by Messi, who at this point was probably feeling somewhat desperate.

## The goal

    # the 30 events leading up to the goal, plus the goal itself
    goal = df[df['type'] == 16].index[0]
    dfGoal = df.loc[goal - 30:goal].copy()

    draw_pitch()
    draw_events(dfGoal[dfGoal.team_name == 'Germany'], base_color='white')
    draw_events(dfGoal[dfGoal.team_name == 'Argentina'], base_color='cyan')

    # Germany's players involved in the play
    dfGoal['progression'] = dfGoal['to_x'] - dfGoal['x']
    dfGoal[dfGoal['type'].isin([1, 101, 16])][['player_name', 'type_name', 'progression']]
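
The table above lists progression event by event. To compare players, one natural next step (not part of the original notebook) is to add up `to_x - x` per player. A minimal sketch in plain Python, with made-up player labels and coordinates and a hypothetical helper `progression_by_player`:

```python
from collections import defaultdict

# Hypothetical moves: (player, x, to_x) in metres. Not the real goal sequence.
moves = [
    ("Player A", 30.0, 55.0),
    ("Player B", 55.0, 50.0),  # a backwards pass gives negative progression
    ("Player A", 50.0, 80.0),
    ("Player C", 80.0, 95.0),
]


def progression_by_player(moves):
    """Total forward progression (to_x - x) per player, largest first."""
    totals = defaultdict(float)
    for player, x, to_x in moves:
        totals[player] += to_x - x
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = progression_by_player(moves)
print(ranking)
```

Summing signed differences means a backwards pass reduces a player's total, so the ranking rewards net movement towards the opponent's goal rather than sheer involvement.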

## Some basic stats

    # passing accuracy
    df.groupby(['player_name', 'team_name']).agg({'count': 'sum', 'outcome': 'mean'}).sort_values('count', ascending=False)

    # shots
    pd.pivot_table(df[df['type'].isin([13, 14, 15, 16])],
                   values='count',
                   aggfunc='sum',
                   index=['player_name', 'team_name'],
                   columns='type_name',
                   fill_value=0,
                   margins=True).sort_values('All', ascending=False)

    # defensive play
    pd.pivot_table(df[df['type'].isin([7, 8, 49])],
                   values='count',
                   aggfunc='sum',
                   index=['player_name', 'team_name'],
                   columns='type_name',
                   fill_value=0,
                   margins=True).sort_values('All', ascending=False)

In short: Kroos and Schweinsteiger were immense in Germany's midfield. Kroos had the most defensive actions, the second most shots (behind Messi only due to his desperate late attempts), and the second most passes. He was also responsible for most of the progression in Germany's goal-leading play. He was objectively the man of the match. Why FIFA decided to give that award to Goetze is beyond me (and most football fans).

