Python Environment Setup and Basic Data Preprocessing: A Hands-On Case Study with a Big-Data ML Sample Set

Category: Python · Published: 6 years ago

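Summary: Copyright notice: this technical column is the author's (Qin Kaixin) summary and refinement of day-to-day work. It distills case studies drawn from real commercial environments and offers tuning advice for commercial applications as well as capacity planning for cluster environments. Please keep following this blog series. QQ email: 1120746959@qq.com; feel free to get in touch for any academic exchange.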

1 Python Environment Setup

Jupyter Notebook shortcuts:
Shift + Enter: run the current cell and move to the next one
Ctrl + Enter: run the current cell in place

2 Python IDE Setup

3 Data Preprocessing

Showing the first few rows

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline

    from sklearn.ensemble import RandomForestClassifier
    # sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
    from sklearn.model_selection import KFold

    # import data
    filename = "C:\\ML\\MLData\\data.csv"
    raw = pd.read_csv(filename)
    print(raw.shape)
    raw.head()

Showing the last few rows
Removing null values
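
The two steps above appear without code in the article. A minimal sketch follows, using a small made-up DataFrame in place of data.csv. Two things here are assumptions of this sketch and not stated in the text: that the null values sit in a target column named `shot_made_flag`, and that the filtered frame is what later snippets call `kobe`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV; the column names are illustrative assumptions.
raw = pd.DataFrame({
    "loc_x": [10, -5, 0, 7],
    "loc_y": [20, 15, 30, 2],
    "shot_made_flag": [1.0, np.nan, 0.0, np.nan],
})

# Last rows of the table (the counterpart of raw.head()).
print(raw.tail(2))

# Keep only the rows whose target value is present.
kobe = raw[raw["shot_made_flag"].notnull()]
print(kobe.shape)
```

`raw.dropna(subset=["shot_made_flag"])` would give the same rows; the boolean mask is used here only because it makes the filtering condition explicit.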

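Plotting column distributions with matplotlib

    # plt.subplot(211): the first digit is the number of rows, the second the number of columns
    # transparency (controls how dark densely populated regions appear)
    alpha = 0.02
    # overall size of the figure
    plt.figure(figsize=(10,10))
    # loc_x and loc_y (1 row, 2 columns, first position)
    plt.subplot(121)
    # scatter plot; single-letter color codes must be lowercase in recent matplotlib
    plt.scatter(kobe.loc_x, kobe.loc_y, color='r', alpha=alpha)
    plt.title('loc_x and loc_y')
    # lat and lon (1 row, 2 columns, second position)
    plt.subplot(122)
    plt.scatter(kobe.lon, kobe.lat, color='b', alpha=alpha)
    plt.title('lat and lon')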

Angle and polar-coordinate preprocessing

    raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)
    loc_x_zero = raw['loc_x'] == 0
    # print(loc_x_zero)
    # start from a float column, then fill it through .loc to avoid chained assignment
    raw['angle'] = 0.0
    raw.loc[~loc_x_zero, 'angle'] = np.arctan(raw.loc[~loc_x_zero, 'loc_y'] / raw.loc[~loc_x_zero, 'loc_x'])
    raw.loc[loc_x_zero, 'angle'] = np.pi / 2
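
As a cross-check on the angle computation, the sketch below applies the same arctan rule to three hand-picked points and compares the result with `np.arctan2`. The points are made up, and `np.arctan2` is an alternative named here for comparison, not the article's method.

```python
import numpy as np
import pandas as pd

# Three toy points: right of the origin, on the y-axis, left of the origin.
pts = pd.DataFrame({"loc_x": [3.0, 0.0, -3.0], "loc_y": [4.0, 5.0, 4.0]})

# Radial distance from the origin.
pts["dist"] = np.sqrt(pts["loc_x"] ** 2 + pts["loc_y"] ** 2)

# Same rule as the article: arctan(y / x), with pi/2 wherever x == 0.
x_zero = pts["loc_x"] == 0
pts["angle"] = np.pi / 2
pts.loc[~x_zero, "angle"] = np.arctan(
    pts.loc[~x_zero, "loc_y"] / pts.loc[~x_zero, "loc_x"]
)

# Alternative: arctan2 handles x == 0 by itself and keeps the quadrant.
pts["angle2"] = np.arctan2(pts["loc_y"], pts["loc_x"])
print(pts)
```

For points with a negative loc_x the two columns differ by pi, because arctan of the ratio folds opposite quadrants together. Which convention suits a model better is a design choice.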

Time handling

    # merge minutes and seconds into a single "seconds remaining" feature
    raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remaining']
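
The conversion can be checked by hand on a few rows. The values below are made up for illustration.

```python
import pandas as pd

# Toy clock readings: 10:27, 0:05 and 3:00 left on the clock.
clock = pd.DataFrame({
    "minutes_remaining": [10, 0, 3],
    "seconds_remaining": [27, 5, 0],
})

# Total seconds left, computed with the same formula as above.
clock["remaining_time"] = clock["minutes_remaining"] * 60 + clock["seconds_remaining"]
print(clock["remaining_time"].tolist())  # [627, 5, 180]
```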

Printing unique attribute values and grouped counts

    # shot types
    print(kobe.action_type.unique())
    print(kobe.combined_shot_type.unique())
    print(kobe.shot_type.unique())
    # grouped counts
    print(kobe.shot_type.value_counts())

Handling special characters column by column

    kobe['season'].unique()

    # output:
    array(['2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06',
           '2006-07', '2007-08', '2008-09', '2009-10', '2010-11', '2011-12',
           '2012-13', '2013-14', '2014-15', '2015-16', '1996-97', '1997-98',
           '1998-99', '1999-00'], dtype=object)

    # keep only the two digits after the hyphen and convert them to int
    raw['season'] = raw['season'].apply(lambda x: int(x.split('-')[1]))
    raw['season'].unique()

    # output:
    array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 97,
           98, 99,  0], dtype=int64)
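
Note that the resulting integers are labels, not an ordered timeline: 1999-00 becomes 0 and 1998-99 becomes 99. A minimal sketch of the rule on a few sample strings follows, together with a variant that keeps the starting year. The variant is an assumption of this sketch and does not appear in the article.

```python
import pandas as pd

seasons = pd.Series(["1998-99", "1999-00", "2000-01", "2015-16"])

# The article's rule: keep the part after the hyphen.
suffix = seasons.apply(lambda s: int(s.split("-")[1]))

# Variant (not in the article): keep the starting year instead,
# which preserves chronological order.
start_year = seasons.str.split("-").str[0].astype(int)

print(suffix.tolist())      # [99, 0, 1, 16]
print(start_year.tolist())  # [1998, 1999, 2000, 2015]
```

If the column is one-hot encoded afterwards, as it is later in this article, the ordering does not matter and either form works.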

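A pandas DataFrame tip (matchup: the two teams facing each other; opponent: who the opponent is)

    # put the two columns side by side to compare them
    pd.DataFrame({'matchup': kobe.matchup, 'opponent': kobe.opponent})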

Checking whether two attributes are linearly related (distance computed from the location vs. shot distance)

    plt.figure(figsize=(5,5))

    plt.scatter(raw.dist, raw.shot_distance, color='blue')
    plt.title('dist and shot_distance')
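
A scatter plot shows the relationship visually; a Pearson correlation coefficient gives a numeric check of the same thing. The sketch below uses synthetic data, and the relation between the two arrays (one is the other divided by 10, plus noise) is a made-up assumption, not a measurement from data.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for raw.dist and raw.shot_distance.
dist = rng.uniform(0, 300, size=200)
shot_distance = dist / 10 + rng.normal(0, 0.5, size=200)

# Pearson correlation: values close to 1 or -1 indicate a near-linear relation.
r = np.corrcoef(dist, shot_distance)[0, 1]
print(round(r, 3))
```

When two features are this strongly correlated, one of them carries little extra information, which is consistent with `shot_distance` being among the columns dropped later.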

Grouping Kobe's shots by court area with pandas groupby

    gs = kobe.groupby('shot_zone_area')
    print(kobe['shot_zone_area'].value_counts())
    print(len(gs))

    # output:
    Center(C)                11289
    Right Side Center(RC)     3981
    Right Side(R)             3859
    Left Side Center(LC)      3364
    Left Side(L)              3132
    Back Court(BC)              72
    Name: shot_zone_area, dtype: int64
    6
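
A groupby object can also be iterated, yielding one (key, sub-DataFrame) pair per group; the plotting function in the next section relies on this. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy shots; only the column names follow the article.
shots = pd.DataFrame({
    "shot_zone_area": ["Center(C)", "Center(C)", "Left Side(L)", "Right Side(R)"],
    "loc_x": [0, 5, -150, 160],
})

gs = shots.groupby("shot_zone_area")
print(len(gs))  # number of distinct groups

# Each iteration yields the group key and the rows belonging to it.
for name, group in gs:
    print(name, len(group))

# Aggregations run per group.
print(gs["loc_x"].mean())
```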

Displaying the zone partitions by pairing groups and colors with zip

    import matplotlib.cm as cm
    plt.figure(figsize=(20,10))

    def scatter_plot_by_category(feat):
        alpha = 0.1
        gs = kobe.groupby(feat)
        # one rainbow color per group
        cs = cm.rainbow(np.linspace(0, 1, len(gs)))
        # zip pairs each (name, group) with its color
        for (name, group), c in zip(gs, cs):
            plt.scatter(group.loc_x, group.loc_y, color=c, alpha=alpha)

    # shot_zone_area
    plt.subplot(131)
    scatter_plot_by_category('shot_zone_area')
    plt.title('shot_zone_area')

    # shot_zone_basic
    plt.subplot(132)
    scatter_plot_by_category('shot_zone_basic')
    plt.title('shot_zone_basic')

    # shot_zone_range
    plt.subplot(133)
    scatter_plot_by_category('shot_zone_range')
    plt.title('shot_zone_range')

Dropping columns

    drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
             'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
             'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
    # drop all listed columns in one call; the positional axis argument (drop(name, 1)) no longer works in pandas 2.x
    raw = raw.drop(columns=drops)
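
When a notebook cell is re-run, some of the listed columns may already be gone and `drop` raises a KeyError. The sketch below, on a made-up frame with one deliberately missing name, shows the `errors="ignore"` option as one way around this.

```python
import pandas as pd

df = pd.DataFrame({"shot_id": [1, 2], "lat": [33.9, 34.0], "dist": [12.0, 3.5]})

# "not_a_column" is absent on purpose, to show the behaviour.
drops = ["shot_id", "lat", "not_a_column"]

# errors="ignore" skips names that are missing instead of raising KeyError.
df = df.drop(columns=drops, errors="ignore")
print(df.columns.tolist())  # ['dist']
```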

One-hot encoding (one column becomes several 0/1 columns; prefix sets the prefix of the new column names)

    print(raw['combined_shot_type'].value_counts())
    pd.get_dummies(raw['combined_shot_type'], prefix='combined_shot_type')[0:2]

    # output:
    Jump Shot    23485
    Layup         5448
    Dunk          1286
    Tip Shot       184
    Hook Shot      153
    Bank Shot      141
    Name: combined_shot_type, dtype: int64
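
To make the shape of the encoded result concrete, here is a minimal sketch on four made-up values. Each distinct value becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

s = pd.Series(["Jump Shot", "Layup", "Jump Shot", "Dunk"], name="combined_shot_type")

# One indicator column per distinct value, each named prefix + "_" + value.
dummies = pd.get_dummies(s, prefix="combined_shot_type")

print(dummies.columns.tolist())
print(dummies.sum().to_dict())  # how many rows fall into each category
```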

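After one-hot encoding a column, concatenate the new indicator columns onto the DataFrame, then drop the original column.

    categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
    for var in categorical_vars:
        # axis must be passed by keyword in pandas 2.x
        raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
        raw = raw.drop(columns=var)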
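
The encode, concatenate, and drop sequence can also be written as a single `pd.get_dummies` call with the `columns` argument. The sketch below compares both forms on a made-up frame; it is offered as an equivalent shortcut, not as the article's method.

```python
import pandas as pd

df = pd.DataFrame({
    "shot_type": ["2PT Field Goal", "3PT Field Goal", "2PT Field Goal"],
    "period": [1, 2, 1],
    "dist": [5.0, 24.0, 12.0],
})

# Loop form: encode, concatenate, then drop the source column.
looped = df.copy()
for var in ["shot_type", "period"]:
    looped = pd.concat([looped, pd.get_dummies(looped[var], prefix=var)], axis=1)
    looped = looped.drop(columns=var)

# Single-call form: get_dummies replaces the listed columns itself.
direct = pd.get_dummies(df, columns=["shot_type", "period"])

print(sorted(looped.columns) == sorted(direct.columns))  # True
```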

Summary

In summary, numpy, pandas, matplotlib, and sklearn together form a powerful toolkit for data analysis and preprocessing.

Qin Kaixin, Shenzhen, 201812081439
