BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

栏目: 编程工具 · 发布时间: 6年前

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

admin GoogleCloud No comments

Source: BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost from Google Cloud

In this article, I’ll walk you through the process of building a machine learning model using BigQuery ML . As a bonus, we’ll have the chance to use BigQuery’s support for spatial functions.

We’ll use the New York City taxicab dataset , with the goal of predicting taxi fare, given both pick-up and drop-off locations for each ride — imagine that we are designing a trip planner.

Create a training dataset

The first step is to set up a machine learning dataset. In BigQuery, we simply write this query:

Note a few things about the query:

  1. The main part of the query is at the bottom: ( SELECT * from taxitrips )

  2. taxitrips does the bulk of the extraction for the NYC dataset, with the SELECT containing my training features and label.

  3. The WHERE removes data that I don’t want to train on.

  4. The WHERE also includes a sampling clause to pick up only 1/1000th of the data

  5. I define a variable called TRAIN so that I can quickly build an independent EVAL set. Note that BigQuery will automatically split the TRAIN data into two parts, and use one part of the training dataset to do things like early stopping and learning rate exploration. I am creating an independent evaluation dataset that I will not show to BigQuery during training.

Training the model

Once I have a query to create the training dataset, I can now train the model by prepending a few lines to the creation query:

Note a few things about the above query:

  1. CREATE model is a safe way to ensure that you don’t overwrite existing models. CREATE or REPLACE will … replace existing models.

  2. I specify my model type. Use linear_reg for regression problems and logistic_reg for classification problems.

  3. I specify that the total_fare column is the label.

  4. I ask that model training stop when the improvement is < 0.5% (this is optional, but shows you how to specify any optional parameters).

Running the query takes about 5 minutes on the 1-million row training dataset. Pause for a minute and take that in: it only takes 5 minutes to train an ML model on 1 million rows!

Evaluating the model

When the model is trained, the training loss is written out iteration-by-iteration to a table. We can plot it using Pandas (see my notebook on GitHub ):

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

The training loss is not especially interesting, though. What we want is to evaluate the model on an independent dataset. We can do that by changing the TRAIN to EVAL in the training dataset query and computing the RMSE ( root-mean-square error ) as follows:

The important idea here is that you run ML.PREDICT to pass in the trained model, and then issue a select statement consisting of the rows on which you want to evaluate. Since my label is called ‘ total_amount ’, ML.PREDICT will provide me a ‘ predicted_total_amount ’. I can use that to compute the RMSE.

In this case, my model returns a RMSE of $9.57. Can we do better?

Faceted evaluation

We can write a more sophisticated evaluation that computes the mean absolute percent error (MAPE) and group it by the taxi fare to see how the error varies with amount:

Plotting the MAPE by the original amount gives us:

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

As you can see, we have serious problems, because  our error increases quadratically on either side of the mean.

I think we can do better.

Feature engineering with spatial and temporal features

Let’s teach the model that the Euclidean distance between the pick-up and drop-off points is important. We can use the spatial distance as an input feature (BQ GIS and BQ Geo Viz are both currently in public alpha. To request access, fill out this form ):

Also, let’s allow the model to learn traffic patterns by creating a new feature that combines the time of day and day of week (this is called a feature cross). We can do that by:

CONCAT(dayofweek, CAST(hourofday AS STRING)) AS dayhr_fc

Finally, let’s feature cross the pick-up and drop-off locations so that the model can learn pick-up-drop-off pairs that will require tolls:

CONCAT(ST_AsText(ST_SnapToGrid(pickup, 0.1)),
       ST_AsText(ST_SnapToGrid(dropoff, 0.1))) AS loc_fc

This step takes the geographic point corresponding to the pickup point and grids to a 0.1-degree-latitude/longitude grid (approximately 8km x 11km in New York—we should experiment with finer resolution grids as well). Then, it concatenates the pickup and dropoff grid points to learn “corrections” beyond the Euclidean distance associated with pairs of pickup and dropoff locations.

Here’s the full query that runs all three of the above steps:

Notice also that I have greatly expanded the WHERE clause to limit the data to taxi-trips — data cleanup is very important!

The new model achieves a RMSE of $5.08, dropping the error by nearly 40%! Here is the training query and here is the evaluation query .

The faceted evaluation also shows that the new model has nearly constant MAPE by fare amount once we get into reasonably long rides (rides of less than $7.50 will presumably require finer feature crosses):

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

Mapping the evaluation results

Instead of grouping by the total amount, we can group by a spatial feature. Let’s look at how the taxi fare error varies depending on the drop-off point:

Essentially, I am computing the mean absolute percent error by grouping based on the dropoff gridpoint. I then plotted it using the BigQuery Geo Viz (you will get a link to the tool when your project gets whitelisted):

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

Essentially, I am computing the mean absolute percent error by grouping based on the dropoff gridpoint. I then plotted it using the BigQuery Geo Viz (you will get a link to the tool when your project gets whitelisted):

Filtering on frequent drop-off areas and adjusting the color scale, we get:

BigQuery ML and BigQuery GIS: used together to predict NYC taxi trip cost

The larger errors correspond to out-of-town trips to Westchester and Jersey. It appears that such trips incur surcharges that the model hasn’t learned.

To learn more

  1. Check out my notebook that includes full code on GitHub . (also includes full workflow, graphs, etc.)

  2. The training query (uses CREATE MODEL )

  3. The evaluation query (uses ML.EVALUATE )

  4. The faceted evaluation (uses ML.PREDICT )

Enjoy!

除非特别声明,此文章内容采用 知识共享署名 3.0 许可,代码示例采用 Apache 2.0 许可。更多细节请查看我们的 服务条款

Tags: Cloud


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

CSS揭秘

CSS揭秘

[希] Lea Verou / CSS魔法 / 人民邮电出版社 / 2016-4 / 99.00元

本书是一本注重实践的教程,作者为我们揭示了 47 个鲜为人知的 CSS 技巧,主要内容包括背景与边框、形状、 视觉效果、字体排印、用户体验、结构与布局、过渡与动画等。本书将带领读者循序渐进地探寻更优雅的解决方案,攻克每天都会遇到的各种网页样式难题。 本书的读者对象为前端工程师、网页开发人员。一起来看看 《CSS揭秘》 这本书的介绍吧!

JS 压缩/解压工具
JS 压缩/解压工具

在线压缩/解压 JS 代码

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码