Cage Match: XGBoost vs. Keras Deep Learning


Ever since I had my first taste of deep learning I have been interested in applying it to structured, tabular data. I have written several articles on the subject and I am writing a book on Deep Learning with Structured Data for Manning Publications. It would be great to tackle problems with structured tabular data by harnessing deep learning’s flexibility and potential for reduced feature engineering.

The idea of using deep learning on tabular data is not without its critics. A consistent objection I have heard is that non-deep-learning approaches, XGBoost in particular, are simpler to code, easier to interpret, and perform better. I decided I needed to put this assertion to the test with the major example from my book: predicting delays on the Toronto streetcar network. The city of Toronto publishes a dataset that describes every streetcar delay since January 2014. The challenge is to use this dataset to train a machine learning model that can predict whether a given streetcar trip will be delayed.

To illustrate the key points of the book, I created a deep learning approach to the streetcar delay prediction problem using a Keras functional model. This solution includes a set of modules to clean up the data, build and train the model, and deploy the trained model. To make a fair comparison between the two machine learning approaches, my goal was to swap the Keras deep learning model for XGBoost with minimal changes to the rest of the code. Imagine that the whole solution, from ingestion of the raw data to deployment of the trained model, is a car. I wanted to replace the car's engine (the machine learning model) without altering the bodywork, electrical system, interior, or any other aspect of the car.

[Figure: Swapping out the engine and leaving the rest of the car unchanged (illustration by author)]

I was pleasantly surprised by how easy it was to replace the Keras deep learning model with XGBoost. The following sections describe the steps I took to convert the notebook that contains the code to train the Keras model into a notebook that trains an XGBoost model.

Refactor the data used to train and test the model

The deep learning model is a multi-input Keras functional model that expects to be trained on a list of numpy arrays, as shown in the following snippet:
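
Below is a minimal, self-contained sketch of the pattern, with a tiny two-input model and random placeholder data that are purely illustrative:

```python
import numpy as np
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

# Illustrative two-input functional model (the real model has one input
# per column of the tabular dataset).
in_a = Input(shape=(1,))
in_b = Input(shape=(1,))
merged = Concatenate()([in_a, in_b])
hidden = Dense(8, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(hidden)
model = Model(inputs=[in_a, in_b], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Keras expects a *list* of numpy arrays, one array per model input.
X_train_list = [np.random.rand(100, 1), np.random.rand(100, 1)]
y_train = np.random.randint(0, 2, size=100)  # 1 = delay, 0 = no delay
model.fit(X_train_list, y_train, epochs=2, batch_size=32)
```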

In contrast, the XGBoost model expects to be trained on a numpy array of lists. I needed to convert the training and test data from the format expected by Keras into the format expected by XGBoost. First, I converted the test and train datasets from lists of numpy arrays into lists of lists:
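
A sketch of that conversion, assuming the Keras-format inputs live in lists named X_train_list and X_test_list (illustrative names):

```python
# Convert each dataset from a list of numpy arrays to a list of lists.
X_train_lists = []
for arr in X_train_list:
    X_train_lists.append(arr.tolist())

X_test_lists = []
for arr in X_test_list:
    X_test_lists.append(arr.tolist())
```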

I cringed a bit at using a for loop to do this — I am sure there is a more Pythonic way — but this cell ran quickly enough and I wanted to have code that was easily readable.

Next, I converted each of the lists of lists from the previous step into a numpy array of lists, transposed to get the correct organization of the data:
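
Continuing the sketch with numpy:

```python
import numpy as np

# Stack the lists of lists into a 2-D array, then transpose so that each
# row holds all the feature values for a single example.
X_train_xgb = np.array(X_train_lists).transpose()
X_test_xgb = np.array(X_test_lists).transpose()
```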

The output of these transformations is the data in the form we want for XGBoost — a numpy array of lists:
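
A quick way to confirm the result, using the hypothetical names from the sketches above:

```python
print(X_train_xgb.shape)  # (number of examples, number of features)
print(X_train_xgb[0])     # one row: every feature value for one example
```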

The following diagram shows how the values from the original form of the data (a list of numpy arrays) end up in the target form of the data (a numpy array of lists):

[Figure: Translation of data from the format required by Keras to the format required by XGBoost]

Train and apply the XGBoost model

Now that I had the data in the format required by XGBoost, I was ready to train the XGBoost model. The following snippet shows the code to train and save the model:
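
A sketch under two assumptions: the scikit-learn-style XGBClassifier API, and pickle for saving the model. The file name, variable names, and the formula for one_weight are illustrative.

```python
import pickle

from xgboost import XGBClassifier

# Illustrative positive-class weight: the ratio of negative to positive
# examples (roughly 98/2 here), so the rare delay class counts more.
one_weight = float((y_train == 0).sum()) / float((y_train == 1).sum())

# In the scikit-learn API, scale_pos_weight is set when the classifier
# is constructed.
xgb_model = XGBClassifier(scale_pos_weight=one_weight)
xgb_model.fit(X_train_xgb, y_train)

# Save the trained model so the deployment modules can load it later.
with open("xgb_model.pkl", "wb") as f:
    pickle.dump(xgb_model, f)
```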

I used a single non-default parameter for the XGBoost model: setting scale_pos_weight to one_weight. This parameter let me account for the imbalance in the dataset between the negative case (no streetcar delay) and the positive case (streetcar delay); only about 2% of the records in the dataset represent streetcar delays. The scale_pos_weight value for XGBoost is identical to the value used in the fit statement for the Keras model, where the "1" entry of the class_weight parameter is set to one_weight, as shown in the following snippet:
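
(A sketch of that fit call; the epochs and batch size are illustrative.)

```python
model.fit(
    X_train_list,
    y_train,
    epochs=50,                             # illustrative values
    batch_size=1000,
    class_weight={0: 1.0, 1: one_weight},  # upweight the rare delay class
)
```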

Next, I applied the trained model to the test set to get its predictions.
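
With the hypothetical names above, the prediction step would look something like this:

```python
# Apply the trained model to the test set.
y_pred = xgb_model.predict(X_test_xgb)        # hard 0/1 predictions
y_prob = xgb_model.predict_proba(X_test_xgb)  # per-class probabilities
```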

And finally I assessed the accuracy of the XGBoost model:
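
A sketch using scikit-learn metrics, where y_test holds the test-set labels; on a roughly 98/2 class split, accuracy alone is misleading, so recall and the confusion matrix are worth printing alongside it:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

print("accuracy:", accuracy_score(y_test, y_pred))
print("recall:  ", recall_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```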

Comparing the XGBoost and Keras Results

Now that we have results for the trained XGBoost model, we can compare the overall characteristics of the solution using Keras deep learning with the solution using XGBoost. The following table summarizes the results:

[Table: XGBoost vs. Keras result summary]

Let’s look at each comparison category in a bit more detail:

  • XGBoost is the winner for performance, especially recall. Recall (true positives / (true positives + false negatives)) is critical for the use case of predicting streetcar delays: we want to minimize the model predicting no delay when there is going to be a delay (false negatives). If the model predicts a delay and there is no delay (a false positive), the user may end up walking to their destination or taking a taxi instead of a streetcar; the impact is not that bad, because the user still stands a decent chance of getting to their destination on time. With a false negative (the model predicts no delay when there is a delay), the impact is worse, because the user will likely take the streetcar and risk being late. Thus recall is critical for the streetcar delay prediction problem, and XGBoost has clearly better recall results.
  • Training time is a draw. On a local system with no GPUs and with a limited number of iterations, XGBoost has a faster training time. However, the training time for Keras varies widely from run to run and depends on the early stopping callback's patience parameter, which controls how many epochs training continues once the target performance measurement, such as validation accuracy, stops improving (see the sketch after this list). Because the training time for Keras varies so much, I am calling this category inconclusive.
  • Code complexity is a draw. The Keras model has more complex code to build the layers of the functional model. However, as shown in the section above on refactoring the data, XGBoost requires additional code to transform the data into the form it expects. Because Keras has more complex model-building code and XGBoost requires additional data preparation code, I am also calling this category a draw.
  • Keras is the winner for flexibility. The streetcar delay prediction problem is the subject of the extended example in the book Deep Learning with Structured Data, but the intention is that the code could be applied to a broad variety of structured tabular datasets. In particular, if a column of the tabular dataset is identified as a free-form text column (for example, a description of an item on a retail site), then the Keras model is automatically generated with layers to handle such a column. XGBoost does not have this ability to handle tabular datasets with continuous, categorical, and free-form text columns. I am asserting that the Keras approach has superior flexibility because it can handle a wider variety of tabular datasets.
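
As referenced in the training-time bullet above, here is a minimal sketch of setting a patience value on a Keras EarlyStopping callback; the monitored metric, patience value, and variable names are assumptions, not the book's exact configuration:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the monitored metric has not improved for
# `patience` consecutive epochs; larger patience means longer (and more
# variable) training runs.
early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)
model.fit(X_train_list, y_train,
          validation_data=(X_valid_list, y_valid),
          epochs=100, callbacks=[early_stop])
```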

Conclusions

In this article, I have described a comparison of two solutions to the streetcar delay prediction problem: one using XGBoost as the model, and the other using a Keras deep learning model. In this comparison I have kept the code for the two solutions as close as possible; I have only changed the parts of the code specifically related to the training and testing of the model. The results of the comparison show that XGBoost is better “out of the box” on raw performance, especially recall, and that Keras deep learning is more flexible.

Following are links to the code and initial dataset described in this article:

