Original article can be found here (source): Deep Learning on Medium
Imagine a single model combining the power of deep learning and the interpretability of statistics
Regression and classification problems are usually tackled with one of two kinds of model: a simple statistical model or a machine learning model. Despite its relative simplicity, the former has obvious interpretability benefits (e.g. the significance of the variables used). In contrast, the latter is often more powerful, but it is commonly referred to as a black-box model due to its opacity.
The performance of any given model is directly linked to its degree of modularity. Higher modularity usually means higher overall accuracy (let us omit overfitting for a second). However, higher modularity also means that we lose track of how a given variable explains the target. In other words, performance and interpretability seem to be opposed by their very nature. But what if that wasn't always true?
In this article, I will first build a powerful deep learning model from scratch and then tune it in a way that allows us to interpret whether some explanatory variables can be said to significantly (positively or negatively) impact the target variable. This model will be denoted as the SDL model (for significant deep learning) for the rest of this article. Concepts such as feedforwarding information and backpropagating errors (through gradient descent) will be used. If you are not yet familiar with them, introductory articles on neural networks and deep learning might help you follow the rest of this article.
Loading the data used to train the SDL model
The model will be trained on publicly available data. Opting for a classification problem, the sign of the monthly logarithmic return of the IBM stock was chosen as the target variable. The Fama & French five factors will be used as explanatory variables. These variables are Mkt-rf (the return of the market in excess of the risk-free rate), SMB (small minus big: the difference in performance between the smallest and the largest companies by market capitalization), HML (high minus low: the difference in performance between firms with the highest and those with the lowest book-to-market ratios), RMW (the difference in performance between the most profitable and the least profitable firms) and CMA (the difference in performance between firms that invest conservatively and those that invest aggressively). These datasets can be found on my Github. First import the pandas and NumPy packages, then run the following code to load all the data in the right format.
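The loading code itself depends on the files in the author's repository. As a minimal sketch of the target construction only, assuming a hypothetical series of month-end closing prices (the numbers below are made up, not IBM data), the sign of the monthly logarithmic return could be derived as follows. Treating a zero return as the negative class is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end closing prices (illustrative values, not real IBM data).
prices = pd.Series([100.0, 104.0, 101.0, 101.0, 108.0], name="close")

# Monthly logarithmic return: ln(P_t / P_{t-1}); the first month has no return.
log_ret = np.log(prices / prices.shift(1)).dropna()

# Binary target: 1 when the return is strictly positive, 0 otherwise (assumption).
y = (log_ret > 0).astype(int)
```

The resulting `y` has one fewer observation than `prices`, so the factor data would need to be aligned on the same months before training.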
Understanding the broad structure of the SDL model
Let us now have a closer look at the logical structure of the SDL model. A good starting point is to see it either as a combination of several artificial neural networks (ANNs) or as a single ANN in which some weights are constrained to zero. To account for possible autocorrelation, we use, for each of the aforementioned variables, the regressor value at time t as well as its first four lagged values. A first ANN with one hidden layer can then be applied to each group of five inputs related to the same variable (represented in blue, orange and yellow in the figure below). After optimization (see below), each of these sub-networks produces a one-dimensional output. The five outputs obtained can then be used as the input layer of a last ANN (represented in green) with no hidden layer (i.e., a simple logistic regression), whose y-label is equal to one or zero depending on the sign of the IBM stock return.
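The lagged inputs described above (the value at time t plus its first four lags, for each of the five factors) can be sketched with pandas. The helper name `add_lags`, the column naming scheme and the toy values are assumptions made for illustration.

```python
import pandas as pd

def add_lags(factors: pd.DataFrame, n_lags: int = 4) -> pd.DataFrame:
    """Return each column at time t followed by its first n_lags lagged values."""
    cols = {}
    for name in factors.columns:
        for k in range(n_lags + 1):
            # shift(k) moves the series down by k rows, i.e. the value at t-k.
            cols[f"{name}_lag{k}"] = factors[name].shift(k)
    # The first n_lags rows lack a full history, so they are dropped.
    return pd.DataFrame(cols).dropna()

# Toy values standing in for the five Fama & French factor series.
names = ["Mkt-RF", "SMB", "HML", "RMW", "CMA"]
factors = pd.DataFrame({n: [float(i + j) for i in range(8)] for j, n in enumerate(names)})

X = add_lags(factors)  # 5 variables x 5 time steps = 25 columns
```

Keeping the five columns of a given variable next to each other makes it easy to route each group to its own sub-network later on.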
It is however not possible to build the final model by merely combining these sub-models. At the start, the optimal values of the five neurons that form the outputs of the first five ANNs are unknown. It is therefore impossible to adjust the weights of these sub-networks to reduce their cost function, as this function is itself computed from the difference between the actual values and these unknown optimal values at the output layer. To overcome this issue, we will build the SDL model as a single supervised model with five times five features (the current and lagged values of the Fama & French five factors) and one output (the sign of the IBM stock return). At the same time, constraints must be added on most of the weights for the model to respect the logic explained above. In the figure below, the dashed lines represent the weights that are constrained to zero.
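One way to encode these zero constraints is with 0/1 masks that are multiplied element-wise with the weight matrices. The sketch below assumes two hidden neurons per variable (which is consistent with a first weight matrix of size 25*10) and that the 25 input columns are grouped by variable; neither the mask approach nor these names are confirmed to be the author's.

```python
import numpy as np

n_vars, n_lags, n_hidden = 5, 5, 2  # n_hidden per variable is an assumption

# The Kronecker product of an identity matrix with a block of ones yields a
# block-diagonal mask: each variable only connects to its own hidden neurons.
mask1 = np.kron(np.eye(n_vars), np.ones((n_lags, n_hidden)))  # shape (25, 10)
mask2 = np.kron(np.eye(n_vars), np.ones((n_hidden, 1)))       # shape (10, 5)

# Applying the mask after random initialization zeroes the constrained weights.
rng = np.random.default_rng(0)
W1 = rng.uniform(-15, 15, size=mask1.shape) * mask1
```

Multiplying the gradients by the same masks during training keeps the constrained weights at zero after every update.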
Implementing the SDL model in Python
Step 1: Initialization
Let us now move on to the Python implementation. Before the initialization of the NeuralNetwork class, which is presented below, a few necessary steps have been performed (a train-test split and some data pre-processing to make sure that the data is in the right format). Feel free to check this code on my Github as well. We initialize a new model with four arguments. X and y represent the training input matrix and the training target vector, respectively. Lamb1 is the regularization parameter, similar to the penalty used by Ridge in linear regressions, and learningr is the learning rate of the model (how strongly it corrects the weights when backpropagating the errors). All the unconstrained weights are then initialized with random numbers between -15 and 15. This wide range is used to introduce substantial variability. One last thing worth mentioning: the output vector (initialized with zeros) will hold the results obtained after feedforwarding the inputs through the model. An error vector will be built as the difference between this vector and the y-vector.
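A minimal sketch of such an initialization, not the author's code, could look as follows. The argument names X, y, lamb1 and learningr come from the text; the attribute names, the optional `seed` argument, the layer sizes beyond the first weight matrix and the mask-based handling of the constrained weights are assumptions.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, X, y, lamb1, learningr, seed=None):
        self.X = np.asarray(X, dtype=float)                 # training inputs, shape (n, 25)
        self.y = np.asarray(y, dtype=float).reshape(-1, 1)  # training targets, shape (n, 1)
        self.lamb1 = lamb1          # regularization parameter
        self.learningr = learningr  # learning rate
        rng = np.random.default_rng(seed)

        # 0/1 masks marking the weights that are NOT constrained to zero
        # (assumes two hidden neurons per variable).
        self.mask1 = np.kron(np.eye(5), np.ones((5, 2)))  # (25, 10)
        self.mask2 = np.kron(np.eye(5), np.ones((2, 1)))  # (10, 5)

        # Unconstrained weights are drawn uniformly from [-15, 15).
        self.weights1 = rng.uniform(-15, 15, (25, 10)) * self.mask1
        self.weights2 = rng.uniform(-15, 15, (10, 5)) * self.mask2
        self.weights3 = rng.uniform(-15, 15, (5, 1))

        # Placeholder for the predictions produced by the feedforward pass.
        self.output = np.zeros(self.y.shape)
```

For example, `NeuralNetwork(X_train, y_train, lamb1=0.15, learningr=0.3)` would create a model in which only 50 of the 250 entries of the first weight matrix are free to move.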
Step 2: Feedforward
The next step of the reasoning consists of feedforwarding the inputs through the layers. To start with, the matrix of inputs (size n * 25, with n being the length of the dataset) is multiplied by the first matrix of weights (size 25*10). Each neuron of the first hidden layer then takes the value of the sigmoid function evaluated at the corresponding element of this matrix product. The same method is used to progress through the following layers. Everything is summed up mathematically here below.
In Python, the feedforward method within our NeuralNetwork class can be implemented in the following way.
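A minimal sketch of the feedforward pass, written as a standalone function rather than as the author's method, is shown here. It assumes three weight matrices of sizes 25*10, 10*5 and 5*1 and no bias terms (the text does not mention any).

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(X, W1, W2, W3):
    """Propagate the inputs through the two hidden layers and the output layer."""
    layer1 = sigmoid(X @ W1)       # (n, 25) @ (25, 10) -> (n, 10)
    layer2 = sigmoid(layer1 @ W2)  # (n, 10) @ (10, 5)  -> (n, 5)
    output = sigmoid(layer2 @ W3)  # (n, 5)  @ (5, 1)   -> (n, 1)
    return layer1, layer2, output
```

The intermediate layers are returned alongside the output because the backpropagation step needs them to apply the chain rule.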
Step 3: Error backpropagation
Once the output layer has been obtained, a cost function can be computed. It gauges how far the predicted output is from the actual target. The next step of the reasoning consists in calculating the derivative of this cost function with respect to the different weights (starting with weights 3, then weights 2 and finally weights 1), in order to understand which part of the cost is attributable to which weight. It is then possible to update the weights in a way that reduces the cost function: only a fraction of the derivative, set by the chosen learning rate, is subtracted from the current weight. The first mathematical steps are summed up here below. The calculations are based on the chain rule, and the symbol .* is used to denote element-wise matrix multiplication. Note that, from the second hidden layer onwards, we start the reasoning again with the marginal elements of the error vector.
The most relevant lines of code are presented here below. Please note that once again, many lines of lesser importance are not shown here but are available on my Github.
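A minimal sketch of one backpropagation step is given here; it is not the author's listing. It assumes a squared-error cost with a ridge-like penalty, no bias terms, and 0/1 masks (`mask1`, `mask2`) that keep the constrained weights at zero. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2, W3):
    l1 = sigmoid(X @ W1)
    l2 = sigmoid(l1 @ W2)
    return l1, l2, sigmoid(l2 @ W3)

def cost(X, y, W1, W2, W3, lamb):
    """Squared error plus L2 penalty (the exact cost is an assumption)."""
    n = X.shape[0]
    out = forward(X, W1, W2, W3)[2]
    penalty = sum((W ** 2).sum() for W in (W1, W2, W3))
    return ((out - y) ** 2).sum() / (2 * n) + lamb * penalty / (2 * n)

def backprop_step(X, y, W1, W2, W3, mask1, mask2, lamb, lr):
    """One gradient-descent update of the three weight matrices."""
    n = X.shape[0]
    l1, l2, out = forward(X, W1, W2, W3)

    # Chain rule, from the output back to the inputs; * is element-wise.
    d3 = (out - y) * out * (1 - out)          # (n, 1)
    g3 = (l2.T @ d3 + lamb * W3) / n          # gradient w.r.t. weights 3
    d2 = (d3 @ W3.T) * l2 * (1 - l2)          # (n, 5)
    g2 = (l1.T @ d2 + lamb * W2) / n * mask2  # gradient w.r.t. weights 2
    d1 = (d2 @ W2.T) * l1 * (1 - l1)          # (n, 10)
    g1 = (X.T @ d1 + lamb * W1) / n * mask1   # gradient w.r.t. weights 1

    # Only a fraction of each gradient (the learning rate) is subtracted.
    return W1 - lr * g1, W2 - lr * g2, W3 - lr * g3
```

Masking the gradients means a weight that starts at zero never receives an update, which preserves the separation between the five sub-networks.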
Step 4: Optimization and choice of the best values for the lambda and learning rate parameters
The last step of the reasoning consists of repeating step 2 and step 3 (i.e., feedforwarding information with constantly evolving weights and backpropagating errors). This process stops when the cost function no longer decreases. We used a simplified cross-validation to select the best values of the lambda and learning rate parameters (i.e., computing the classification accuracy on a given test set for every (lambda, learning rate) pair tested, then keeping the pair for which the model performs best). On the financial data introduced earlier, the best classification accuracy on the test set (around 70%; some IBM idiosyncratic variables could be added to increase this number) is obtained with the tuple (lambda, learning rate) equal to (0.15, 0.3).
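The selection of the (lambda, learning rate) pair can be sketched as a plain grid search. The callable `fit_and_score` is hypothetical: it is assumed to train a model with the given pair and return its classification accuracy on the test set.

```python
from itertools import product

def select_hyperparameters(fit_and_score, lambdas, learning_rates):
    """Return (accuracy, lambda, learning_rate) for the best-scoring pair."""
    best = None
    for lamb, lr in product(lambdas, learning_rates):
        acc = fit_and_score(lamb, lr)  # hypothetical: train, then score on the test set
        if best is None or acc > best[0]:
            best = (acc, lamb, lr)
    return best
```

Because the same test set is used both to pick the pair and to report the accuracy, the reported figure is likely to be somewhat optimistic; a separate validation set would avoid this.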
Step 5: Significance of the variables
The SDL model has now been shown to perform well: most of the time, it correctly classifies the sign of the IBM return based on both present and past values of market-related variables. Half of the job is done!
The following methodology makes it possible to interpret how the chosen variables help in explaining the probability of an IBM return being positive. You may have noticed that our SDL model is, in essence, a combination of two models. The first one is a combination of five neural networks with one hidden layer. The way the SDL model is optimized ensures that the value at the final neuron of each of these sub-models is the result of the optimal transformation of the five temporal inputs of the same exogenous variable (this might not hold for every single data row, but it does when considering all the data). These optimally transformed variables are then used as inputs in the second model component (the last neural network with no hidden layer, basically a logistic regression). A certain significance can be derived from this second model. For every exogenous variable, the estimator used for that purpose is the mean of the vector obtained by multiplying the last neuron value of the associated sub-model (a vector) by the weight applied to this last neuron (a scalar). This gives an idea of the average value that is plugged into the sigmoid function to eventually estimate the probability of the IBM stock return being positive. This estimate should be accurate and not only the result of a lucky convergence of the SDL model. A non-parametric distribution for this estimator is therefore derived, built upon the set of values obtained by running the same code 100 times. If the 2.5% and 97.5% quantiles of this distribution are either both positive or both negative, it can be concluded that the exogenous variable is significant (α = 5%), with a positive or negative effect, respectively.
As illustrated below, the variable Mkt-rf is positively associated with the sign of the IBM return, and this holds at any confidence level since the whole distribution is greater than zero. In other words, the greater the market spread, the more likely the IBM return is to be positive, which seems logical. Similarly, the variable CMA seems to negatively impact the probability of an IBM return being positive. In other words, the greater the difference in performance between firms that invest conservatively and those that invest aggressively, the lower (ceteris paribus) the probability of a positive IBM return. Due to the long right tail, this relationship is significant only at confidence levels 1 - α with α > 1%.
The code for this section is available here below.
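As a minimal sketch of the quantile test described in this section (not the author's code), the estimator for one run and the decision rule over many runs could be written as follows. The function names are hypothetical, and `estimates` is assumed to hold one value per independent training run (100 in the text).

```python
import numpy as np

def contribution(last_neuron_values, weight):
    """Mean of (sub-model output vector * its scalar weight in the last layer)."""
    return float(np.mean(np.asarray(last_neuron_values) * weight))

def significance(estimates, alpha=0.05):
    """Classify a variable from the empirical distribution of its estimator."""
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    if lower > 0:
        return "positive"   # both quantiles above zero
    if upper < 0:
        return "negative"   # both quantiles below zero
    return "not significant"
```

With `alpha=0.05` the rule compares the 2.5% and 97.5% quantiles, so a variable is flagged only when at least 95% of the runs agree on the sign of its contribution.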
Conclusion and known limitations of the model
To sum up, the results obtained while running the Python implementation of the SDL model show that performance and interpretability can, to a certain extent, be combined. The performance is associated with the proper IBM stock return classification and the interpretability is associated with the positive or negative significance linked to the exogenous variables used.
However, this methodology is not perfect. For instance, the same model has to be run many times to obtain the distribution of the final estimator, which is costly in both time and CPU. From a methodological point of view, the cross-validation used to select the best parameter values is only performed on one sample. Lastly, it could be argued that the significance derived in the last step might no longer be accurate when the inputs can take both positive and negative values, but that can be averted with a proper initial standardization of the input values.
This is only the starting point of my research on this innovative significant deep learning model. I am looking forward to feedback and thoughts on how to bring this research to the next level, and I would love to discuss any ideas you might have, about the model explained in this article or about other logical structures that could reconcile performance and interpretability.