The Right Way to Use Deep Learning for Tabular Data | Entity Embedding



Hence, I decided to do a project using tabular data to demonstrate the use of entity embeddings. The dataset I use is the IEEE-CIS Fraud Detection data from Kaggle, which you can find here.

Here is the step-by-step code (including Google Colab-specific code, as I worked on Colab).

First, to check the GPU allocated to you in Colab, you can run the following command.

!nvidia-smi

Next, to mount my Google Drive:

from google.colab import drive
drive.mount('/content/drive')

Download the dataset from Kaggle; you will need your Kaggle API token for this. If you need any help downloading datasets from Kaggle, this might help.

!mkdir /root/.kaggle
!echo '{"username":"USERNAME","key":"KEY"}' > /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c ieee-fraud-detection

# unzip all files
!unzip train_transaction.csv.zip
!unzip test_transaction.csv.zip

Then, read the CSV files into pandas DataFrames:

import pandas as pd

train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")

As this is a fraud detection dataset, imbalanced classes shouldn't be surprising.

train["isFraud"].mean() # 0.03499000914417313

As data exploration and feature engineering are not the purpose of this post, I will use a minimal set of features to predict the fraud label. To make sure you can replicate my code, here are my processing steps.

import numpy as np

# generate time of day
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)

# drop columns
train.drop("TransactionDT", axis=1, inplace=True)
test.drop("TransactionDT", axis=1, inplace=True)

# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD", "addr1", "addr2", "P_emaildomain", "R_emaildomain",
            "Time of Day"] + [col for col in train.columns if "card" in col]

# set training and test sets (note: the Kaggle test set has no "isFraud" labels)
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = test[cont_vars + cat_vars].copy()

# process cont_vars: scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))

# reduce cardinality of categorical variables: group rare card1 values
counts = x_train["card1"].value_counts()
idx_list = counts[counts <= 100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list), "card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list), "card1"] = "Others"

# fill missing values
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")

With the processing steps done, we can now convert the categorical variables into integers.

# convert to numerical values for modelling
def categorify(df, cat_vars):
    # encode each categorical column and record its training-set categories
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories

def apply_test(test, categories):
    # re-encode the test set using the categories learned from the training set
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat], categories=categories[cat], ordered=True)

# convert to integers
categories = categorify(x_train, cat_vars)
apply_test(x_test, categories)
for cat in cat_vars:
    x_train[cat] = x_train[cat].cat.codes + 1
    x_test[cat] = x_test[cat].cat.codes + 1
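A note on the +1 shift: pandas assigns the code -1 to values that are missing or absent from the training-set categories, so shifting every code by one maps those to 0 and keeps all indices non-negative, which is convenient for an embedding layer's lookup table. A quick illustration with hypothetical values:

# hypothetical example: a value unseen at training time gets code -1
s = pd.Categorical(["discover"], categories=["amex", "visa"], ordered=True)
print(s.codes)      # [-1]
print(s.codes + 1)  # [0], reserved for unknown/missing values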

Because the dataset is highly imbalanced, I have to artificially generate more fraud data using a technique called the Synthetic Minority Over-sampling Technique (SMOTE). The documentation can be found here.
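As a minimal sketch of how SMOTE could be applied here with the imbalanced-learn library (the sampling_strategy and random_state values below are my illustrative assumptions, not settings from the original):

# minimal SMOTE sketch using imbalanced-learn; parameter values are illustrative
from imblearn.over_sampling import SMOTE

# sampling_strategy=0.5 (assumed) requests one fraud sample per two non-fraud samples
sm = SMOTE(sampling_strategy=0.5, random_state=42)
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)
print(y_train_res.mean())  # fraud rate after oversampling (about 0.33 here)

Note that SMOTE interpolates in feature space, so it should only be run after all features have been converted to numeric values, as done above.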

