The Right Way to Use Deep Learning for Tabular Data | Entity Embedding

Hence, I decided to do a project using tabular data to demonstrate the use of entity embeddings. The dataset I use is the IEEE-CIS Fraud Detection data from Kaggle, which you can find here.

Here is the step-by-step code (including Google Colab-specific code, as I worked on Colab).

First, to check the GPU allocated to you in Colab, you can run the following code.

!nvidia-smi

Next, to mount my Google Drive:

from google.colab import drive
drive.mount('/content/drive')

Download the dataset from Kaggle; you will need your Kaggle API token for this. If you need any help downloading datasets from Kaggle, this might help.

!mkdir /root/.kaggle
!echo '{"username":"USERNAME","key":"KEY"}' > /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c ieee-fraud-detection

# unzip all files
!unzip train_transaction.csv.zip
!unzip test_transaction.csv.zip

Then, read the CSV files into pandas DataFrames (importing pandas and NumPy first, as they are used throughout).

import numpy as np
import pandas as pd
train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")

As this is a fraud detection dataset, having imbalanced data shouldn’t be surprising.

train["isFraud"].mean() # 0.03499000914417313

As data exploration and feature engineering are not the purpose of this post, I will use a minimal set of features to predict the fraud label. To make sure you can replicate my code, here are my processing steps.

# generate time of day
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)

# drop columns
train.drop("TransactionDT", axis=1, inplace=True)
test.drop("TransactionDT", axis=1, inplace=True)

# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD","addr1","addr2","P_emaildomain","R_emaildomain","Time of Day"] + [col for col in train.columns if "card" in col]

# set training and testing set
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()

# process cont_vars
# scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))

# reduce cardinality of categorical variables
idx_list = x_train["card1"].value_counts()[x_train["card1"].value_counts() <= 100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list), "card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list), "card1"] = "Others"

# fill missing
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
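As an optional sanity check (my addition, not in the original post), you can confirm that the cardinality reduction and the missing-value fill behaved as expected before encoding:

# optional check: per-column cardinality and remaining missing values
x_train[cat_vars].nunique()      # "card1" should now have far fewer levels
x_train[cat_vars].isna().sum()   # should be all zeros after fillna("Missing")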

Once the processing steps are done, we can convert the categorical variables into integers.

# convert to numerical values for modelling
def categorify(df, cat_vars):
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories

def apply_test(test, categories):
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat], categories=categories[cat], ordered=True)

# convert to integers
categories = categorify(x_train, cat_vars)
apply_test(x_test, categories)

for cat in cat_vars:
    x_train[cat] = x_train[cat].cat.codes + 1
    x_test[cat] = x_test[cat].cat.codes + 1
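A side note on the +1 shift (my addition, not from the original post): pandas assigns the code -1 to any test value that is not among the training categories, so after adding 1 those unseen values become 0 rather than colliding with a real category, which is convenient when these integers later index an embedding matrix. A quick optional check:

# optional check: unseen test categories map to 0 after the +1 shift
for cat in cat_vars:
    assert x_train[cat].min() >= 1   # training codes start at 1
    assert x_test[cat].min() >= 0    # unseen or missing test values become 0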

Due to the highly imbalanced dataset, I have to artificially generate more fraud data using a technique called the Synthetic Minority Over-sampling Technique (SMOTE). The documentation can be found here.
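The post does not show the oversampling call at this point, so here is a minimal sketch of how SMOTE could be applied to the processed features, assuming the imbalanced-learn (imblearn) package and the x_train / y_train variables prepared above; the random_state is an arbitrary illustrative choice, not the author's setting.

# minimal SMOTE sketch (illustrative, assumes imbalanced-learn is installed)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)   # arbitrary seed for reproducibility
x_train_res, y_train_res = smote.fit_resample(x_train, y_train)
y_train_res.mean()               # roughly 0.5 after oversampling the minority class

One caveat: plain SMOTE interpolates between the integer category codes as if they were continuous, so for mixed continuous/categorical data SMOTENC (also in imbalanced-learn) is often preferred; the sketch above simply mirrors the technique named in the post.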

