Hence, I decided to do a project using tabular data to demonstrate the use of entity embeddings. The dataset I use is the IEEE-CIS Fraud Detection data from Kaggle, which you can find here.
Here is the step-by-step code (including Google Colab-specific code, as I worked on Colab).
First, to check the GPU allocated to you in Colab, you can run the following code.
!nvidia-smi
Next, mount your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
Download the dataset from Kaggle; you will need your Kaggle API token for this. If you need any help downloading datasets from Kaggle, this might help.
!mkdir /root/.kaggle
!echo '{"username":"USERNAME","key":"KEY"}' > /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c ieee-fraud-detection
# unzip all files
!unzip train_transaction.csv.zip
!unzip test_transaction.csv.zip
Then, read the CSV files into pandas DataFrames.
import numpy as np
import pandas as pd

train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")
As this is a fraud detection dataset, an imbalanced dataset shouldn’t be surprising.
train["isFraud"].mean() # 0.03499000914417313
As data exploration and feature engineering are not the purpose of this post, I will use a minimal set of features to predict the fraud label. To make sure you can replicate my code, here are my processing steps.
# generate time of day
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)

# drop columns
train.drop("TransactionDT", axis=1, inplace=True)
test.drop("TransactionDT", axis=1, inplace=True)

# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD", "addr1", "addr2", "P_emaildomain", "R_emaildomain", "Time of Day"] + [col for col in train.columns if "card" in col]

# set training and testing set
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()

# process cont_vars
# scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))

# reduce cardinality of categorical variables
idx_list = x_train["card1"].value_counts()[x_train["card1"].value_counts()<=100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list), "card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list), "card1"] = "Others"

# fill missing
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
After these processing steps are done, we can convert the categorical variables into integers.
# convert to numerical values for modelling
def categorify(df, cat_vars):
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories

def apply_test(test, categories):
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat], categories=categories[cat], ordered=True)

# convert to integers
categories = categorify(x_train, cat_vars)
apply_test(x_test, categories)
for cat in cat_vars:
    x_train[cat] = x_train[cat].cat.codes + 1
    x_test[cat] = x_test[cat].cat.codes + 1
Due to the highly imbalanced dataset, I have to artificially generate more fraud examples using a technique called Synthetic Minority Over-sampling Technique (SMOTE). The documentation can be found here.
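The resampling call itself is not shown in this excerpt, so here is a minimal sketch of how SMOTE from the imbalanced-learn package could be applied to the processed training set. The variable names x_train and y_train follow the code above; the random_state value is an illustrative assumption, not a setting taken from the post.

# a minimal sketch, assuming imbalanced-learn is installed (pip install imbalanced-learn)
# x_train and y_train come from the processing steps above
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)  # random_state chosen for reproducibility; assumption, not from the post
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)

# after resampling, the fraud rate should be close to 0.5 instead of ~0.035
print(y_train.mean(), y_train_res.mean())

Since most of the features here are integer-coded categorical variables, the SMOTENC variant (which accepts a categorical_features argument) may be the more appropriate choice; the plain SMOTE call above simply mirrors the technique named in the text.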