Dating Algorithms using Machine Learning and AI

Preparing the Profile Data

To begin, we must first import all the necessary libraries for this clustering algorithm to run properly. We will also load in the Pandas DataFrame we created when we forged the fake dating profiles.
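A minimal sketch of this setup is below, assuming the fake profiles were saved to a pickle file named profiles.pkl (the file name and format are assumptions; substitute however you stored the DataFrame):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA

# Loading the fake dating profiles (file name is an assumption)
df = pd.read_pickle("profiles.pkl")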

The DataFrame containing all our data for each fake dating profile

With our dataset good to go, we can begin the next step for our clustering algorithm.

Scaling the Data

The next step, which will help our clustering algorithm’s performance, is scaling the dating categories (Movies, TV, religion, etc.). Scaling will potentially decrease the time it takes to fit and transform the clustering algorithm to the dataset.

# Instantiating the scaler
scaler = MinMaxScaler()

# Scaling the categories, then replacing the old values
df = df[['Bios']].join(
    pd.DataFrame(
        scaler.fit_transform(df.drop('Bios', axis=1)),
        columns=df.columns[1:],
        index=df.index))

Vectorizing the Bios

Next, we will have to vectorize the bios from the fake profiles. We will create a new DataFrame containing the vectorized bios and drop the original ‘Bios’ column. With vectorization we will implement two different approaches to see if they have a significant effect on the clustering algorithm. Those two approaches are Count Vectorization and TFIDF Vectorization. We will experiment with both to find the optimum vectorization method.

Here we have the option of either using CountVectorizer() or TfidfVectorizer() for vectorizing the dating profile bios. When the Bios have been vectorized and placed into their own DataFrame, we will concatenate them with the scaled dating categories to create a new DataFrame with all the features we need.
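A minimal sketch of this step, assuming the scaled df from above (TfidfVectorizer() is shown, but CountVectorizer() drops in the same way; the DataFrame names are illustrative):

# Instantiating the vectorizer; CountVectorizer() can be swapped in here
vectorizer = TfidfVectorizer()

# Fitting the vectorizer to the bios and densifying the sparse result
x = vectorizer.fit_transform(df['Bios'])

# Placing the vectorized bios into their own DataFrame
df_words = pd.DataFrame(x.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=df.index)

# Concatenating the vectorized bios with the scaled categories
# and dropping the original 'Bios' column
new_df = pd.concat([df_words, df.drop('Bios', axis=1)], axis=1)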

Our DF that includes the vectorized bios and scaled dating categories

This final DF has more than 100 features. Because of this, we will have to reduce the dimensionality of our dataset by using Principal Component Analysis (PCA).

PCA on the DataFrame

In order to reduce this large feature set, we will have to implement Principal Component Analysis (PCA). This technique will reduce the dimensionality of our dataset while still retaining much of the variability or valuable statistical information.

What we are doing here is fitting and transforming our last DF, then plotting the cumulative explained variance against the number of features. This plot will visually tell us how many features account for the variance.
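A minimal sketch of this step, assuming the new_df from the previous snippet (the plotting details are assumptions; matplotlib and numpy are used for the curve):

import numpy as np
import matplotlib.pyplot as plt

# Fitting PCA with all components to inspect the explained variance
pca = PCA()
pca.fit(new_df)

# Plotting the cumulative explained variance against the number of features
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Features')
plt.ylabel('Cumulative Explained Variance')
plt.show()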

Plot: number of features accounting for the percentage of the variance

After running our code, the number of features that account for 95% of the variance is 74. With that number in mind, we can apply it to our PCA function to reduce the number of Principal Components or Features in our last DF from 117 to 74. These features will now be used instead of the original DF to fit our clustering algorithm.
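A minimal sketch of applying that number, again assuming new_df (passing n_components=0.95 instead would select the same count automatically):

# Reducing the DF to the 74 components that retain 95% of the variance
pca = PCA(n_components=74)
df_pca = pca.fit_transform(new_df)

# df_pca now has 74 columns and is what we fit the clustering algorithm to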

