Topic Modeling in Power BI using PyCaret
In our last post, we demonstrated how to implement clustering analysis in Power BI by integrating it with PyCaret, thus allowing analysts and data scientists to add a layer of machine learning to their reports and dashboards without any additional license costs.
In this post, we will see how we can implement topic modeling in Power BI using PyCaret. If you haven’t heard about PyCaret before, please read this announcement to learn more.
Learning Goals of this Tutorial
- What is Natural Language Processing?
- What is Topic Modeling?
- Train and implement a Latent Dirichlet Allocation model in Power BI.
- Analyze results and visualize information in a dashboard.
Before we start
If you have used Python before, it is likely that you already have Anaconda Distribution installed on your computer. If not, click here to download Anaconda Distribution with Python 3.7 or greater.
Setting up the Environment
Before we start using PyCaret’s machine learning capabilities in Power BI, we have to create a virtual environment and install pycaret. It’s a four-step process:
✅ Step 1 — Create an anaconda environment
Open Anaconda Prompt from start menu and execute the following code:
conda create --name powerbi python=3.7
“powerbi” is the name of the environment we have chosen. You can use whatever name you like.
✅ Step 2 — Install PyCaret
Execute the following code in Anaconda Prompt. Make sure the powerbi environment created in Step 1 is activated first (conda activate powerbi), so that pycaret is installed into that environment rather than the base one:
pip install pycaret
Installation may take 15–20 minutes. If you are having issues with installation, please see our GitHub page for known issues and resolutions.
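Once installation finishes, a quick optional check in the same prompt confirms that pycaret imports without errors (no output means the import succeeded):
python -c "import pycaret"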
✅ Step 3 — Set Python Directory in Power BI
The virtual environment created must be linked with Power BI. This can be done using Global Settings in Power BI Desktop (File → Options → Global → Python scripting). By default, Anaconda environments are installed under:
C:\Users\<username>\Anaconda3\envs\
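For the environment created in Step 1, the full path to point Power BI at would typically be the following (the exact location depends on where Anaconda is installed on your machine):
C:\Users\<username>\Anaconda3\envs\powerbi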
✅ Step 4 — Install Language Model
In order to perform NLP tasks, you must download the language models by executing the following commands in your Anaconda Prompt.
First activate your conda environment in Anaconda Prompt:
conda activate powerbi
Download the English language model and the TextBlob corpora:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
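To confirm the spaCy model downloaded correctly, you can optionally try loading it from the same prompt (no output means it loaded fine):
python -c "import spacy; spacy.load('en_core_web_sm')"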
What is Natural Language Processing?
Natural language processing (NLP) is a subfield of computer science and artificial intelligence that is concerned with the interactions between computers and human languages. In particular, NLP covers a broad range of techniques for programming computers to process and analyze large amounts of natural language data.
NLP-powered software helps us in our daily lives in various ways and it is likely that you have been using it without even knowing. Some examples are:
- Personal assistants: Siri, Cortana, Alexa.
- Auto-complete: In search engines (e.g. Google, Bing, Baidu, Yahoo).
- Spell checking: Almost everywhere, in your browser, your IDE (e.g. Visual Studio), desktop apps (e.g. Microsoft Word).
- Machine Translation: Google Translate.
- Document Summarization Software: Text Compactor, Autosummarizer.
Topic modeling is a type of statistical modeling used for discovering abstract topics in text data. It is one of many practical applications within NLP.
What is Topic Modeling?
A topic model is a type of statistical model that falls under unsupervised machine learning and is used for discovering abstract topics in text data. The goal of topic modeling is to automatically find the topics/themes in a set of documents.
Some common use-cases for topic modeling are:
- Summarizing large text data by classifying documents into topics (the idea is pretty similar to clustering).
- Exploratory Data Analysis to gain an understanding of data such as customer feedback forms, Amazon reviews, survey results, etc.
- Feature Engineering: creating features for supervised machine learning experiments such as classification or regression.
There are several algorithms used for topic modeling. Some common ones are Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF). Each algorithm has its own mathematical details which will not be covered in this tutorial. We will implement a Latent Dirichlet Allocation (LDA) model in Power BI using PyCaret’s NLP module.
If you are interested in learning the technical details of the LDA algorithm, you can read this paper.
Text preprocessing for Topic Modeling
In order to get meaningful results from topic modeling, text data must be processed before feeding it to the algorithm. This is common with almost all NLP tasks. The preprocessing of text is different from the classical preprocessing techniques often used in machine learning when dealing with structured data (data in rows and columns).
PyCaret automatically preprocesses text data by applying over 15 techniques such as stop-word removal, tokenization, lemmatization, and bi-gram/tri-gram extraction. If you would like to learn more about all the text preprocessing features available in PyCaret, click here.
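For context, here is a minimal sketch of how this preprocessing and model training would look in a notebook using PyCaret’s NLP module, assuming the Kiva data introduced below is already loaded into a DataFrame called data. setup(), create_model(), and assign_model() are the module’s standard functions; setup() handles the text preprocessing internally:
from pycaret.nlp import setup, create_model, assign_model
# setup() runs the text preprocessing pipeline on the 'en' column
nlp = setup(data=data, target='en')
# train an LDA model with 4 topics, then attach topic weights to the data
lda = create_model('lda', num_topics=4)
results = assign_model(lda)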
Setting the Business Context
Kiva is an international non-profit founded in 2005 in San Francisco. Its mission is to expand financial access to underserved communities in order to help them thrive.
In this tutorial we will use the open dataset from Kiva which contains loan information on 6,818 approved loan applicants. The dataset includes information such as loan amount, country, gender and some text data which is the application submitted by the borrower.
Our objective is to analyze the text data in the ‘en’ column to find abstract topics and then use them to evaluate the effect of certain topics (or certain types of loans) on the default rate.
Let’s get started
Now that you have set up the Anaconda environment, understand topic modeling, and have the business context for this tutorial, let’s get started.
1. Get Data
The first step is importing the dataset into Power BI Desktop. You can load the data using a web connector. (Power BI Desktop → Get Data → From Web).
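If you would like to experiment with the same data outside of Power BI first, the Kiva dataset also ships with PyCaret’s built-in sample datasets and can be loaded directly in a notebook (optional; not required for the Power BI workflow):
# optional: load the Kiva dataset in a notebook via PyCaret's sample datasets
from pycaret.datasets import get_data
data = get_data('kiva')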
2. Model Training
To train a topic model in Power BI we will have to execute a Python script in Power Query Editor (Power Query Editor → Transform → Run python script). Run the following code as a Python script:
from pycaret.nlp import *
# 'dataset' is the name Power BI gives to the DataFrame passed into the script.
# get_topics() preprocesses the 'en' column, trains a topic model, and returns
# the dataset with the topic weight columns appended.
dataset = get_topics(dataset, text='en')
There are 5 ready-to-use topic models available in PyCaret.
By default, PyCaret trains a Latent Dirichlet Allocation (LDA) model with 4 topics. Default values can be changed easily:
- To change the model type use the model parameter within get_topics() .
- To change the number of topics, use the num_topics parameter.
See the example code below for training a Non-Negative Matrix Factorization model with 6 topics:
from pycaret.nlp import *
dataset = get_topics(dataset, text='en', model='nmf', num_topics=6)
Output:
New columns containing topic weights are attached to the original dataset. Here’s how the final output looks in Power BI once you apply the query.
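Depending on the PyCaret version, the output may already include a dominant-topic column. If it does not, you can derive one in a follow-up Python script step in Power Query. This is a minimal sketch that assumes the weight columns are named Topic_0, Topic_1, and so on; check the actual column names in your output first:
# optional follow-up Python script step in Power Query
# assumes topic weight columns are named Topic_0, Topic_1, ... (verify in your output)
topic_cols = [c for c in dataset.columns if c.startswith('Topic_')]
dataset['Dominant_Topic'] = dataset[topic_cols].idxmax(axis=1)
dataset['Dominant_Topic_Weight'] = dataset[topic_cols].max(axis=1)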
3. Dashboard
Once you have the topic weights in Power BI, here’s an example of how you can visualize them in a dashboard to generate insights:
You can download the PBIX file and the dataset from our GitHub.
If you would like to learn more about implementing Topic Modeling in Jupyter notebook using PyCaret, watch this 2 minute video tutorial:
If you are interested in learning more about topic modeling, you can also check out our NLP 101 Notebook Tutorial for beginners.