Setting Up Your Data Science Work Bench
Get your computer ready for learning data science
Mar 28 · 6 min read
In my last post, I covered the core tools required for data science work. In this article, I am going to give a step-by-step guide to getting your computer set up to perform typical data science and machine learning tasks.
I personally work on a Mac, so most of the setup instructions are written for that operating system.
Install Python
As discussed in my last post, Python is now the most popular programming language among data science practitioners. The first step in configuring your computer is therefore to install Python.
To install and configure Python on your computer you will need to use the terminal. If you have not already set this up you will need to download and install Xcode (Apple’s Integrated Development Environment).
Older versions of macOS come with Python 2.7 already installed. However, for many data science projects you will need to be able to work with a variety of different Python versions.
There are a number of tools for installing and managing different Python versions, but pyenv is probably one of the simplest to use. Pyenv supports the management of Python versions at both the user and project level.
To install pyenv you will first need to install Homebrew, which is a package manager for Mac. Once you have this you can install pyenv with this command (for Windows installation see these instructions).
brew install pyenv
You will then need to add the pyenv initializer to your shell startup script and reload your bash_profile by running the following.
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
source ~/.bash_profile
To view the versions of Python that are available to install, run the following command.
pyenv install --list
To install a new version of Python simply run the following.
pyenv install <python-version>
# e.g.
pyenv install 3.7.7
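Once a version is installed and selected (pyenv provides `pyenv global <version>` and `pyenv local <version>` for the user and project level respectively), it can be worth confirming which interpreter is actually running. The snippet below is only a minimal sketch of one way to check; the helper name `interpreter_summary` is my own and not part of pyenv.

```python
import sys


def interpreter_summary():
    """Return the running interpreter's version and location as one string."""
    # sys.version_info holds (major, minor, micro, ...); keep the first three.
    version = ".".join(str(part) for part in sys.version_info[:3])
    # sys.executable is the path of the interpreter that is running this code.
    return f"Python {version} at {sys.executable}"


print(interpreter_summary())
```

If pyenv is managing the interpreter, the printed path would normally point somewhere inside pyenv's own directory rather than at the system Python.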
Install Python packages
Pip is the preferred installer for Python packages and is included by default with Python 3.4 and above. You will need this to install any open-source Python libraries.
To install a package using pip simply run the following.
pip install <package>
# e.g.
pip install pandas
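After installing a package, you may want to check from within Python that it is available and which version you ended up with. The following is a rough sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); the function name `installed_version` is a hypothetical helper of mine.

```python
from importlib import metadata  # in the standard library from Python 3.8


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Raised when no installed distribution has this name.
        return None


# Prints a version string if pandas is installed, otherwise None.
print(installed_version("pandas"))
```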
Virtual environments
Different Python projects will require different dependencies and versions of Python. It is therefore important to have a way to create isolated and reproducible environments for each project. Virtual environments accomplish this.
There are a number of tools for creating Python virtual environments, but I personally use pipenv.
Pipenv can be installed with Homebrew.
brew install pipenv
To create a new environment using a specific version of Python, make a new directory and then run the following commands, the last one from inside your new directory.
mkdir pip-test
cd pip-test
pipenv --python 3.7
To activate the environment run pipenv shell. You will now be in a new environment called ‘pip-test’.
If you inspect the contents of the directory you will see that pipenv has created a new file called Pipfile. This is the pipenv equivalent of a requirements file and contains all packages and versions that are used in the environment.
To install packages into the pipenv environment simply run the following.
pipenv install pandas
Any packages you install will be recorded in the Pipfile, which means that the environment can be recreated easily from this file.
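If you are ever unsure whether your code is running inside the virtual environment or against the system Python, you can ask the interpreter. This is a small sketch rather than anything pipenv itself provides; `in_virtual_environment` is a name I have made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # With venv and newer virtualenv releases, sys.prefix points at the
    # environment while sys.base_prefix points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(in_virtual_environment())
```

Run from inside an activated pipenv shell I would expect this to print True, and False when run with the system interpreter.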
Jupyter Notebooks
Jupyter Notebooks are a web-based application for writing code. They are particularly suited to data science tasks because they enable you to render documentation, diagrams, tables and charts directly in line with your code. This creates a highly interactive and shareable platform for developing data science projects.
To install Jupyter Notebooks simply run pip install notebook or, if you are working in a pipenv shell, pipenv install notebook.
As this is a web-based application you need to start the notebook server to begin writing your code. You do this by running this command.
jupyter notebook
This will open the application in your web browser; the default URL is http://127.0.0.1:8888.
Jupyter Notebooks are able to work with virtual environments so that you are able to run the notebooks for a project in the correct project environment. To make the pipenv environment available in the web application you need to run the following.
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
# e.g.
python -m ipykernel install --user --name jup-test --display-name "Python (jup-test)"
If you now restart the web application and go to New you will see your pipenv environment available. Selecting this will start a new notebook that will run with all the dependencies you have set up in your pipenv shell.
Python IDE
Jupyter Notebooks are very good for exploratory data science projects and for writing code that you will only use once. However, for efficiency, it is a good idea to write commonly used pieces of code into functions within modules that can be imported and used across projects (this is known as modularising your code).
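As a sketch of what modularising might look like, the function below could live in a file of its own and be imported wherever it is needed. The file name `helpers.py` and the function `summarise` are hypothetical examples of mine, not part of any library.

```python
# Hypothetical contents of a module file such as helpers.py
from statistics import mean, median


def summarise(values):
    """Return basic summary statistics for a sequence of numbers."""
    numbers = list(values)
    if not numbers:
        raise ValueError("summarise() needs at least one value")
    return {
        "count": len(numbers),
        "mean": mean(numbers),
        "median": median(numbers),
        "min": min(numbers),
        "max": max(numbers),
    }


print(summarise([3, 1, 4, 1, 5]))
```

Assuming the file sits somewhere Python can find it, a notebook or another script could then reuse the function with `from helpers import summarise` instead of copying the code.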
Notebooks are not particularly well suited to writing modules. For this task, it is better to use an IDE (Integrated Development Environment). There are many available but I personally use PyCharm. The benefit of using an IDE is that it comes with tools such as GitHub integration and unit testing built in.
PyCharm has both a paid professional version and a free community edition. To download and install PyCharm visit this website and follow the installation instructions.
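To give a feel for the unit testing mentioned above, here is a minimal sketch using the `unittest` module from the Python standard library. The `normalise` function and the test names are invented purely for illustration; an IDE would typically let you run tests like these from its interface.

```python
import unittest


def normalise(values):
    """Scale numbers linearly so the smallest becomes 0 and the largest 1."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


class NormaliseTests(unittest.TestCase):
    def test_range_is_zero_to_one(self):
        self.assertEqual(normalise([10, 15, 20]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(normalise([7, 7]), [0.0, 0.0])


def run_tests():
    """Run the test case and return True if every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseTests)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()


print(run_tests())
```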
Version control
One final tool you will want to use for your data science projects is GitHub. This is the most commonly used platform for version control, and it is built on top of the version control tool Git. Version control essentially involves storing a version of your project online. Development is then performed locally on branches. A branch is essentially a copy of the project where changes can be made that will not affect the master version.
Once changes have been made locally you can push them to GitHub, where they can be merged into the master branch through a controlled process known as a pull request.
Using GitHub will enable you to track changes to your project. You can also make changes and test the impact they have before integrating them into the final version. GitHub also enables collaboration with others as they can safely make changes without impacting the master branch.
To use GitHub you first need to install Git, which can be done by following these instructions. You will then need to visit the GitHub website and create an account.
Once you have an account you can create a new repository.
To work on the project locally you will need to clone the repository.
cd my-directory
git clone https://github.com/rebeccavickery/my-repository.git
This article is a guide to setting up your computer ready to work on data science projects. Many of the tools I have listed are my personally preferred tools. However, in most cases, there are several alternatives. It is worth exploring the different options to find those best suited to your working style and projects.
Thanks for reading!