Azure Machine Learning Service — Where is My Data?



Part 4: An introduction to Datasets & Datastores

Abstract

Data management is perhaps one of the most important aspects of a cloud machine learning platform. Every machine learning problem needs data, and this data typically comes from diverse sources before being refined, pruned, and massaged for analysis and for consumption by the ML model. For this reason, the cloud machine learning platform provides data management SDKs that can define, secure, and manage data sources on the platform.

This post is extracted from the Kaggle notebook hosted here. Use the link to set up and execute the experiment.


Photo by Lukas from Pexels

Datastores and Datasets

Datastores

A datastore is a data management capability provided by the Azure Machine Learning (AML) service and its SDK. It lets us connect to various data sources, which can then be used to ingest data into an ML experiment or to write outputs from those experiments. Azure provides several platform services that can serve as a data source, e.g., blob storage, a data lake, SQL Database, Databricks, and many others.

The Azure ML workspace has a natural integration with datastores defined in Azure, such as Blob Storage and File Storage. However, executing an ML model may require data and dependencies from other, external sources. Hence, the AML SDK lets us register these external sources as datastores for model experiments. Defining a datastore enables us to reuse the data across multiple experiments, regardless of the compute context in which the experiment runs.

Register Datastores

As discussed, datastores are of two types: default and user provisioned, such as blob storage containers or file storage. To get the default datastore of a workspace:

# get the default datastore associated with the workspace
default_ds = ws.get_default_datastore()
default_dsname = default_ds.name
print('default Datastore = ', default_dsname)

To register a blob container as a datastore using the AML SDK:

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# register a new datastore backed by a blob container
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='11223312345cadas6abcde789…')

To set or change the default datastore:

ws.set_default_datastore('blob_data')


View from Azure ML — Datastores
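The datastores visible in the portal view can also be enumerated from the SDK. A minimal sketch, assuming `ws` is the workspace object from above and `blob_data` is the datastore registered earlier; `ws.datastores` is a name-to-datastore mapping, and `Datastore.get` fetches one by its registered name:

```python
from azureml.core import Datastore

# ws.datastores maps each registered datastore name to its Datastore object
for name in ws.datastores:
    print(name)

# retrieve a specific datastore by its registered name
blob_ds = Datastore.get(ws, 'blob_data')
```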

Upload files to Datastores

Upload files from the local system to the remote datastore. This allows experiments to run directly against the remote data location. The target_path is the path of the files at the remote datastore location. A reference path is returned once the files are uploaded to the datastore. When you want to use a datastore in an experiment script, you must pass a data reference to the script.

default_ds.upload_files(files=['../input/iris-flower-dataset/IRIS.csv'],
                        target_path='flower_data/',
                        overwrite=True, show_progress=True)

flower_data_ref = default_ds.path('flower_data').as_download('ex_flower_data')
print('reference_path = ', flower_data_ref)

Experiment with Data Store

Once we have the reference to the datastore as mentioned above, we need to pass it to the experiment script as a script parameter from the estimator. The value of this parameter can then be retrieved and used as a local folder:

import argparse
from azureml.core import Run

run = Run.get_context()

parser = argparse.ArgumentParser()
# define the regularization parameter for the logistic regression
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
# define the data_folder parameter referencing the path of the registered data folder
parser.add_argument('--data_folder', type=str, dest='data_folder', help='Data folder reference')
args = parser.parse_args()
r = args.reg
ex_data_folder = args.data_folder

Define the estimator as:

estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    script_params={'--reg_rate': 0.07,
                                   # the reference path value as defined above
                                   '--data_folder': flower_data_ref})

The '--data_folder' parameter accepts the datastore folder reference, i.e., the path where the files were uploaded. The script loads the training data from the data reference passed to it as a parameter, so we need to set up the script parameters to pass the file reference when running the experiment.
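Putting the two pieces together, the data-loading part of the training script might look like the sketch below. The file name IRIS.csv matches the file uploaded earlier, while the helper function name is illustrative; the key point is that the passed-in data reference behaves like an ordinary local folder on the compute target:

```python
import argparse
import os

import pandas as pd

def load_training_data(argv=None):
    # parse the parameters supplied via the estimator's script_params
    parser = argparse.ArgumentParser()
    parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
    parser.add_argument('--data_folder', type=str, dest='data_folder')
    args = parser.parse_args(argv)
    # the datastore reference resolves to a local folder at run time,
    # so the uploaded CSV can be read with a plain file path
    csv_path = os.path.join(args.data_folder, 'IRIS.csv')
    return args.reg, pd.read_csv(csv_path)
```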

Datasets

Datasets are packaged data objects that are readily consumable by machine learning pipelines, and they are the recommended way to work with data. They help enable data labeling, versioning, and drift monitoring (to be discussed in upcoming posts). Datasets are defined from the location of data already stored in datastores.

Dataset Types: we can create two types of datasets.

  • Tabular: a structured form of data, mostly read from a table, CSV file, RDBMS, etc., imported from the datastores. Example: a dataframe for a regression problem.
  • File: for unstructured data types; a list of file paths served through datastores. An example use case is reading images to train a CNN.

Create Dataset

Datasets are first created from the datastores and then need to be registered. The example below shows the creation of both tabular and file datasets.

from azureml.core import Dataset

# create a tabular dataset from files in the datastore
tab_dataset = Dataset.Tabular.from_delimited_files(path=(default_ds, 'flower_data/*.csv'))
tab_dataset.take(10).to_pandas_dataframe()

# similarly, create a file dataset from the files already in the datastore;
# useful in scenarios like image processing in deep learning
file_dataset = Dataset.File.from_files(path=(default_ds, 'flower_data/*.csv'))
for fp in file_dataset.to_path():
    print(fp)

Register Datasets

Once the datasets are defined, they need to be registered with the AML workspace. At registration, metadata such as the name, description, tags, and version of the dataset is attached. Versioning lets us keep track of the dataset a model was trained on, and is also useful if we want to train the model on a specific version. The tabular and file datasets are registered as follows:

# register the tabular dataset
tab_dataset = tab_dataset.register(workspace=ws,
                                   name='flower tab ds',
                                   description='Iris flower Dataset in tabular format',
                                   tags={'format': 'CSV'},
                                   create_new_version=True)

# register the file dataset
file_dataset = file_dataset.register(workspace=ws,
                                     name='flower Files ds',
                                     description='Iris flower Dataset in Files format',
                                     tags={'format': 'CSV'},
                                     create_new_version=True)


Registered Datasets in Azure Machine Learning Service

print("Datasets Versions:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Output >>
Datasets Versions:
	 flower Files ds version 1
	 flower tab ds version 1
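Given the versions listed above, a specific version can be retrieved later by passing the version argument to Dataset.get_by_name. A sketch, assuming the workspace and the 'flower tab ds' registration from earlier:

```python
from azureml.core import Dataset

# fetch version 1 of the registered tabular dataset explicitly;
# omitting `version` (or passing 'latest') returns the newest version
tab_ds_v1 = Dataset.get_by_name(ws, name='flower tab ds', version=1)
print(tab_ds_v1.name, 'version', tab_ds_v1.version)
```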

Experiment with the Datasets

  • In the training script that trains a classification model, the tabular dataset is passed as an input and read as a pandas dataframe:
data = run.input_datasets['flower_ds'].to_pandas_dataframe()
  • Use the inputs parameter of the SKLearn estimator to pass the registered dataset, which is then consumed by the training script:
inputs=[tab_dataset.as_named_input('flower_ds')]
  • Also, use the pip_packages parameter so that the runtime environment provisions the packages required for the dataset's pandas operations. Since the script works with a Dataset object, you must include either the full azureml-sdk package or the azureml-dataprep package with the pandas extra library in the script's compute environment:
pip_packages=['azureml-dataprep[pandas]']
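Combining the fragments above, the estimator definition for the dataset-based run might look like the following sketch (the source directory, entry script, and regularization value mirror the earlier estimator and are illustrative):

```python
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    script_params={'--reg_rate': 0.07},
                    # make the registered dataset available to the script as 'flower_ds'
                    inputs=[tab_dataset.as_named_input('flower_ds')],
                    # provision the package needed for Dataset-to-pandas conversion
                    pip_packages=['azureml-dataprep[pandas]'])
```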

Conclusion

With this, we have covered one of the most important aspects of the AML service: data management for models and experiments. Data is a fundamental element of any machine learning workload, and here we learned how to create and manage datastores and datasets in an Azure Machine Learning workspace, and how to use them in model training experiments.

In the next post, we will touch upon the various compute environments supported by the AML service. Till then, stay tuned!

References

[1] Notebook & Code: Azure Machine Learning — Working with Data, Kaggle.

[2] Azure ML Service, Data Reference Guide, Official Documentation, Microsoft Azure.

[3] Azure Machine Learning Service Official Documentation, Microsoft Azure.

