Azure Machine Learning Service — Where is My Data?
Part 4: An introduction to Datasets & Datastores
Jul 5 · 5 min read
Abstract
Data management is perhaps one of the most important aspects of a cloud machine learning platform. Every machine learning problem needs data. This data usually comes from diverse sources; it is then refined, pruned, and shaped for various analyses and for consumption by the ML model. For this reason, the cloud machine learning platform provides data-management SDKs that can define, secure, and manage data sources on the cloud platform.
This post is extracted from the Kaggle notebook hosted here. Use the link to set up and execute the experiment.
Datastores and Datasets
Datastores
Datastores are a data-management capability, exposed through the SDK, provided by the Azure Machine Learning (AML) service. They let us connect to various data sources, which can then be used to ingest data into an ML experiment or to write outputs from those experiments. Azure provides various platform services that can act as a data source, e.g., Blob storage, Data Lake, SQL Database, Databricks, and many others.
The Azure ML workspace integrates naturally with the datastores defined in Azure, such as Blob Storage and File Storage. However, executing an ML model may require data and its dependencies from other external sources. Hence, the AML SDK provides a way to register these external sources as datastores for model experiments. Defining a datastore lets us reuse the data across multiple experiments, regardless of the compute context in which the experiment is running.
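For reference, the datastores already registered in a workspace can be listed directly from the SDK. The following is a minimal sketch, assuming a workspace config file is available locally; 'workspaceblobstore' is the name Azure assigns to the built-in default blob datastore.
from azureml.core import Workspace, Datastore
# Load the workspace from the saved config file (config.json).
ws = Workspace.from_config()
# Enumerate every datastore registered in the workspace.
for name, ds in ws.datastores.items():
    print(name, ':', ds.datastore_type)
# Retrieve the built-in default blob datastore by name.
blob_store = Datastore.get(ws, 'workspaceblobstore')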
Register Datastores
As discussed, datastores are of two types: default and user-provisioned, such as Blob Storage containers or file storage. To get the default datastore of a workspace:
# get the name of the default Datastore associated with the workspace.
default_dsname = ws.get_default_datastore().name
default_ds = ws.get_default_datastore()
print('default Datastore = ', default_dsname)
To register a Blob Storage container as a datastore using the AML SDK:
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
datastore_name='blob_data',
container_name='data_container',
account_name='az_store_acct',
account_key='11223312345cadas6abcde789…')
To set or change the default datastore —
ws.set_default_datastore('blob_data')
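Once registered, a datastore can be retrieved by name wherever the workspace object is available. A minimal sketch, assuming the 'blob_data' registration shown above:
from azureml.core import Datastore
# Retrieve the datastore by the name it was registered under.
blob_ds = Datastore.get(ws, 'blob_data')
print(blob_ds.name, blob_ds.datastore_type)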
Upload files to Datastores
Upload files from the local system to the remote datastore. This allows experiments to run directly against the remote data location. The target_path is the path of the files at the remote datastore location. A reference path is returned once the files are uploaded to the datastore. When we want to use a datastore in an experiment script, we must pass a data reference to the script.
default_ds.upload_files(files=['../input/iris-flower-dataset/IRIS.csv'],
                        target_path='flower_data/',
                        overwrite=True, show_progress=True)

flower_data_ref = default_ds.path('flower_data').as_download('ex_flower_data')
print('reference_path = ',flower_data_ref)
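Note that as_download() copies the referenced files to the compute target before the run starts. For larger data, the reference can instead be mounted, provided the compute target supports mounting. A minimal sketch of the alternative, using the same default_ds and flower_data path as above:
# Mount the datastore path on the compute target instead of downloading it.
# Files are then read on demand rather than copied up front.
flower_data_mount = default_ds.path('flower_data').as_mount()
print('mount reference = ', flower_data_mount)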
Experiment with the Datastore
Once we have the reference to the datastore as described above, we need to pass it to the experiment script as a script parameter from the estimator. The value of this parameter can then be retrieved in the script and used as a local folder:
#### in the experiment script (e.g., iris_simple_experiment.py) ####
import argparse
from azureml.core import Run

run = Run.get_context()

# define the regularization parameter for the logistic regression.
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)

# define the data_folder parameter for referencing the path of the registered data folder.
parser.add_argument('--data_folder', type=str, dest='data_folder', help='Data folder reference')

args = parser.parse_args()
r = args.reg
ex_data_folder = args.data_folder
####
Define the estimator as —
####
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_simple_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    script_params={'--reg_rate': 0.07,
                                   # assign the reference path value as defined above.
                                   '--data_folder': flower_data_ref})
####
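With the estimator defined, the run is submitted to an experiment in the usual way. A minimal sketch follows; the experiment name 'iris-datastore-experiment' is illustrative.
from azureml.core import Experiment
# Create (or reuse) an experiment and submit the estimator-based run.
experiment = Experiment(workspace=ws, name='iris-datastore-experiment')
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)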
In the estimator above, '--data_folder' accepts the datastore folder reference, i.e., the path where the files were uploaded. The script loads the training data from the data reference passed to it as a parameter, so we must set up the script parameters to pass the file reference when running the experiment.
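Inside the training script, the passed folder reference behaves like a local path, so the uploaded CSV can be read with pandas. A minimal sketch, assuming the IRIS.csv file uploaded earlier and the ex_data_folder argument parsed in the script excerpt above:
import os
import pandas as pd

# ex_data_folder is the local path the data reference resolves to at run time.
data_path = os.path.join(ex_data_folder, 'IRIS.csv')
data = pd.read_csv(data_path)
print('rows loaded:', len(data))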
Datasets
Datasets are packaged data objects that are readily consumable by machine learning pipelines, and they are the recommended way to work with data. They enable data labeling, versioning, and drift monitoring (to be discussed in upcoming posts). Datasets are defined from data already stored in datastores.
Dataset Types: we can create two types of datasets.
- Tabular: a structured form of data, mostly read from a table, CSV, RDBMS, etc., imported from the datastores. Example: a dataframe for a regression problem.
- File: for unstructured data types; a list of file paths can be consumed through datastores. An example use case is reading images to train a CNN.
Create Dataset
Datasets are first created from the datastores and then need to be registered. The example below shows the creation of both tabular and file datasets.
from azureml.core import Dataset

# Creating a tabular dataset from files in the datastore.
tab_dataset = Dataset.Tabular.from_delimited_files(path=(default_ds, 'flower_data/*.csv'))
tab_dataset.take(10).to_pandas_dataframe()

# Similarly, creating a file dataset from the files already in the datastore.
# Useful in scenarios like image processing in deep learning.
file_dataset = Dataset.File.from_files(path=(default_ds, 'flower_data/*.csv'))
for fp in file_dataset.to_path():
    print(fp)
Register Datasets
Once the datasets are defined, they need to be registered with the AML workspace. At this point, metadata such as the name, description, tags, and version of the dataset is attached. Versioning lets us keep track of the dataset version an experiment was trained on, and it is also useful if we want to train the model on a specific version. The following registers the tabular and file datasets:
# register tabular dataset
tab_dataset = tab_dataset.register(workspace=ws, name='flower tab ds', description='Iris flower Dataset in tabular format', tags={'format':'CSV'}, create_new_version=True)
#register File Dataset
file_dataset = file_dataset.register(workspace=ws, name='flower Files ds', description='Iris flower Dataset in Files format', tags={'format':'CSV'}, create_new_version=True)
print("Datasets Versions:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Output >>
Datasets Versions:
flower Files ds version 1
flower tab ds version 1
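Because every registration with create_new_version=True adds a new version, a particular version can be pinned when retrieving a dataset. A minimal sketch using the registered tabular dataset:
# Retrieve the latest version (default) or pin an explicit version number.
latest_tab_ds = Dataset.get_by_name(ws, name='flower tab ds')
v1_tab_ds = Dataset.get_by_name(ws, name='flower tab ds', version=1)
print(latest_tab_ds.version, v1_tab_ds.version)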
Experiment with the Datasets
- In the training script that trains a classification model, the tabular dataset is passed as an input and read as a Pandas dataframe:
data = run.input_datasets['flower_ds'].to_pandas_dataframe()
- Use the inputs parameter of the SKLearn estimator to pass the registered dataset, which is then consumed by the training script:
inputs=[tab_dataset.as_named_input('flower_ds')]
- Also, use the pip_packages parameter so the runtime environment provisions the packages required to support AML Pandas operations. Since the script will need to work with a Dataset object, you must include either the full azureml-sdk package or the azureml-dataprep package with the pandas extra in the script's compute environment (see the consolidated sketch after this list):
pip_packages=['azureml-dataprep[pandas]']
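Putting these pieces together, an estimator that feeds the registered tabular dataset to the training script might look like the following sketch. The entry script and experiment names are illustrative; the inputs and pip_packages arguments are the ones discussed above.
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn

# Estimator that passes the registered tabular dataset as a named input.
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='iris_dataset_experiment.py',
                    compute_target='local',
                    use_docker=False,
                    inputs=[tab_dataset.as_named_input('flower_ds')],
                    # azureml-dataprep[pandas] lets the script read the dataset as a dataframe.
                    pip_packages=['azureml-dataprep[pandas]'])

run = Experiment(workspace=ws, name='iris-dataset-experiment').submit(estimator)
run.wait_for_completion(show_output=True)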
Conclusion
In this post, we covered one of the most important aspects of the AML service: data management for models and experiments. Data is a fundamental element in any machine learning workload; here we learned how to create and manage datastores and datasets in an Azure Machine Learning workspace, and how to use them in model-training experiments.
In the next post, we will cover the various compute environments supported by the AML service. Till then, stay tuned!
References
[1] Notebook & Code — Azure Machine Learning — Working with Data, Kaggle.
[2] Azure ML Service, Data reference guide, Official Documentation, Microsoft Azure.
[3] Azure Machine Learning Service Official Documentation, Microsoft Azure.