Accessing the AI platform and running machine learning (ML) training code on it must be a smooth and easy experience for researchers. Migrating ML code from a local environment to the platform should not require any refactoring of the code at all, and infrastructure configuration overhead should be minimal. Our mission while developing PyKrylov was to abstract the ML logic from the infrastructure and the Krylov core components (Figure 1) as much as possible, in order to give platform users the best possible experience.
Figure 1. Simple layered representation of Krylov components.
PyKrylov is the pythonic interface to Krylov that researchers and engineers across the company use to access eBay’s AI platform. PyKrylov was built by researchers for researchers, and it has increased the productivity of researchers who want to use Krylov’s powerful compute resources. Onboarding new or existing code to the platform is as easy as adding a few lines of code around the ML logic, without changing the existing implementation. The overhead that comes with PyKrylov is minimal, as the Hello World! example below shows. The user can develop and start the code in a local environment, but the hello_world function will be executed on the Krylov platform and not in the user’s local environment.
def hello_world():
    print('Hello World!')

if __name__ == '__main__':
    import pykrylov

    session = pykrylov.Session()
    task = pykrylov.Task(hello_world)
    run_id = session.submit(task)
PyKrylov is ML framework-agnostic and works with any machine learning library available on the market, including but not limited to PyTorch and TensorFlow. It is even possible to run tasks on Hadoop and Spark. The user can configure each task in different ways, such as specifying a custom Docker image, compute resources, execution parameters, etc. In the example below, the task is configured to run on a V100 GPU resource using a PyTorch docker image. Users do this configuration in simple Python code, and they don’t have to deal with additional JSON or YAML files.
task.run_on_gpu(model='v100')
task.add_execution_parameter('docker', 'pytorch_image:latest')
Workflows
Model training is usually a multi-step workflow that is represented using a collection of tasks and dependencies specified as a directed acyclic graph (DAG). Creating workflows in PyKrylov can be achieved in a very natural way. Figure 2 shows a simple sequential workflow where the execution starts with data_prep and ends with the finish task.
Figure 2. A sequential workflow representing a simple ML pipeline.
In PyKrylov, the above workflow can be simply created using an OrderedDict class that comes with Python. In the below code fragment, this time the Session.submit() function submits the workflow to the AI platform instead of submitting a single task.
from collections import OrderedDict

def data_prep():
    print('do some data prepping')

def train():
    print('train a cool model')

def test():
    print('test your model')

def finish():
    print('do some closing work')

task_data = pykrylov.Task(data_prep)
task_train = pykrylov.Task(train)
task_test = pykrylov.Task(test)
task_finish = pykrylov.Task(finish)

seq_workflow = OrderedDict({
    task_data: [task_train],
    task_train: [task_test],
    task_test: [task_finish]
})

run_id = session.submit(seq_workflow)
It is also possible to submit tasks that are implemented in a programming language other than Python. Through a bash script, the ShellTask class in PyKrylov enables the user to run code in any programming language the user prefers.
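For example, wrapping a bash script that launches non-Python training code might look like the following sketch. The exact ShellTask arguments and the script name are assumptions here, modeled on the DistShellTask calls shown later in this post:

# Hypothetical sketch: run a bash script that launches non-Python training code.
# The constructor arguments are assumptions modeled on DistShellTask below.
shell_task = pykrylov.ShellTask('train_cpp_model.sh', name='cpp_training_task')
shell_task.run_on_gpu(model='v100')

run_id = session.submit(shell_task)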
Additionally, you can convert the sequential workflow into a parallel workflow using the parallelize() function that comes with PyKrylov.
parallel_wf = pykrylov.parallelize(
    workflow=seq_workflow,
    start=train,
    end=test,
    parameter='lr',
    value_list=[0.1, 0.3, 0.5]
)
The DAG representation of parallel_wf generated by the above code snippet is shown in Figure 3. The workflow starts with the data_prep task, and after that task completes, three parallel flows are started with the train function. The finish function is executed only after all three instances of the test function are completed.
Figure 3. A simple parallel workflow for hyperparameter tuning.
It is also possible to define parallel workflows from scratch with OrderedDict definitions. However, PyKrylov users prefer to use the workflow modification functions to create parallel workflows. Bigger workflows can be created by chaining the parallelize() function for every hyperparameter (e.g. batch size and dimension); a sketch of such chaining appears after Figure 4 below. Another way of easily creating hyperparameter tuning workflows is to use the grid_search(), random_search(), and parameter_grid() functions, which are implemented in PyKrylov similarly to their counterparts in the scikit-learn package.
parallel_wf = pykrylov.grid_search(
    workflow=seq_workflow,
    parameters_list=pykrylov.parameter_grid({
        'lr': [0.1, 0.3, 0.5],
        'dim': [100, 200, 300],
    }),
    start=train,
    end=test
)
The final workflow looks like the DAG depicted in Figure 4.
Figure 4. A complex parallel workflow for hyperparameter tuning.
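For comparison, the chaining approach mentioned above could look roughly like the sketch below. It simply calls the parallelize() function shown earlier once per hyperparameter; the assumption, based on the description of chaining, is that the second call fans out every branch created by the first one, yielding the same nine-way DAG:

# Sketch: chain parallelize() once per hyperparameter.
# Assumption: a second call fans out every branch of the already-parallel workflow.
parallel_wf = pykrylov.parallelize(
    workflow=seq_workflow,
    start=train,
    end=test,
    parameter='lr',
    value_list=[0.1, 0.3, 0.5]
)
parallel_wf = pykrylov.parallelize(
    workflow=parallel_wf,
    start=train,
    end=test,
    parameter='dim',
    value_list=[100, 200, 300]
)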
Manage Workflow Status
In PyKrylov, tracking and managing workflow status after submission are straightforward. Session.job_show() shows the status of each task and the overall status of the run. Session.job_pause(), Session.job_stop(), and Session.job_resume() allow the users to pause, stop, or resume the runs. When a task is pending, Session.job_info() is very useful to peek at what is going on, e.g. if it is waiting for resources.
session = pykrylov.Session()

response = session.job_show(run_id)
session.job_stop(run_id)
session.job_resume(run_id)
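The remaining calls mentioned above follow the same pattern. Assuming they also take the run ID, which is an assumption made by analogy with the calls shown, pausing a run and inspecting a pending one might look like this:

# Assumption: job_pause() and job_info() take the run ID like the calls above.
session.job_pause(run_id)
info = session.job_info(run_id)  # e.g. to check whether a pending task is waiting for resources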
Distributed Training
Distributed training leverages multiple machines to reduce training time. Krylov supports popular frameworks such as TensorFlow, PyTorch, Keras, and Horovod, which support distributed training natively. Krylov provides stable IPs for the pods in a distributed training job; if a pod goes down during training, Krylov brings it back with the same IP so that training can resume.
The pykrylov.distributed package allows users to launch distributed training workflows on Krylov with their distributed training code in the framework of their choice. The experience is similar to launching non-distributed training workflows, but PyKrylov automatically generates the configuration files needed for parallelism and the services that provide the stable pod IPs. The DistributedTask class enables users to run distributed training implemented in Python, and the DistShellTask class enables users to run distributed training implemented in other languages, as long as it can be started from a shell. Below we show two sample code snippets: the first submits a distributed training run from the Python function mnist_train, and the second creates a distributed training run from the shell scripts run_chief.sh and worker.sh.
from mnist import train as mnist_train

master_service_name = 'masterSVC'  # example service name; not defined in the original snippet
parallelism = 2

task = pykrylov.distributed.DistributedTask(mnist_train,
                                            name='mnist_task_name',
                                            parallelism=parallelism)
task.add_service(name=master_service_name, port=2020)
task.run_on_gpu(quantity=1, model='m40')

session = pykrylov.Session()
run_id = session.submit(task)
from collections import OrderedDict

# Another way to specify the service: directly in the DistShellTask constructor
chief_task = pykrylov.distributed.DistShellTask('run_chief.sh',
                                                name='chief_task_name',
                                                parallelism=1,
                                                service_name='chiefSVC',
                                                service_port=22)
chief_task.run_on_gpu(quantity=1, model='p100')

worker_task = pykrylov.distributed.DistShellTask('worker.sh',
                                                 name='worker_task_name',
                                                 parallelism=3,
                                                 service_name='workerSVC',
                                                 service_port=22)
worker_task.run_on_gpu(quantity=1, model='p100')

session = pykrylov.Session()
workflow = OrderedDict({chief_task: [], worker_task: []})
run_id = session.submit(workflow)
Experiment Management System (EMS)
The search for the best model involves multiple iterations of hyperparameter tuning, running multiple experiments in parallel, and comparing the results obtained from each of them. Before the Experiment Management System (EMS), Krylov users had to do manual bookkeeping of the hyperparameters, workflow information, and other metadata related to training. EMS provides the ability to track experiments, manage logs, and manage generated models (regardless of whether a model will be picked for production or not), and to visualize training status and logs on the Krylov dashboard. Moreover, users can record and visualize computed metrics, such as loss and precision values, with timestamps.
pykrylov.ems provides a simple pythonic way to create and update experiments, and to associate metadata, logs, models, metrics, or other files that users generate as assets with their experiments.
config = {
    'lr': 0.0001,
    'model': 'CNN',
    'dataset': 'mnist',
    'epochs': 100
}

def train(config, model, optimizer, mnist_dataset):
    exp_id = pykrylov.ems.create_experiment('my_project', 'my_experiment',
                                            configurations=config)
    step = 0
    for epoch in range(config['epochs']):
        for batch in mnist_dataset.train:
            output = model(batch.data)
            loss = calc_loss(output, batch.target)
            pykrylov.ems.write_metric(exp_id, name='loss', value=loss,
                                      dimension={'step': step})
            loss.backward()
            optimizer.step()
            step += 1
        precision = calc_precision(mnist_dataset.dev)
        pykrylov.ems.write_metric(exp_id, name='dev_precision', value=precision)
Model Management System (MMS)
Trained models need to be accessible for inference in a production environment. Versioning and tagging models, as well as recording the metadata associated with a model (e.g. accuracy, training dataset, hyperparameters), is necessary. Without a Model Management System (MMS), this task can become daunting for data scientists, as it requires manual and complicated solutions. For this purpose, the Krylov MMS was developed to provide a centralized solution for storing models, where versioning models and bookkeeping of metadata are supported and seamlessly integrated with training and inferencing. With the pykrylov.mms module in PyKrylov, data scientists can push models to MMS during training with ease. The pykrylov.mms module can also be used locally to upload and download models to and from MMS. The module also provides model discoverability to users.
revision = pykrylov.mms.create_model('my_project', 'my_model_name', model_file_list, 'my_tag')
print(pykrylov.mms.show_model('my_project', 'my_model_name', 'my_tag', revision=revision))
pykrylov.mms.download_revision('my_project', 'my_model_name', 'my_tag', 'save_to_dir', latest=True)
Conclusions
We have presented PyKrylov and shown how it accelerates machine learning research at eBay. Submitting ML tasks is simplified, and configuration overhead is reduced. Users can onboard their code to the platform in a few lines of Python code. In our journey to democratize machine learning, this is only half of the story. Our next step is to provide researchers with the necessary tools for specific domains like NLP and CV. We will provide more details about this in another blog article.