Distributed Machine Learning and Homomorphic Encryption, Part 2

Category: Backend · Published: 6 years ago


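Every data scientist will tell you that datasets are the lifeblood of artificial intelligence (AI). In the previous post, we demonstrated a simple secure protocol for federated learning using the python-paillier library. In this post, we explore how to use an encrypted model to score remote data.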

Predicting with an encrypted model

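Scoring remote data with an encrypted model is interesting mainly for privacy reasons. The owner of the model (and of the training data) does not need to compromise the privacy of the remote data owner in order to score their data. Conversely, the remote data owner learns nothing about the scoring model (or its training data), because the model itself is encrypted.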

We assume some familiarity with the Paillier cryptosystem and with logistic regression. This example is inspired by a blog post by @iamtrask.

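We use a subset of the Enron spam dataset. Alice trains a spam classifier on emails she owns. She wants to apply it to Bob's personal emails without:

asking Bob to send his emails anywhere;
leaking information about the model she learned or the dataset she trained on;
letting Bob know which of his emails are spam.

The complete code is available on GitHub.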

First, we make the necessary imports and wrap the code that downloads and prepares the data.

import time
import os.path
from zipfile import ZipFile
from urllib.request import urlopen
from contextlib import contextmanager

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

import phe as paillier

np.random.seed(42)

# Enron spam dataset hosted by https://cloudstor.aarnet.edu.au
url = [
    'https://cloudstor.aarnet.edu.au/plus/index.php/s/RpHZ57z2E3BTiSQ/download',
    'https://cloudstor.aarnet.edu.au/plus/index.php/s/QVD4Xk5Cz3UVYLp/download'
]


def download_data():
    """Download two sets of Enron1 spam/ham e-mails if they are not here
    We will use the first as trainset and the second as testset.
    Return the path prefix to us to load the data from disk."""

    n_datasets = 2
    for d in range(1, n_datasets + 1):
        if not os.path.isdir('enron%d' % d):

            URL = url[d-1]
            print("Downloading %d/%d: %s" % (d, n_datasets, URL))
            folderzip = 'enron%d.zip' % d

            with urlopen(URL) as remotedata:
                with open(folderzip, 'wb') as z:
                    z.write(remotedata.read())

            with ZipFile(folderzip) as z:
                z.extractall()
            os.remove(folderzip)

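For simplicity, each email is represented as a vector over a restricted vocabulary, where each feature value counts how many times a word occurs in the email. We use a CountVectorizer for this.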
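To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of the counting idea. It is not how CountVectorizer is implemented: the real class also applies its own tokenization, stop-word removal, and the min_df frequency filter used in this article. The helper names `fit_vocabulary` and `to_count_vector` and the two sample sentences are made up for illustration.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_count_vector(doc, vocab):
    """Return one count per vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for tok, c in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = c
    return vec


# Hypothetical two-document corpus
docs = ["win money now win", "meeting schedule now"]
vocab = fit_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

print(vocab)
print(vectors)
```

Each row has one entry per vocabulary word, and most entries are zero for any single email. That sparsity matters later, when Bob scores his emails under encryption.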

def preprocess_data():
    """
    Get the Enron e-mails from disk.
    Represent them as bag-of-words.
    Shuffle and split train/test.
    """

    print("Importing dataset from disk...")
    path = 'enron1/ham/'
    ham1 = [open(path + f, 'r', errors='replace').read().strip(r"\n")
            for f in os.listdir(path) if os.path.isfile(path + f)]
    path = 'enron1/spam/'
    spam1 = [open(path + f, 'r', errors='replace').read().strip(r"\n")
             for f in os.listdir(path) if os.path.isfile(path + f)]
    path = 'enron2/ham/'
    ham2 = [open(path + f, 'r', errors='replace').read().strip(r"\n")
            for f in os.listdir(path) if os.path.isfile(path + f)]
    path = 'enron2/spam/'
    spam2 = [open(path + f, 'r', errors='replace').read().strip(r"\n")
             for f in os.listdir(path) if os.path.isfile(path + f)]

    # Merge and create labels
    emails = ham1 + spam1 + ham2 + spam2
    y = np.array([-1] * len(ham1) + [1] * len(spam1) +
                 [-1] * len(ham2) + [1] * len(spam2))

    # Words count, keep only frequent words
    count_vect = CountVectorizer(decode_error='replace', stop_words='english',
                                 min_df=0.001)
    X = count_vect.fit_transform(emails)

    print('Vocabulary size: %d' % X.shape[1])

    # Shuffle
    perm = np.random.permutation(X.shape[0])
    X, y = X[perm, :], y[perm]

    # Split train and test
    split = 500
    X_train, X_test = X[-split:, :], X[:-split, :]
    y_train, y_test = y[-split:], y[:-split]

    print("Labels in trainset are {:.2f} spam : {:.2f} ham".format(
        np.mean(y_train == 1), np.mean(y_train == -1)))
    return X_train, y_train, X_test, y_test

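The scheme works as follows. Alice trains a logistic regression spam classifier on the data she owns. After training, she generates a public/private key pair with the Paillier cryptosystem and encrypts the model with the public key. The public key and the encrypted model are sent to Bob. Bob applies the encrypted model to his own data and obtains an encrypted score for each email. Bob sends these encrypted scores to Alice. Alice decrypts them with the private key to obtain the spam vs. not-spam predictions.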

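The protocol satisfies the three conditions above. In particular, Bob only sees the encrypted model and the encrypted scores, and cannot learn anything from them without the private key.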
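To see why Bob can score data with a model he cannot read, it helps to recall that Paillier is additively homomorphic: multiplying two ciphertexts adds the underlying plaintexts, and raising a ciphertext to a plaintext power multiplies the plaintext by that number. The block below is a toy pure-Python sketch of that property, not the python-paillier implementation. The primes are far too small to be secure, the weights are assumed to be integers (the library handles floats through its own encoding), and the names `encrypt`, `decrypt`, `add`, and `scale` are illustrative only.

```python
import math
import random

# Toy key material: two Mersenne primes. Far too small for real security.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                    # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAM, -1, N)         # private: valid for generator g = n + 1 (Python 3.8+)


def encrypt(m):
    """Encrypt integer m with the public key; negatives are stored modulo N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return ((1 + (m % N) * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private key; values above N/2 are read as negatives."""
    m = ((pow(c, LAM, N2) - 1) // N) * MU % N
    return m - N if m > N // 2 else m


def add(c1, c2):
    """Ciphertext of (m1 + m2), computed without any key."""
    return (c1 * c2) % N2


def scale(c, k):
    """Ciphertext of (k * m) for a plaintext integer k, computed without any key."""
    return pow(c, k % N, N2)


# Alice encrypts an integer model (hypothetical values).
weights, intercept = [3, -2, 5], -4
enc_weights = [encrypt(w) for w in weights]
enc_intercept = encrypt(intercept)

# Bob scores his plaintext feature vector using only ciphertext operations.
x = [1, 0, 2]
enc_score = enc_intercept
for w_enc, x_i in zip(enc_weights, x):
    if x_i != 0:
        enc_score = add(enc_score, scale(w_enc, x_i))

# Only Alice can read the result.
score = decrypt(enc_score)
print(score)
```

Bob never calls `decrypt` and never sees `LAM` or `MU`; the dot product is assembled entirely from `add` and `scale`, which is the same pattern the `+=` and `*` operators follow on encrypted numbers in the article's code.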

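Now for the implementation. Alice needs to be able to train a logistic regression on plaintext data, encrypt the model for remote use, and decrypt encrypted scores with her private key.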

class Alice:

    def __init__(self):
        self.model = LogisticRegression()

    def generate_paillier_keypair(self, n_length):
        self.pubkey, self.privkey = \
            paillier.generate_paillier_keypair(n_length=n_length)

    def fit(self, X, y):
        self.model = self.model.fit(X, y)

    def predict(self, X):
        return self.model.predict(X)

    def encrypt_weights(self):
        coef = self.model.coef_[0, :]
        encrypted_weights = [self.pubkey.encrypt(coef[i])
                             for i in range(coef.shape[0])]
        encrypted_intercept = self.pubkey.encrypt(self.model.intercept_[0])
        return encrypted_weights, encrypted_intercept

    def decrypt_scores(self, encrypted_scores):
        return [self.privkey.decrypt(s) for s in encrypted_scores]

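Bob receives the encrypted model and the public key. He must be able to score his local plaintext data with the encrypted model, but he cannot decrypt the scores without the private key, which only Alice holds.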

class Bob:

    def __init__(self, pubkey):
        self.pubkey = pubkey

    def set_weights(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def encrypted_score(self, x):
        """Compute the score of `x` by multiplying with the encrypted model,
        which is a vector of `paillier.EncryptedNumber`"""
        score = self.intercept
        _, idx = x.nonzero()
        for i in idx:
            score += x[0, i] * self.weights[i]
        return score

    def encrypted_evaluate(self, X):
        return [self.encrypted_score(X[i, :]) for i in range(X.shape[0])]
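Bob's scoring loop only visits the nonzero entries of each email vector. The sketch below uses plain floats instead of encrypted numbers to show that skipping zero features gives the same score as a full dot product while doing far fewer multiplications. The weights, the sample vector, and the dict-based sparse row are hypothetical stand-ins, not the SciPy sparse matrix used in the article.

```python
def dense_score(weights, intercept, x_dense):
    """Score using every feature, including the zeros."""
    return intercept + sum(w * v for w, v in zip(weights, x_dense))


def sparse_score(weights, intercept, x_sparse):
    """Score using only nonzero features, stored as {column index: count}."""
    score = intercept
    for i, count in x_sparse.items():
        score += count * weights[i]
    return score


# Hypothetical model and one email with two nonzero word counts
weights = [0.5, -1.0, 2.0, 0.0, -0.25, 1.5]
intercept = -0.5
x_dense = [0, 2, 0, 0, 4, 0]
x_sparse = {i: v for i, v in enumerate(x_dense) if v != 0}

full = dense_score(weights, intercept, x_dense)
fast = sparse_score(weights, intercept, x_sparse)
print(full, fast, len(x_sparse), "of", len(x_dense), "features used")
```

With a vocabulary of roughly 8,000 words and each homomorphic multiplication costing a modular exponentiation, restricting the loop to the words that actually occur in an email is what keeps encrypted scoring practical.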

Let's see the script in action. We first get the data in order and check the dimensionality of the problem:

download_data()
X, y, X_test, y_test = preprocess_data()
X.shape
(500, 7994)

We are dealing with about 8,000 features. Next, we instantiate Alice, who generates a key pair and fits her logistic model on her local data.

alice = Alice()
alice.generate_paillier_keypair(n_length=1024)
alice.fit(X, y)

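No encryption has been performed yet. Let's see what the error of Alice's classifier would be if she had access to Bob's raw (unencrypted) data. Of course, this cannot be known in a real scenario, since Bob's data is not available to her.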

np.mean(alice.predict(X_test) != y_test)
0.045683350745559882

Now Alice encrypts the classifier.

encrypted_weights, encrypted_intercept = alice.encrypt_weights()

We instantiate Bob with Alice's public key. Bob scores his data using the encrypted classifier.

bob = Bob(alice.pubkey)
bob.set_weights(encrypted_weights, encrypted_intercept)
encrypted_scores = bob.encrypted_evaluate(X_test)

Let's see what one of the encrypted scores looks like.

print(encrypted_scores[0].ciphertext())
4975557101598019607333115657955782044002134197013151844631125970114580057948777697681679333578395930647500175104718976826465398554390717765586649503985800812276599674119580862642667636337378406851541955675614078001941547394030888287811317521894539431449722023192072949095429036555137484530752817765976765269293455734683337022787581827841503790798807907517815490376905382493360989832127082449724104557596689227300380104999472764265118788640333048806552912736240459059453425987302997946039793991525213509904102136530661457492688678688561944802008308534596837051863930132631396095952823207091622450117172795188329566587

Alice decrypts Bob's scores.

scores = alice.decrypt_scores(encrypted_scores)
scores[:5]
[-14.511058062671882, -9.188384491859484, -1.746647646814274, -16.91595050694431, -6.716934039494412]

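The sign of each score corresponds to the predicted class. As a sanity check, let's see what the error of this model is. Remember that Alice is not supposed to know this in practice, since she does not own Bob's ground-truth labels. The error is the same as above.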
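The link between a score's sign and the predicted class follows from the logistic function: the predicted spam probability exceeds 0.5 exactly when the linear score is positive. The short sketch below checks this on the five decrypted scores printed above. The helper names `sigmoid` and `label_from_score` are illustrative, and a score of exactly zero is assigned to the ham class here by assumption.

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def label_from_score(score):
    """+1 (spam) for a positive score, -1 (ham) otherwise."""
    return 1 if score > 0 else -1


# The five decrypted scores shown in the article
scores = [-14.511058062671882, -9.188384491859484, -1.746647646814274,
          -16.91595050694431, -6.716934039494412]

labels = [label_from_score(s) for s in scores]
probs = [sigmoid(s) for s in scores]

for s, p, l in zip(scores, probs, labels):
    print("score=%.3f  P(spam)=%.6f  label=%d" % (s, p, l))
```

All five scores are negative, so these emails fall on the ham side of the decision boundary. Alice therefore needs nothing more than the decrypted scores to recover the same labels the plaintext classifier would produce.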

np.mean(np.sign(scores) != y_test)
0.045683350745559882

The complete code for this second example is available here (link). When run, it prints timing information for each step of the protocol.

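You may ask: can this protocol and the one from the previous post be combined? Indeed they can, apart from the fact that the former performs classification and the latter regression. In principle, you could set up a federated learning scheme in which the model trained by the clients is deployed remotely in encrypted form, and the predictions are then sent back to the clients.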


