Building a Shared GPU Server with LXD Containers

Category: IT Technology · Published: 4 years ago

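The biggest pain point of a GPU server shared by many people is that everyone wants root privileges, and everyone is overconfident.

Since taking over server administration in our lab, I have been running LXD containers as the virtualization layer for more than a year. The machines are mostly 4-card Titan Xp or 4-card 2080 Ti boxes, shared by nearly 30 people. Overall, the LXD setup has been stable and convenient, and together with a set of scripts it takes most of the routine work off the administrator.

A Google search for LXD+GPU turns up plenty of material in both Chinese and English, so this article only sketches installation and configuration and concentrates on the challenges and solutions I ran into in different environments.

I have also published our user manual for reference: https://deserts.gitbook.io/gpu/manual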

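Installation and configuration

Host driver configuration

I normally use Ubuntu Server. Start by installing the NVIDIA GPU driver; CUDA is not required on the host.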

apt install git gcc g++ make cmake build-essential curl -y
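# Quote the pattern so the shell does not expand it against files in the current directory
apt-get remove --purge 'nvidia*' -y

# Blacklist the nouveau driver and disable the nouveau kernel module
# by adding the following lines to blacklist-nouveau.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf

# Then rebuild the initramfs
update-initramfs -u

# Make the driver .run file executable:
sudo chmod +x NVIDIA-Linux-x86_64-<version>.run
# The trailing option is essential and must not be omitted:
sudo ./NVIDIA-Linux-x86_64-<version>.run --no-opengl-files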
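
Before moving on it can help to confirm that nouveau is really gone and that the NVIDIA driver responds. The following is a minimal sketch, not part of my original scripts: the function names are made up for illustration, and it assumes the host has been rebooted after the blacklist and driver install.

```shell
# Return 0 if the nouveau module appears in a /proc/modules-style listing.
# The optional file argument exists only so the check can be tried on a sample file.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}

# Hypothetical helper: report the driver state on the host.
check_host_driver() {
    if nouveau_loaded; then
        echo "nouveau is still loaded; recheck the blacklist and reboot" >&2
        return 1
    fi
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the driver install may have failed" >&2
        return 1
    fi
    # List each card with its driver version
    nvidia-smi --query-gpu=index,name,driver_version --format=csv
}
```

Running `check_host_driver` as root should list every card in the machine; a non-zero exit status points at the step that needs another look.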

Install nvidia-container-runtime so that containers can use the host's GPU driver directly.

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update

apt install libnvidia-container-dev libnvidia-container-tools nvidia-container-runtime -y

Host LXD configuration

Install and configure ZFS and use it as the storage backend for LXD. Enable deduplication on the ZFS filesystem.

apt install zfsutils-linux

zpool create tank /dev/sda
zfs create tank/lxd
zfs set dedup=on tank/lxd
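
ZFS deduplication keeps its lookup table in memory, so it is worth checking now and then whether it pays off. A minimal sketch, assuming the pool and dataset names used above; the helper names and the 1.2 threshold are arbitrary choices for illustration, not recommendations.

```shell
# Print the dedup setting of the dataset and the pool-wide dedup ratio.
# Defaults follow the names used above (tank, tank/lxd).
show_dedup_status() {
    local pool="${1:-tank}"
    local dataset="${2:-$pool/lxd}"
    zfs get -H -o name,property,value dedup "$dataset" &&
    zpool get -H -o name,property,value dedupratio "$pool"
}

# Return 0 when a ratio such as "1.35x" is at least the threshold.
# The default threshold of 1.2 is an assumption made for this sketch.
dedup_worthwhile() {
    local ratio="${1%x}"
    awk -v r="$ratio" -v t="${2:-1.2}" 'BEGIN { exit !(r >= t) }'
}
```

`show_dedup_status` prints the two properties; the ratio it reports can be passed to `dedup_worthwhile`, for example `dedup_worthwhile 1.35x`.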

Install LXD

snap install lxd

Switch to a mirror and pull the image

lxc remote add tuna-images https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc image copy tuna-images:ubuntu/18.04 local: --alias ubuntu/18.04 --copy-aliases --public

Initialize LXD. Make sure the ZFS pool is the one created above; whether to use a network bridge depends on your environment.

lxd init

Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]: 
Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]:
Create a new ZFS pool? (yes/no) [default=yes]:
Would you like to use an existing block device? (yes/no) [default=no]:
Size in GB of the new loop device (1GB minimum) [default=100GB]: 
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]: no
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]: yes
Name of the existing bridge or host interface: br0
Would you like LXD to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:
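
The same answers can be supplied non-interactively through `lxd init --preseed`, which is handy when several hosts need identical settings. The sketch below is an assumption-laden example rather than my actual configuration: it points the storage pool at the `tank/lxd` dataset created earlier and attaches containers to the existing bridge `br0`; the function name `lxd_preseed` is hypothetical.

```shell
# Print a preseed document for `lxd init --preseed`.
# Arguments (illustrative defaults): ZFS dataset, LXD pool name, host bridge.
lxd_preseed() {
    local dataset="${1:-tank/lxd}" pool="${2:-default}" bridge="${3:-br0}"
    cat <<EOF
config: {}
networks: []
storage_pools:
- name: ${pool}
  driver: zfs
  config:
    source: ${dataset}
profiles:
- name: default
  devices:
    root:
      path: /
      pool: ${pool}
      type: disk
    eth0:
      name: eth0
      nictype: bridged
      parent: ${bridge}
      type: nic
EOF
}
```

Review the output of `lxd_preseed` first, then apply it with `lxd_preseed | lxd init --preseed`. LXD expects the dataset it takes over to be empty.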

lxc profile set default nvidia.runtime true
lxc profile device add default gpu gpu

Create a template container

lxc init ubuntu/18.04 template -p default
lxc start template
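
With `nvidia.runtime` enabled and the `gpu` device in the default profile, the template container should see the host's cards right away. A small sketch for checking this, with hypothetical helper names; it relies on `nvidia-smi -L` printing one `GPU <n>: ...` line per visible card.

```shell
# Count the "GPU <n>: ..." lines read from stdin.
count_gpu_lines() {
    grep -c '^GPU [0-9]'
}

# Run nvidia-smi inside a container (default: template) and count the cards it lists.
container_gpu_count() {
    lxc exec "${1:-template}" -- nvidia-smi -L | count_gpu_lines
}
```

On the 4-card machines described earlier, `container_gpu_count template` would be expected to print 4; a lower number suggests the runtime or the gpu device is not set up as intended.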

Enter the template container and install whatever software is needed, such as conda.

lxc exec template bash

When that is done, publish the template container as a template image and delete the template container.

sudo lxc stop template
sudo lxc publish template --alias template --public
sudo lxc rm template

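Automation scripts

With the shell scripts I provide, a container is created automatically whenever a user is created. Use add_user.s to create a new user together with a container; login.sh runs once the user logs in to the host.

Creating a user: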

ssh addu@172.26.xxx.xxx
# Password
addu@172.26.xxx.xxx's password:
=====Welcome!
We need to get sudo permission first. Enter the password for `addu`.
# Enter the password for addu to obtain sudo permission
[sudo] password for addu:
=====Let's setup a new account and create a container now.
# Enter a username; the user and its container are then created automatically
Enter your username: test
Creating user...
Allocating container for test...
Creating test
Allocating ssh port... 10020
Device sshproxy added to test
# Set the user's password
set password for test now (host only).
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Login this host via `ssh <username>@<host-ip>` to manage your container.
Done!
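
The transcript hints at what the script does: create a host account, create a container from the template image, pick an SSH port, and add a proxy device named `sshproxy`. The following is a rough sketch of those steps under stated assumptions, not the real script: both function names are hypothetical, the base port 10000 is a guess (the transcript only shows that 10020 was handed out), and the list of used ports is assumed to come from wherever the script records them.

```shell
# Pick the first port at or above a base that is not in the list read from stdin
# (one port number per line).
next_free_port() {
    local port="${1:-10000}" used
    used="$(cat)"
    while printf '%s\n' "$used" | grep -qx "$port"; do
        port=$((port + 1))
    done
    echo "$port"
}

# Hypothetical core of an add-user script: host account, container, SSH proxy.
# Must run as root; the container is created but not started.
create_user_container() {
    local user="$1" port="$2"
    useradd -m "$user" &&
    lxc init template "$user" -p default &&
    lxc config device add "$user" sshproxy proxy \
        listen="tcp:0.0.0.0:${port}" connect="tcp:127.0.0.1:22"
}
```

For example, `printf '10000\n10001\n' | next_free_port 10000` yields the port to pass to `create_user_container`. The proxy device forwards that host port to port 22 inside the container, so users can reach their container over SSH directly.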

User login:

# Log in as the newly created user and manage the container
ssh test@172.26.xxx.xxx
test@172.26.xxx.xxx's password:
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-54-generic x86_64)
……
 Hi, test
 You're using the GPU Server in Vision Group.

==========About your container:
Your container is not running.
Transfer data to your container using scp or sftp;
File sharing is encouraged, access datasets at shared/datasets, access download files at shared/downloads, etc

See GPU load: nvidia-smi.
    memory usage: free -h.
    disk usage: df -h.

===== main menu  =====
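[1] start your container  # power on
[2] enter your container  # switch into the container
[3] stop your container   # shut down (you can also run shutdown now inside the container)
[4] change your password  # change the host password (for the container password, enter the container and run passwd)
[5] allocate ports        # set up port forwarding
[6] release ports         # release the ports you allocated
[0] show info             # show the container's running state
[x] exit                  # leave the management menu
# Start the container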
Enter your choice: 1
========== Starting your container...

Press any key to continue...
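
A menu like this boils down to mapping each choice onto an `lxc` command for the user's own container. The sketch below covers only the start, enter, stop, and info entries; the mapping and the function name are assumptions for illustration, and how the real login.sh grants the necessary privileges is not shown here.

```shell
# Map a menu choice to the lxc command it would run for a given container.
# Prints the command instead of running it, so the mapping is easy to inspect.
menu_command() {
    local choice="$1" name="$2"
    case "$choice" in
        1) echo "lxc start $name" ;;
        2) echo "lxc exec $name -- bash" ;;
        3) echo "lxc stop $name" ;;
        0) echo "lxc info $name" ;;
        *) return 1 ;;   # choices not covered by this sketch
    esac
}
```

For instance, `menu_command 1 test` prints `lxc start test`. Keeping the mapping in one function makes it straightforward to restrict each user to a fixed set of operations on a single container.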
