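Zabbix raised an alert: the journal SSD on one Ceph node has used more than 96% of its rated write endurance. According to Intel, once the write endurance reaches the rated value the drive can no longer be written to normally. PercentageUsed : 97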
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50
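The wear check behind the alert can be reproduced by hand. The following is a minimal sketch, assuming the `Key : Value` layout shown above; the function names and the threshold of 90 are illustrative choices, not something taken from the Zabbix setup.

```shell
# Extract the PercentageUsed value from `isdct show -sensor` output on stdin.
percentage_used() {
    awk -F':' '/^[ \t]*PercentageUsed[ \t]*:/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the number
        print $2
        exit
    }'
}

# Succeed (exit 0) when wear is at or above the limit.
# The default limit of 90 is a hypothetical value; pick one that fits your policy.
wear_exceeds() {
    local used="$1"
    local limit="${2:-90}"
    [ "$used" -ge "$limit" ]
}

# Usage on the node (not executed here):
#   used=$(isdct show -sensor | percentage_used)
#   wear_exceeds "$used" 90 && echo "journal SSD wear ${used}%: plan a replacement"
```

Keeping the parsing separate from the threshold test makes it easy to feed the same value to a monitoring item.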
Twelve OSDs use this drive for their journals.
[root@ceph-11 ~]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0   5.5T  0 disk
└─sda1          8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb             8:16   0   5.5T  0 disk
└─sdb1          8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc             8:32   0   5.5T  0 disk
└─sdc1          8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd             8:48   0   5.5T  0 disk
└─sdd1          8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde             8:64   0   5.5T  0 disk
└─sde1          8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf             8:80   0   5.5T  0 disk
└─sdf1          8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg             8:96   0   5.5T  0 disk
└─sdg1          8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh             8:112  0   5.5T  0 disk
└─sdh1          8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi             8:128  0   5.5T  0 disk
└─sdi1          8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj             8:144  0   5.5T  0 disk
└─sdj1          8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk             8:160  0   5.5T  0 disk
└─sdk1          8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl             8:176  0   5.5T  0 disk
└─sdl1          8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm             8:192  0 419.2G  0 disk
└─sdm1          8:193  0 419.2G  0 part /
nvme0n1       259:0    0 372.6G  0 disk
├─nvme0n1p1   259:1    0    30G  0 part
├─nvme0n1p2   259:2    0    30G  0 part
├─nvme0n1p3   259:3    0    30G  0 part
├─nvme0n1p4   259:4    0    30G  0 part
├─nvme0n1p5   259:5    0    30G  0 part
├─nvme0n1p6   259:6    0    30G  0 part
├─nvme0n1p7   259:7    0    30G  0 part
├─nvme0n1p8   259:8    0    30G  0 part
├─nvme0n1p9   259:9    0    30G  0 part
├─nvme0n1p10  259:10   0    30G  0 part
├─nvme0n1p11  259:11   0    30G  0 part
└─nvme0n1p12  259:12   0    30G  0 part
[root@ceph-11 ~]#
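lsblk shows the twelve journal partitions but not which OSD owns which one. On a FileStore OSD that mapping lives in the `journal` symlink inside each OSD data directory. The helper below is a sketch; the function name is made up for illustration and the base directory is a parameter so it can be tried against a scratch tree.

```shell
# Print "<osd-id> <journal symlink target>" for every OSD directory under the
# given base directory (default: /var/lib/ceph/osd).
list_osd_journals() {
    local base="${1:-/var/lib/ceph/osd}"
    local dir
    for dir in "$base"/ceph-*; do
        # Skip directories without a journal symlink (and the unmatched glob).
        [ -L "$dir/journal" ] || continue
        printf '%s %s\n' "${dir##*/ceph-}" "$(readlink "$dir/journal")"
    done
}

# Usage on the node (not executed here):
#   list_osd_journals
```

Recording this output before the old SSD is removed gives a reference to compare against once the journals have been rebuilt.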
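1. Lower the OSD priority
Most hardware faults require powering the machine off. To keep that invisible to users, we first lower the priority of the node we are about to work on. Check the Ceph version first: this cluster runs 10.x (Jewel) and has primary-affinity support enabled. A client I/O request goes to the primary OSD of a PG first and is then written to the other replicas. So we look up the OSDs that belong to host ceph-11 and set their primary-affinity to 0, which means the PGs on those OSDs should not become primary unless the other replicas are down.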
-12  65.47299     host ceph-11
 87   5.45599         osd.87      up  1.00000          0.89999
 88   5.45599         osd.88      up  0.79999          0.29999
 89   5.45599         osd.89      up  1.00000          0.89999
 90   5.45599         osd.90      up  1.00000          0.89999
 91   5.45599         osd.91      up  1.00000          0.89999
 92   5.45599         osd.92      up  1.00000          0.79999
 93   5.45599         osd.93      up  1.00000          0.89999
 94   5.45599         osd.94      up  1.00000          0.89999
 95   5.45599         osd.95      up  1.00000          0.89999
 96   5.45599         osd.96      up  1.00000          0.89999
 97   5.45599         osd.97      up  1.00000          0.89999
 98   5.45599         osd.98      up  0.89999          0.89999
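The OSDs on this host do not all share one primary-affinity value (osd.88 is at 0.29999 and osd.92 at 0.79999), so it may be worth saving the current values before overwriting them. The sketch below turns `ceph osd tree` output into restore commands; it assumes the column order shown above (ID, WEIGHT, NAME, UP/DOWN, REWEIGHT, PRIMARY-AFFINITY), and the function name and output path are hypothetical.

```shell
# Read `ceph osd tree` output on stdin and print one restore command per OSD.
# Only rows whose third column looks like "osd.<n>" are used; the last column
# is taken as the primary-affinity value.
affinity_restore_commands() {
    awk '$3 ~ /^osd\.[0-9]+$/ {
        printf "ceph osd primary-affinity %s %s\n", $1, $NF
    }'
}

# Usage on the node (not executed here); the output path is only an example:
#   ceph osd tree | affinity_restore_commands > /root/affinity-restore.sh
```

The generated file covers every OSD in the tree, so trim it to the host being serviced before running it.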
Set the primary affinity of osd.87 through osd.98 to 0:
for osd in {87..98}; do ceph osd primary-affinity "$osd" 0; done
ceph osd tree now shows the new setting on these OSDs:
-12  65.47299     host ceph-11
 87   5.45599         osd.87      up  1.00000                0
 88   5.45599         osd.88      up  0.79999                0
 89   5.45599         osd.89      up  1.00000                0
 90   5.45599         osd.90      up  1.00000                0
 91   5.45599         osd.91      up  1.00000                0
 92   5.45599         osd.92      up  1.00000                0
 93   5.45599         osd.93      up  1.00000                0
 94   5.45599         osd.94      up  1.00000                0
 95   5.45599         osd.95      up  1.00000                0
 96   5.45599         osd.96      up  1.00000                0
 97   5.45599         osd.97      up  1.00000                0
 98   5.45599         osd.98      up  0.89999                0
2. Prevent OSDs from being marked out
ceph osd set noout
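By default, an OSD that stays unresponsive for long enough is automatically marked out of the cluster, which triggers data migration. Powering off to replace the SSD takes a while, so to avoid data being moved back and forth for nothing we temporarily stop the cluster from marking OSDs out. Use ceph -s to confirm the setting took effect: the cluster health changes to HEALTH_WARN with a note that the noout flag is set, and the flags line gains an extra entry.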
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            noout flag(s) set
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73511: 111 osds: 108 up, 108 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v85913863: 5064 pgs, 24 pools, 89164 GB data, 12450 kobjects
            261 TB used, 141 TB / 403 TB avail
                5060 active+clean
                   4 active+clean+scrubbing+deep
  client io 27608 kB/s rd, 59577 kB/s wr, 399 op/s rd, 668 op/s wr
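Rather than reading the status by eye, the flag can be checked from a script before the machine is powered off. This is a sketch that assumes the Jewel-style `flags a,b,c` line shown above; the function name is illustrative.

```shell
# Succeed (exit 0) when the given flag appears in the "flags ..." line of
# `ceph -s` output read from stdin.
osdmap_has_flag() {
    local flag="$1"
    awk -v want="$flag" '
        $1 == "flags" {
            n = split($2, f, ",")
            for (i = 1; i <= n; i++) if (f[i] == want) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Usage on the node (not executed here):
#   ceph -s | osdmap_has_flag noout || { echo "noout is not set, aborting"; exit 1; }
```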
3. Check whether the PGs have switched primaries
[root@ceph-11 ~]# ceph pg ls | grep "\[9[1-8],"
13.24 5066 0 0 0 0 41480507922 3071 3071 active+clean 2019-07-02 19:33:37.537802 73497'120563162 73511:110960694 [94,25,64] 94 [94,25,64] 94 73497'120562718 2019-07-02 19:33:37.537761 73294'120561198 2019-07-01 18:11:54.686413
13.10f 4874 0 0 0 0 39967832064 3083 3083 active+clean 2019-07-01 23:56:13.911259 73511'59603193 73511:52739094 [91,44,38] 91 [91,44,38] 91 73302'59589396 2019-07-01 23:56:13.911226 69213'5954576 2019-06-26 22:58:12.864475
13.17d 5001 0 0 0 0 40919228578 3088 3088 active+clean 2019-07-02 13:51:04.162137 73511'34680543 73511:26095334 [96,45,72] 96 [96,45,72] 96 73497'34678725 2019-07-02 13:51:04.162089 70393'3467604 2019-07-01 08:47:58.771910
13.20d 4872 0 0 0 0 40007166482 3036 3036 active+clean 2019-07-03 07:40:28.677097 73511'27811217 73511:22372286 [93,85,73] 93 [93,85,73] 93 73497'27809831 2019-07-03 07:40:28.677059 73302'2779662 2019-07-01 23:15:14.731237
13.214 5006 0 0 0 0 40940654592 3079 3079 active+clean 2019-07-02 21:10:51.094829 73511'34400529 73511:27161705 [94,61,53] 94 [94,61,53] 94 73497'34398612 2019-07-02 21:10:51.094784 73294'3439396 2019-07-01 18:54:06.249357
13.2fd 4950 0 0 0 0 40522633728 3086 3086 active+clean 2019-07-02 06:36:14.763435 73511'149011011 73511:136693896 [91,58,36] 91 [91,58,36] 91 73497'148963815 2019-07-02 06:36:14.763383 73497'148963815 2019-07-02 06:36:14.763383
13.3ae 4989 0 0 0 0 40879544320 3055 3055 active+clean 2019-07-02 00:30:44.817062 73511'67827999 73511:60578765 [91,54,25] 91 [91,54,25] 91 73302'67806651 2019-07-02 00:30:44.817017 69213'67776352
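The primary PGs have not moved. We leave it at that: noout is already set and the pools keep three replicas, so when this machine is shut down Ceph serves I/O from the remaining replicas and no data migration takes place.
A cluster that stores three copies can lose any two hosts without losing data (assuming the replicas are placed on different hosts), although PGs that are left with a single copy stop serving I/O if the pool's min_size is 2. Make sure the number of nodes that are powered off stays within that limit, or you risk a much larger outage.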
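Note that the grep pattern `\[9[1-8],` only matches PGs whose up set starts with osd.91 through osd.98, so osd.87 through osd.90 are never checked. A sketch of a range-based filter follows. It assumes the `ceph pg ls` layout seen above, in which the second bracketed list is the acting set and the field after it is the acting primary; the function name is made up for illustration.

```shell
# Read `ceph pg ls` output on stdin and print "<pgid> <acting primary>" for
# every PG whose acting primary lies in the inclusive range lo..hi.
pgs_with_primary_in_range() {
    local lo="$1"
    local hi="$2"
    awk -v lo="$lo" -v hi="$hi" '
        {
            seen = 0
            for (i = 1; i < NF; i++) {
                if ($i ~ /^\[[0-9,]+\]$/) {
                    seen++
                    if (seen == 2) {          # second list = acting set
                        p = $(i + 1) + 0      # next field = acting primary
                        if (p >= lo && p <= hi) print $1, p
                        break
                    }
                }
            }
        }'
}

# Usage on the node (not executed here):
#   ceph pg ls | pgs_with_primary_in_range 87 98
```

An empty result would mean that no PG still has its primary on the host.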
4. Stop the services, power off the server, and replace the SSD
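The post does not show the commands used for this step. As a sketch only: on a systemd-based Jewel install the OSD units are usually named `ceph-osd@<id>` (an assumption about this host), and `ceph-osd --flush-journal` writes any pending journal entries to the data disk while the old SSD is still readable. In the same spirit as the generator script used later in this post, the helper below only prints the commands so they can be reviewed first; its name is hypothetical.

```shell
# Print (do not run) the shutdown commands for OSD ids first..last:
# stop each daemon, then flush its journal to the data disk.
print_stop_commands() {
    local id="$1"
    local last="$2"
    while [ "$id" -le "$last" ]; do
        echo "systemctl stop ceph-osd@$id"
        echo "ceph-osd -i $id --flush-journal"
        id=$((id + 1))
    done
}

# Usage (review the output before running any of it):
#   print_stop_commands 87 98
```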
The newly installed SSD shows zero wear: PercentageUsed : 0
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x063F3
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x00
CrcErrorCount : 0
AverageNandEraseCycles : 7
MediaErrors : 0x00
PowerCycles : 0x012
ProgramFailCount : 0
MaxNandEraseCycles : 10
HighestLifetimeTemperature : 48
PercentageUsed : 0
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 6
LowestLifetimeTemperature : 16
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 48
5. The newly inserted drive shows up as nvme0n1
[root@ceph-11 ~]# lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0   5.5T  0 disk
└─sda1     8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb        8:16   0   5.5T  0 disk
└─sdb1     8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc        8:32   0   5.5T  0 disk
└─sdc1     8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd        8:48   0   5.5T  0 disk
└─sdd1     8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde        8:64   0   5.5T  0 disk
└─sde1     8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf        8:80   0   5.5T  0 disk
└─sdf1     8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg        8:96   0   5.5T  0 disk
└─sdg1     8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh        8:112  0   5.5T  0 disk
└─sdh1     8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi        8:128  0   5.5T  0 disk
└─sdi1     8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj        8:144  0   5.5T  0 disk
└─sdj1     8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk        8:160  0   5.5T  0 disk
└─sdk1     8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl        8:176  0   5.5T  0 disk
└─sdl1     8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm        8:192  0 419.2G  0 disk
└─sdm1     8:193  0 419.2G  0 part /
nvme0n1  259:0    0 372.6G  0 disk
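6. Rebuild the journals
Because the old journal device is gone, the OSDs cannot start after boot. The journals have to be recreated. The script below does the partitioning and generates the script that is finally run.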
#!/bin/bash
desc="create ceph journal part for specified osd."
type_journal_uuid=45b0969e-9b03-4f30-b4c6-b4b80ceff106
sgdisk=sgdisk
journal_size=30G          # size of each journal partition
journal_dev=/dev/nvme0n1  # device name of the new SSD
sleep=5

# grep "" over several files prints "<file>:<content>", one line per OSD
osd_uuids=$(grep "" /var/lib/ceph/osd/ceph-*/journal_uuid 2>/dev/null)

die(){ echo >&2 "$@"; exit 1; }
tip(){ printf >&2 "%b" "$@"; }

[ "$osd_uuids" ] || die "no osd uuid found."
echo "osd journal uuid:"
echo "$osd_uuids"
echo "now sleep $sleep"
sleep $sleep

journal_script="/dev/shm/ceph-journal.sh"
echo '#!/bin/bash' > "$journal_script"
echo "ls -l ${journal_dev}p*" >> "$journal_script"
echo "sleep 5" >> "$journal_script"

# The partition positions have to be worked out first; only then can the
# partition name, UUID and so on be set successfully.
IFS=": "
d=0
while read osd_path uuid; do
    [ "$osd_path" ] || continue
    d=$((d + 1))   # partition number for this OSD
    osd_id=${osd_path#/var/lib/ceph/osd/ceph-}
    osd_id=${osd_id%/journal_uuid}
    journal_link=${osd_path%_uuid}
    [ "${osd_id:-1}" -ge 0 ] || { echo "invalid osd id: $osd_id."; exit 11; }
    tip "create journal for osd $osd_id ... "
    $sgdisk --mbrtogpt --new=$d:0:+"$journal_size" \
        --change-name=$d:'ceph journal' \
        --typecode=$d:"$type_journal_uuid" \
        --partition-guid=$d:"$uuid" \
        "$journal_dev" || exit 1
    tip "part done.\n"
    # Point the OSD's journal symlink at the new partition by its UUID
    ln -sfT /dev/disk/by-partuuid/"$uuid" "$journal_link" || exit 3
    echo "ceph-osd --mkjournal --osd-journal ${journal_dev}p$d -i $osd_id" >> "$journal_script"
    sleep 1
done << EOF
$osd_uuids
EOF
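The script above only generates the script that is finally run. Its default path is
/dev/shm/ceph-journal.sh
Review its contents by hand and make sure every operation is correct before running it manually as root:
[root@ceph-11 ~]# bash /dev/shm/ceph-journal.sh
Contents of the generated script: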
[root@ceph-11 ~]# cat /dev/shm/ceph-journal.sh
#!/bin/bash
ls -l /dev/nvme0n1p*
sleep 5
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p1 -i 87
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p2 -i 88
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p3 -i 89
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p4 -i 90
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p5 -i 91
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p6 -i 92
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p7 -i 93
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p8 -i 94
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p9 -i 95
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p10 -i 96
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p11 -i 97
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p12 -i 98
[root@ceph-11 ~]#
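The generator relies on two details that are easy to miss: `grep ""` over several files prints each line as `<file>:<content>`, and shell parameter expansion then strips the fixed path prefix and suffix to recover the OSD id. The sketch below isolates that parsing step; the function name and the UUID in the usage line are made up for illustration.

```shell
# Split one "<journal_uuid file>:<uuid>" line into "<osd-id> <uuid>".
parse_journal_uuid_line() {
    local line="$1"
    local path="${line%%:*}"    # everything before the first colon
    local uuid="${line#*:}"     # everything after the first colon
    local id="${path#/var/lib/ceph/osd/ceph-}"
    id="${id%/journal_uuid}"
    printf '%s %s\n' "$id" "$uuid"
}

# Usage with a made-up UUID:
#   parse_journal_uuid_line '/var/lib/ceph/osd/ceph-87/journal_uuid:11111111-2222-3333-4444-555555555555'
#   prints: 87 11111111-2222-3333-4444-555555555555
```

Because each new partition is created with the GUID stored in `journal_uuid`, the OSD's journal symlink under /dev/disk/by-partuuid can be pointed at it without changing any other OSD metadata.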
7. Journal replacement complete: verify and restore services
The OSD services are back up:
[root@ceph-11 ~]# ceph osd tree
ID     WEIGHT    TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10008         0 root sas6t3
-10007         0 root sas6t2
-10006 130.94598 root sas6t1
   -12  65.47299     host ceph-11
    87   5.45599         osd.87         up  1.00000                0
    88   5.45599         osd.88         up  0.79999                0
    89   5.45599         osd.89         up  1.00000                0
    90   5.45599         osd.90         up  1.00000                0
    91   5.45599         osd.91         up  1.00000                0
    92   5.45599         osd.92         up  1.00000                0
    93   5.45599         osd.93         up  1.00000                0
    94   5.45599         osd.94         up  1.00000                0
    95   5.45599         osd.95         up  1.00000                0
    96   5.45599         osd.96         up  1.00000                0
    97   5.45599         osd.97         up  1.00000                0
    98   5.45599         osd.98         up  0.89999                0
Restore the OSD flag. Everything that was changed during the intervention has to be reverted.
ceph osd unset noout
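Restore the OSD primary affinity (here every OSD is set to 0.8 rather than to its individual value from step 1):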
[root@ceph-11 ~]# for osd in {87..98}; do ceph osd primary-affinity "$osd" 0.8; done
set osd.87 primary-affinity to 0.8 (8524282)
set osd.88 primary-affinity to 0.8 (8524282)
set osd.89 primary-affinity to 0.8 (8524282)
set osd.90 primary-affinity to 0.8 (8524282)
set osd.91 primary-affinity to 0.8 (8524282)
set osd.92 primary-affinity to 0.8 (8524282)
set osd.93 primary-affinity to 0.8 (8524282)
set osd.94 primary-affinity to 0.8 (8524282)
set osd.95 primary-affinity to 0.8 (8524282)
set osd.96 primary-affinity to 0.8 (8524282)
set osd.97 primary-affinity to 0.8 (8524282)
set osd.98 primary-affinity to 0.8 (8524282)
[root@ceph-11 ~]#
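Wait for the cluster to recover
Wait for the cluster's automatic recovery to bring it back to HEALTH_OK.
If HEALTH_ERR shows up in the meantime, follow up promptly and search for the specific error message.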
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            12 pgs degraded
            2 pgs recovering
            10 pgs recovery_wait
            12 pgs stuck unclean
            recovery 116/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918476: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            116/38259009 objects degraded (0.000%)
                5049 active+clean
                  10 active+recovery_wait+degraded
                   3 active+clean+scrubbing+deep
                   2 active+recovering+degraded
recovery io 22105 kB/s, 4 objects/s
  client io 55017 kB/s rd, 77280 kB/s wr, 944 op/s rd, 590 op/s wr
[root@ceph-11 ~]#
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            1 pgs degraded
            1 pgs recovering
            1 pgs stuck unclean
            recovery 2/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918493: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            2/38259009 objects degraded (0.000%)
                5060 active+clean
                   3 active+clean+scrubbing+deep
                   1 active+recovering+degraded
  client io 81789 kB/s rd, 245 MB/s wr, 1441 op/s rd, 651 op/s wr
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918494: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
                5061 active+clean
                   3 active+clean+scrubbing+deep
recovery io 7388 kB/s, 0 objects/s
  client io 67551 kB/s rd, 209 MB/s wr, 1153 op/s rd, 901 op/s wr
[root@ceph-11 ~]#
The cluster is back to normal.
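The repeated `ceph -s` checks can be automated with a small polling loop. This is a sketch built on `ceph health`, whose output starts with HEALTH_OK, HEALTH_WARN or HEALTH_ERR. The function name, the default interval and the retry count are arbitrary choices, and the health command is a parameter so the loop can be tried without a cluster.

```shell
# Poll the cluster health until it reports HEALTH_OK.
# Returns 0 on HEALTH_OK, 2 as soon as HEALTH_ERR is seen, 1 on timeout.
wait_for_health_ok() {
    local interval="${1:-30}"            # seconds between polls
    local max_tries="${2:-120}"          # give up after this many polls
    local health_cmd="${3:-ceph health}" # command whose output is inspected
    local try=1
    local status
    while [ "$try" -le "$max_tries" ]; do
        status=$($health_cmd 2>/dev/null)
        case $status in
            HEALTH_OK*)  echo "$status"; return 0 ;;
            HEALTH_ERR*) echo "$status" >&2; return 2 ;;
        esac
        sleep "$interval"
        try=$((try + 1))
    done
    return 1
}

# Usage on the node (not executed here):
#   wait_for_health_ok 30 120 && echo "recovery finished"
```

Returning early on HEALTH_ERR matches the advice above to follow up on errors right away instead of waiting them out.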