I run a PB-scale Ceph cluster providing RBD and RGW access. We are used to Ceph's point-release updates and to the advice from Sage to update existing clusters to the new version as soon as possible.
I have a lot of experience upgrading Ceph clusters, all the way from Firefly to the current Hammer. The process has always been to update the mons, followed by the OSDs, and finally the clients, in my case the RGWs and OpenStack clients. But doing the updates manually was time-consuming and very dull. To the rescue comes Ansible: with the playbook below we were able to upgrade a PB-scale cluster from 0.94.5 to 0.94.6 in just over an hour, fully unattended.
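# Upgrade the packages on every node first; daemons keep running the old
# binaries until they are restarted by the plays below.
- hosts: all
  tasks:
    - name: Update packages
      apt: upgrade=dist update_cache=yes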
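# Restart the monitors one host at a time.
- hosts: mons
  serial: 1
  tasks:
    - name: Restart ceph-mon service
      service: >
        name=ceph-mon-all
        state=restarted
    # The egrep pattern was missing in the source; matching this host's
    # name in the quorum line is an assumption.
    - name: Waiting for the monitor to join the quorum
      shell: >
        ceph -s | grep quorum | head -n1 | egrep -sq "{{ ansible_hostname }}"
      register: result
      until: result.rc == 0
      retries: 5
      delay: 10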
# Restart the OSDs one host at a time, waiting for clean PGs before and after.
- hosts: osd
  serial: 1
  tasks:
    - name: Set OSD maintenance flags
      command: ceph osd set {{ item }}
      with_items:
        - noout
        - noscrub
        - nodeep-scrub
      delegate_to: ceph-mon-1
    # Compare the active+clean PG count with the total PG count.
    - name: Waiting for clean PGs pre restart
      shell: >
        test "$(ceph pg stat | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq "$(ceph pg stat | sed 's/pgs.*//;s/^.*://;s/ //')" && ceph health | egrep -sq "HEALTH_OK|HEALTH_WARN"
      register: result
      until: result.rc == 0
      retries: 300
      delay: 10
      delegate_to: ceph-mon-1
    - name: Restart OSD processes
      service: >
        name=ceph-osd-all
        state=restarted
    - name: Waiting for clean PGs post restart
      shell: >
        test "$(ceph pg stat | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq "$(ceph pg stat | sed 's/pgs.*//;s/^.*://;s/ //')" && ceph health | egrep -sq "HEALTH_OK|HEALTH_WARN"
      register: result
      until: result.rc == 0
      retries: 300
      delay: 10
      delegate_to: ceph-mon-1
    - name: Unset OSD maintenance flags
      command: ceph osd unset {{ item }}
      with_items:
        - noout
        - noscrub
        - nodeep-scrub
      delegate_to: ceph-mon-1
# Restart the RADOS gateways one host at a time.
- hosts: radosgw
  serial: 1
  tasks:
    - name: Restart rgw service after upgrade
      service: >
        name=radosgw-all
        state=restarted
    - name: Wait for the rgw service to be running
      # TODO - replace this with a until: service status=running?
      shell: >
        pgrep radosgw
      register: result
      until: result.rc == 0
      retries: 100
      delay: 10
# Restart the load balancers one host at a time.
- hosts: loadbalancers
  serial: 1
  tasks:
    - name: Restart Load Balancers after the upgrade
      service: >
        name=haproxy
        state=restarted
    - name: Wait for the haproxy service to be running
      shell: >
        pgrep haproxy
      register: result
      until: result.rc == 0
      retries: 100
      delay: 10
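
The "Waiting for clean PGs" tasks in the playbook compare the number of active+clean PGs with the total number of PGs reported by `ceph pg stat`. The same check can be sketched as a small standalone shell function, which makes it easier to try against sample output before trusting it in a long unattended run. This is only a sketch: the function name `pgs_all_clean` is hypothetical, and it assumes Hammer-era output of the form `v1234: 4096 pgs: 4096 active+clean; ...`, which may differ on other releases.

```shell
#!/bin/sh
# Sketch only: pgs_all_clean is a hypothetical helper, not part of Ceph.
# It succeeds when every PG in one line of `ceph pg stat` output is
# active+clean. Assumed (Hammer-era) format:
#   v1234: 4096 pgs: 4096 active+clean; 1024 GB data, ...
pgs_all_clean() {
    stat_line=$1
    # Total PG count: the number just before " pgs:".
    total=$(printf '%s\n' "$stat_line" |
        sed -n 's/^.*: *\([0-9][0-9]*\) pgs:.*$/\1/p')
    # PGs whose state is exactly active+clean (not active+clean+scrubbing etc.).
    clean=$(printf '%s\n' "$stat_line" |
        sed -n 's/^.*[:,] *\([0-9][0-9]*\) active+clean[;,].*$/\1/p')
    # Fail if either number could not be parsed, or if they differ.
    [ -n "$total" ] && [ -n "$clean" ] && [ "$total" -eq "$clean" ]
}

# Usage against a live cluster (requires the ceph CLI):
#   pgs_all_clean "$(ceph pg stat)" && echo "all PGs clean"
```

Unlike the one-liner in the playbook, this version fails when the output cannot be parsed at all, which seems the safer default when the result decides whether the next OSD host gets restarted.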