While writing applications for myself, I kept thinking about how I could make my environment more bulletproof and stable. The fact that I was running everything on single systems was always a single point of failure. Until now! At least on the operating system level, I am past this obstacle.
This article is part of a series. Full series:
Make Linux cluster! – Beginning
Make Linux cluster! – Configure resources
Make Linux cluster! – Work and test resources
Make Linux cluster! – Pitfalls and observations
Manipulate resources
Cluster-defined resources must be handled from the crm shell. Simple start/stop/restart commands can be used for them.
crm(live/atihome)# resource stop bind9
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 17:48:05 2021
  * Last change: Sun Dec 5 17:48:04 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured (1 DISABLED)

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Stopped
  * bind9 (service:named): Stopped (disabled)
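The same commands do not have to be typed in the interactive shell; crm also accepts them as one-shot commands from a normal shell (as root). A quick sketch:

# stop, check and start the resource without entering the interactive crm shell
crm resource stop bind9
crm status
crm resource start bind9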
It can be observed that the DnsIP resource also stopped, because of the colocation constraint. Constraints can be checked with the constraints command:
crm(live/atihome)# resource constrain bind9
DnsIP              (score=INFINITY, id=DnsWithIP)
  : Node pihome    (score=25, id=DnsAltLocation)
  : Node atihome   (score=100, id=DnsLocation)
* bind9
The resource can be started again with resource start bind9 in the crm shell.
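For reference, constraints like these could be written roughly as follows in crm configure syntax. This is only a sketch using the ids and scores from the output above; the actual definitions were created in the previous part of the series:

# DnsIP must run on the same node as bind9 (score INFINITY)
colocation DnsWithIP inf: DnsIP bind9
# bind9 prefers atihome (score 100) over pihome (score 25)
location DnsLocation bind9 100: atihome
location DnsAltLocation bind9 25: pihome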
Planned move
Sometimes a resource needs to be moved manually, not just during a disaster. We can use these commands:
- resource move bind9 pihome: move bind9 and DnsIP to the pihome node
- resource clear bind9: clear the migration constraints created for the resource; it will then move back to the highest-score location (atihome)
crm(live/atihome)# resource move bind9 pihome
INFO: Move constraint created for bind9 to pihome
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 18:43:11 2021
  * Last change: Sun Dec 5 18:43:08 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started pihome
  * bind9 (service:named): Started pihome
crm(live/atihome)# resource clear bind9
INFO: Removed migration constraints for bind9
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 18:43:21 2021
  * Last change: Sun Dec 5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome
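As the INFO messages suggest, move only creates a temporary location constraint in the configuration (Pacemaker typically names it cli-prefer-<resource>) and clear removes it again. While a move is in effect, it can be inspected with something like this sketch:

# list the full configuration and filter for the cli-generated constraint
crm configure show | grep cli-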
Service fail
It can happen that a service fails on a node. Depending on the migration-threshold and the restart option, several things can happen (a configuration sketch follows this list):
- If the restart option is never, nothing is done locally except moving the resource
- If restart is allowed at least on-failure and the threshold is not reached, the resource is restarted locally
- If the threshold is reached, the resource is moved to another node
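A minimal sketch of where these knobs could be set in crm configure syntax. The 30-second monitor interval matches the bind9_monitor_30000 action in the failure output below; on-fail=restart and migration-threshold=3 are only assumed example values, not the actual configuration of this cluster:

# monitor every 30 seconds, restart in place on failure,
# move away after 3 failures (example values)
primitive bind9 service:named \
    op monitor interval=30s on-fail=restart \
    meta migration-threshold=3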
I simulated a failure by stopping bind9 outside of crm. The cluster detected the error at the next monitor interval and restarted the service in place. It also recorded the failure in the status output:
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 18:43:21 2021
  * Last change: Sun Dec 5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome

Failed Resource Actions:
  * bind9_monitor_30000 on atihome 'not running' (7): call=45, status='complete', exitreason='', last-rc-change='2021-12-05 17:05:46 +01:00', queued=0ms, exec=0ms
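For reference, the failure was triggered and can be cleaned up roughly like this. A sketch, assuming bind9 runs as the systemd unit named on this host; crm resource cleanup is the standard way to clear the failed-action record and the fail count:

# simulate a failure: stop the service behind the cluster's back
systemctl stop named

# after the cluster has restarted it, inspect and clear the failure history
crm resource failcount bind9 show atihome
crm resource cleanup bind9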
System fail
I stopped corosync and pacemaker manually on the main server as a disaster test. The resources were moved to the other node:
crm(live/pihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 18:48:28 2021
  * Last change: Sun Dec 5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ pihome ]
  * OFFLINE: [ atihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started pihome
  * bind9 (service:named): Started pihome
After starting corosync and pacemaker again, the resources moved back to atihome:
crm(live/pihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 18:50:09 2021
  * Last change: Sun Dec 5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome
There is another function called fencing, which can act when a node drops out of the cluster. For example, in this case the system was still alive while not in the cluster; it could be powered off or rebooted by another node to make sure the "dead remains dead" and does not cause issues.
But fencing is not configured yet, so the cluster engine simply believes that the systems it cannot see are completely stopped. That can be a problem on production systems and could cause consistency issues, but I have not configured a fencing mechanism in my home lab.
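In a lab like this, fencing is usually disabled explicitly so Pacemaker does not complain about the missing STONITH devices. A sketch of the usual property; I am assuming this is how it was handled here, the series does not show it:

# tell Pacemaker not to require a STONITH/fencing device
crm configure property stonith-enabled=false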
Put a node into maintenance
It can happen that I am doing maintenance on a node and do not want crm to interfere there. For this, we can put a node or a single resource into maintenance mode. In maintenance mode, crm does not manage the affected resources.
Resources are handled via the resource sub-command, which has a maintenance parameter: resource maintenance <resource> on/off.
crm(live/atihome)# resource maintenance bind9 on
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 19:05:22 2021
  * Last change: Sun Dec 5 19:05:15 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome (unmanaged)
crm(live/atihome)# resource maintenance bind9 off
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 19:05:40 2021
  * Last change: Sun Dec 5 19:05:39 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome
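Under the hood this toggles a maintenance meta attribute on the resource itself; after switching it off the attribute stays in the configuration as maintenance=false, which explains the question in the next example. It can be checked with a sketch like:

# the resource definition now carries a maintenance meta attribute
crm configure show bind9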
Working with nodes has a slightly different syntax: node maintenance <name> puts the node into maintenance mode and node ready <name> takes it out again:
crm(live/atihome)# node maintenance atihome
'maintenance' attribute already exists in bind9. Remove it (y/n)? y
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 19:03:03 2021
  * Last change: Sun Dec 5 19:03:00 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Node atihome: maintenance
  * Online: [ pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome (unmanaged)
  * bind9 (service:named): Started atihome (unmanaged)
crm(live/atihome)# node ready atihome
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec 5 19:03:31 2021
  * Last change: Sun Dec 5 19:03:29 2021 by root via crm_attribute on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP (ocf::heartbeat:IPaddr2): Started atihome
  * bind9 (service:named): Started atihome
As can be seen, every resource that was running on atihome inherited the maintenance mode.
Final words
Handling resources is simple, at least so far. The command line interface also feels handy after a few hours of testing and practice.