Make Linux cluster! – Work and test resources

While I was writing applications for myself, I kept thinking about how I could make my environment more bulletproof and stable. The fact that I was using single systems was always a single point of failure. Until now! At least at the operating system level, I am past this obstacle.

This article is part of a series. Full series:
Make Linux cluster! – Beginning
Make Linux cluster! – Configure resources
Make Linux cluster! – Work and test resources
Make Linux cluster! – Pitfalls and observations

Manipulate resources

Cluster-defined resources must be handled from the crm shell. Simple start/stop/restart commands can be used for them.
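
These sub-commands also work non-interactively from a normal root shell, for example (a sketch, assuming crmsh is installed the same way as on my nodes):

crm resource stop bind9
crm resource start bind9
crm resource restart bind9

Inside the interactive crm shell, stopping bind9 looks like this: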

crm(live/atihome)# resource stop bind9
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 17:48:05 2021
  * Last change:  Sun Dec  5 17:48:04 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured (1 DISABLED)

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Stopped
  * bind9       (service:named):         Stopped (disabled)

It can be observed that the DnsIP resource also stopped, because of the colocation constraint. Constraints can be checked with the constraints sub-command:

crm(live/atihome)# resource constrain bind9
    DnsIP                                                                        (score=INFINITY, id=DnsWithIP)
    : Node pihome                                                                (score=25, id=DnsAltLocation)
    : Node atihome                                                               (score=100, id=DnsLocation)
* bind9

The resource can be started again with resource start bind9 in the crm shell.
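
For reference, the constraints behind this behaviour would look roughly like the following in configure syntax. This is only a reconstruction from the IDs and scores shown above (the real definitions were made in the previous part), so treat it as a sketch:

colocation DnsWithIP inf: DnsIP bind9
location DnsLocation bind9 100: atihome
location DnsAltLocation bind9 25: pihome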

Planned move

Sometimes a resource needs to be moved manually, not just during a disaster. We can use these commands:

  • resource move bind9 pihome: Move bind9 (and with it DnsIP) to the pihome node
  • resource clear bind9: Remove the move constraint created for the resource; it then moves back to the location with the highest score (atihome)

crm(live/atihome)# resource move bind9 pihome
INFO: Move constraint created for bind9 to pihome
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 18:43:11 2021
  * Last change:  Sun Dec  5 18:43:08 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started pihome
  * bind9       (service:named):         Started pihome
crm(live/atihome)# resource clear bind9
INFO: Removed migration constraints for bind9
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 18:43:21 2021
  * Last change:  Sun Dec  5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome
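
One more note on planned moves: the move sub-command also accepts an optional lifetime, after which the temporary move constraint expires on its own. I did not use it in this test, so this is just a sketch of the standard syntax:

# move bind9 to pihome; the move constraint expires after one hour
crm resource move bind9 pihome PT1H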

Service failure

It can happen that the service fails on a node. Depending on the migration-threshold and the restart option, different things can happen (a sketch of the relevant settings follows the list):

  • If the restart option is never, nothing is done locally; the resource is only moved
  • If the restart option is at least on-failure and the threshold has not been reached, the resource is restarted locally
  • If the threshold has been reached, the resource is moved to another node
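
A sketch of the knobs involved, set from a normal shell with crmsh (the values are illustrative, not my actual configuration):

# fail over only after 3 local failures, and forget old failures after 10 minutes
crm resource meta bind9 set migration-threshold 3
crm resource meta bind9 set failure-timeout 600s

# show how many times the resource has failed on a given node
crm resource failcount bind9 show atihome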

I caused a failure by stopping bind9 outside of crm. The cluster detected the error at the next monitor interval and restarted the service in place. It also put a message about it in the status output:

crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: atihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 18:43:21 2021
  * Last change:  Sun Dec  5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome

Failed Resource Actions:
  * bind9_monitor_30000 on atihome 'not running' (7): call=45, status='complete', exitreason='', last-rc-change='2021-12-05 17:05:46 +01:00', queued=0ms, exec=0ms
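
For reproducing the test: this is roughly how the failure was triggered from outside the cluster and how the failure record can be cleared afterwards. The unit name is taken from the service:named resource definition, so treat it as an assumption:

# stop the service behind the cluster's back
systemctl stop named

# after the cluster has restarted it, clear the failure record from the status
crm resource cleanup bind9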

System failure

I stopped corosync and pacemaker manually on the main server as a disaster test. The resources were moved to the other node:

crm(live/pihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 18:48:28 2021
  * Last change:  Sun Dec  5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ pihome ]
  * OFFLINE: [ atihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started pihome
  * bind9       (service:named):         Started pihome
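
For reference, this is roughly how the node was taken down and brought back on atihome (standard systemd unit names, so only a sketch of my manual steps):

# simulate the disaster
systemctl stop pacemaker corosync

# later: rejoin the cluster
systemctl start corosync pacemaker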

After starting corosync and pacemaker again, the resources moved back to atihome:

crm(live/pihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 18:50:09 2021
  * Last change:  Sun Dec  5 18:43:20 2021 by root via crm_resource on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome

There is another function, called fencing, which can act when a node drops out of the cluster. In this case, for example, the system was still alive while it was out of the cluster; with fencing, it could be stopped or rebooted by the other node to make sure the “dead remains dead” and does not cause issues.

But fencing is not configured yet, so the cluster engine simply believes that the systems it cannot see are completely stopped. This can be an issue on production systems and could cause consistency problems, but I have not configured a fencing mechanism in my home lab.
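
For completeness: STONITH is controlled by a cluster property, and a fence-less lab setup like this one typically runs with it disabled, roughly like the following (the property itself is standard Pacemaker, but treat this as a sketch and definitely not a recommendation for production):

crm configure property stonith-enabled=false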

Put a node into maintenance

It can happen that I am doing maintenance on a node and do not want crm to do anything there. For this, we can put a node or a resource into maintenance mode. In maintenance mode, crm is not managing the affected resources.

Resources are handled via the resource sub-command, which has a maintenance parameter: resource maintenance <resource> on/off.

crm(live/atihome)# resource maintenance bind9 on
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 19:05:22 2021
  * Last change:  Sun Dec  5 19:05:15 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome (unmanaged)
crm(live/atihome)# resource maintenance bind9 off
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 19:05:40 2021
  * Last change:  Sun Dec  5 19:05:39 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome

Working with nodes has a slightly different syntax: node maintenance <name> puts the node into maintenance mode and node ready <name> undoes it:

crm(live/atihome)# node maintenance atihome
'maintenance' attribute already exists in bind9. Remove it (y/n)? y
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 19:03:03 2021
  * Last change:  Sun Dec  5 19:03:00 2021 by root via cibadmin on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Node atihome: maintenance
  * Online: [ pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome (unmanaged)
  * bind9       (service:named):         Started atihome (unmanaged)
crm(live/atihome)# node ready atihome
crm(live/atihome)# status
Cluster Summary:
  * Stack: corosync
  * Current DC: pihome (version 2.0.5-ba59be7122) - partition with quorum
  * Last updated: Sun Dec  5 19:03:31 2021
  * Last change:  Sun Dec  5 19:03:29 2021 by root via crm_attribute on atihome
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ atihome pihome ]

Full List of Resources:
  * DnsIP       (ocf::heartbeat:IPaddr2):        Started atihome
  * bind9       (service:named):         Started atihome

As can be seen, each resource that was running on atihome inherited the maintenance mode.
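
As a related note, the whole cluster can also be switched to maintenance at once with the standard maintenance-mode cluster property. I did not need it here, so this is only a sketch:

# stop managing every resource in the cluster
crm configure property maintenance-mode=true

# hand control back to the cluster when done
crm configure property maintenance-mode=false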

Final words

Handling resources is simple, at least for now. The command line interface also seems handy after a few hours of testing and practice.

Ati
