Extending Minimum Time Between IPMI Operations in OpenStack Ironic

Sep 29, 2018 Cloud Computing, Linux

In TripleO-based installations of OpenStack, the deployment of Overcloud nodes is coordinated by the Ironic module, which was developed to deploy operating systems on bare metal servers. Ironic supports a number of hardware drivers, such as pxe_ilo for HP ProLiant Gen8+, pxe_drac for Dell 12G+, and ipmi for generic servers that speak IPMI (Intelligent Platform Management Interface). With the generic ipmi driver, Ironic, which is installed on the undercloud (or OSP director), uses the Linux command-line tool ipmitool to send IPMI messages to the Overcloud nodes and control them during the TripleO deployment.

During the TripleO deployment, Ironic powers on each node, uploads the kernel image and installs the OS, then reboots the node. Once the node reports that it is up and running after the reboot, the deployment is considered complete. Each IPMI reboot operation is in fact a combination of a “power off” and a “power on” message. Ironic ships with default timing settings, such as the number of retries for IPMI operations and the maximum time to keep retrying them. Below is the default ipmi section from the /etc/ironic/ironic.conf file for the Pike release:

[ipmi]
#command_retry_timeout = 60
retry_timeout = 15
#min_command_interval = 5

When I deployed the RDO TripleO Pike release in my environment, which consists of old Oracle Sun Fire X4270 servers and HP ProLiant Gen5 and Gen6 servers, I noticed that with the default Ironic settings the HP ProLiant Gen5 and Gen6 servers respond to the IPMI messages sent from the undercloud: they power off, then start again without any problems. The Oracle Sun Fire X4270 servers, however, never come back up after shutting down. In the Ironic log file /var/log/ironic/ironic-conductor.log I can see that the servers receive a valid “power on” IPMI message:

...
2018-07-11 23:55:15.259 1433 DEBUG oslo_concurrency.processutils [req-903e4775-419f-48ef-b42a-504dee9c2eff 3f7f46753ff54c509c34070fc992355c 325bf08fee894f6a83ad5fd288b1226d - default default] CMD "ipmitool -I lanplus -H 192.168.2.13 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpfw1WUj power on" returned: 0 in 0.264s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:404
2018-07-11 23:55:15.262 1433 DEBUG ironic.common.utils [req-903e4775-419f-48ef-b42a-504dee9c2eff 3f7f46753ff54c509c34070fc992355c 325bf08fee894f6a83ad5fd288b1226d - default default] Execution completed, command line is "ipmitool -I lanplus -H 192.168.2.13 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmpfw1WUj power on" execute /usr/lib/python2.7/site-packages/ironic/common/utils.py:75
2018-07-11 23:55:15.263 1433 DEBUG ironic.common.utils [req-903e4775-419f-48ef-b42a-504dee9c2eff 3f7f46753ff54c509c34070fc992355c 325bf08fee894f6a83ad5fd288b1226d - default default] Command stdout is: "Chassis Power Control: Up/On
" execute /usr/lib/python2.7/site-packages/ironic/common/utils.py:76
2018-07-11 23:55:15.265 1433 DEBUG ironic.common.utils [req-903e4775-419f-48ef-b42a-504dee9c2eff 3f7f46753ff54c509c34070fc992355c 325bf08fee894f6a83ad5fd288b1226d - default default] Command stderr is: "" execute /usr/lib/python2.7/site-packages/ironic/common/utils.py:77
...

…but soon afterwards I see timeout errors saying that the node failed to power on within 60 seconds:

[root@undercloud ~]# tail -f /var/log/ironic/ironic-conductor.log
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall Traceback (most recent call last):
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 141, in _run_loop
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall     idle = idle_for_func(result, watch.elapsed())
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall   File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 338, in _idle_for
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall     % self._error_time)
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall LoopingCallTimeOut: Looping call timed out after 47.48 seconds
2018-07-11 23:56:59.059 1433 ERROR oslo.service.loopingcall
2018-07-11 23:56:59.061 1433 ERROR ironic.conductor.utils [req-577784b3-ef60-4068-9693-795a0f43f37a - - - - -] Timed out after 60 secs waiting for power power on on node 63fde05d-d96c-4ad0-8f5a-7c4a6221bd5b.: LoopingCallTimeOut: Looping call timed out after 47.48 seconds
2018-07-11 23:56:59.153 1433 ERROR ironic.drivers.modules.agent_base_vendor [req-577784b3-ef60-4068-9693-795a0f43f37a - - - - -] Error rebooting node 63fde05d-d96c-4ad0-8f5a-7c4a6221bd5b after deploy. Error: Failed to set node power state to power on.: PowerStateFailure: Failed to set node power state to power on.
2018-07-11 23:56:59.156 1433 DEBUG ironic.drivers.modules.agent_client [req-577784b3-ef60-4068-9693-795a0f43f37a - - - - -] Executing agent command log.collect_system_logs for node 63fde05d-d96c-4ad0-8f5a-7c4a6221bd5b _command /usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_client.py:62
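When several nodes are deployed at once, it can help to pull just the power timeouts out of the conductor log. The following is a minimal sketch, assuming the message format shown in the excerpt above; the function name find_power_timeouts and the regular expression are my own and are not part of Ironic.

```python
import re

# Pattern modelled on the ERROR line from ironic.conductor.utils shown above.
# Assumption: other Ironic releases may word this message differently.
POWER_TIMEOUT = re.compile(
    r"Timed out after (?P<secs>\d+) secs waiting for power (?P<state>.+?) "
    r"on node (?P<node>[0-9a-f-]{36})"
)


def find_power_timeouts(lines):
    """Return (node_uuid, target_state, seconds) for each power timeout line."""
    hits = []
    for line in lines:
        match = POWER_TIMEOUT.search(line)
        if match:
            hits.append(
                (match.group("node"), match.group("state"), int(match.group("secs")))
            )
    return hits


# Shortened copy of the log line from the excerpt above.
sample = [
    "2018-07-11 23:56:59.061 1433 ERROR ironic.conductor.utils [req-...] "
    "Timed out after 60 secs waiting for power power on on node "
    "63fde05d-d96c-4ad0-8f5a-7c4a6221bd5b.: LoopingCallTimeOut: ...",
]
print(find_power_timeouts(sample))
```

To scan the real file, pass an open file object instead of the sample list, for example open("/var/log/ironic/ironic-conductor.log").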

It looks like, for the Sun Fire X4270 servers, the interval between the “power off” and “power on” IPMI messages is too short because these servers take longer to shut down. When the “power on” message is sent, the server is still shutting down and ignores it.

Changing the value of the min_command_interval parameter to 15 seconds fixed the issue on the Sun Fire X4270. The ipmi section of /etc/ironic/ironic.conf after the modification:

[ipmi]
#command_retry_timeout = 60
retry_timeout = 15
min_command_interval = 15
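The ipmitool command line in the log excerpt earlier carries -R 12 -N 5 (retry count and seconds between retransmissions), which is consistent with the defaults: 60 seconds of retry time divided by a 5 second interval. The sketch below assumes that Ironic derives the retry count by integer division of command_retry_timeout by min_command_interval and passes min_command_interval as -N. The function name is hypothetical and this is not Ironic's actual code.

```python
def ipmitool_retry_args(command_retry_timeout, min_command_interval):
    """Return the -R/-N ipmitool arguments for the given timing settings."""
    # Assumption: retries = total retry budget // interval, with at least one try.
    num_tries = max(command_retry_timeout // min_command_interval, 1)
    return ["-R", str(num_tries), "-N", str(min_command_interval)]


print(ipmitool_retry_args(60, 5))   # default values
print(ipmitool_retry_args(60, 15))  # after raising min_command_interval
```

Under that assumption, raising the interval to 15 seconds means fewer retries, spaced further apart, within the same 60 second budget.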

Now the servers have enough time to shut down completely before the “power on” IPMI message arrives from the undercloud.
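To illustrate what a minimum interval between operations means in practice, here is a small sketch of a per-host throttle: before each command it waits until at least min_interval seconds have passed since the previous command to the same host. The class CommandThrottle and its methods are hypothetical names for illustration only; Ironic's own implementation may differ.

```python
import time


class CommandThrottle:
    """Enforce a minimum delay between commands sent to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_sent = {}     # host -> time the last command was released

    def wait_turn(self, host):
        """Block until the host may receive a command; return seconds waited."""
        waited = 0.0
        last = self._last_sent.get(host)
        if last is not None:
            remaining = self.min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_sent[host] = self._clock()
        return waited


# Usage idea: call throttle.wait_turn("192.168.2.13") before each ipmitool call.
throttle = CommandThrottle(min_interval=15)
```

With a 15 second interval, a “power on” that would otherwise follow a “power off” after 4 seconds is held back for another 11 seconds, while commands to other hosts are not delayed.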

