Callhome SRC E3325009 reported for Version 9

Created by Karyn Corneli on Mon, 09/09/2019 - 13:53
Published URL:
https://www.ibm.com/support/pages/node/1073300

Troubleshooting


Problem

Call home SRC E3325009 occurs when the primary HMC fails to remaster within the time window for arbitration.  This event is reported to notify the customer that the managed system might be running without a primary HMC to perform problem analysis if a service event were to occur against the system.  The primary HMC function is critical to ensure hardware events are reported and called home to IBM.

Symptom

Call home SRC E3325009 is listed on the HMC under Manage Serviceable Events, or in the lssvcevents -t hardware output, reported against the MTMS of the HMC that is acting as the primary. The SRC E3325009 is reported at least once per day by the scheduled rearbitration operation. It can also be reported after an HMC reboot, or if the managed system is removed and re-added to the HMC.
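
For example, the event can be checked for from the HMC command line using the command referenced above; the grep filter is only a convenience:

lssvcevents -t hardware | grep E3325009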

Cause

When arbitration runs, the HMC has 15 minutes to remaster and establish a primary analyzing HMC. If it does not, the HMC reports the SRC and calls home to IBM to indicate that a managed system is left without a primary HMC to perform problem analysis and call home system-level errors (SRC events). At Version 9 this SRC is reported due to:

1.  The rearbitration schedule in V9, which causes the SRC to be reported once a day.

2.  Communication attempts between the primary HMC and the secondary HMC across all interfaces, which can exceed the arbitration timeout value.

3.  Connectivity issues between the HMC and one or more managed systems, where the lssysconn -r all command shows the state as "No Connection", "Pending Authentication", or "Failed Authentication".

Environment

Version 9.1 M930 + any additional PTFs

Any HMC Hardware Appliance platform or Virtual Appliance platform

Diagnosing The Problem

Call home SRC E3325009 is generated every day at around the same time, driven by the scheduled rearbitration operation.

Resolving The Problem

If the issue is connectivity to one or more managed systems, resolve the connectivity state of "No Connection", "Pending Authentication", or "Failed Authentication" as seen in the lssysconn -r all output. After resolving the connectivity issues, monitor to confirm that the SRC no longer reports.
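
As a hedged shortcut for finding the problem connections, the default lssysconn output includes a state= attribute, so healthy connections can be filtered out; the exact attribute formatting may vary by HMC level:

lssysconn -r all | grep -v "state=Connected"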

If the issue is not related to connectivity, the full fix, which includes the M940 fixes for E3325009 plus an ipv6 disablement update, is available in the V9.1 M941 Service Pack release:

PTF MH01859 HMC V9 R1.941.0 Service Pack - for 7042 Machine Types or vHMC for x86 hypervisors (5765-HMW)

PTF MH01860 HMC V9 R1.941.0 Service Pack - for 7063 Machine Types or vHMC for PowerVM (5765-HMB)

Another fix, specific to any configuration that includes virtual HMC arbitration, is available in HMC V9 R1.941.2:

PTF MH01864 HMC V9 R1.941.2 - for 7042 Machine Types or vHMC for x86 hypervisors (5765-HMW)

PTF MH01865 HMC V9 R1.941.2 - for 7063 Machine Types or vHMC for PowerVM (5765-HMB)

This fix corrects the vHMC MTMS parsing logic for primary HMC arbitration so that virtual HMCs are handled uniquely, preventing the chprimhmc command error "Failed to get service operation information from peer" and the subsequent E3325009 reporting.

Fix Central Download



Simple steps for mitigating E3325009:

Step 1. In *ALL* environments where ipv6 is not used, ensure ipv6 is disabled: chhmc -s disable --ipv6 -c network
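
As a sketch, run the command on each HMC; the second command is an optional check that assumes the ipv6 attributes appear in the lshmc -n output:

chhmc -s disable --ipv6 -c network
lshmc -n | grep -i ipv6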

NOTE: Step 2 is a very important sequence in resolving the reporting of E3325009. It clears the cached network and interface entries between the HMCs and allows the HMCs to perform fresh connection discovery of each other. This resolves the underlying reason the HMCs were unable to connect and prevents E3325009 from reporting.

Step 2. *DO NOT SKIP THIS STEP* Re-save the settings for the network on each HMC. Navigate to Console Settings -> Change Network Settings -> click LAN Adapters and verify:

  • Each ethX adapter is set with the expected IP address
  • No references to localhost or localdomain are listed under the Name Services tab or the Identification tab

Then click OK to re-save the settings, even if no changes were made. Reboot each HMC one at a time to refresh the information for primary arbitration, starting with the HMC that is not reporting the E3325009 event, followed by the HMC that is reporting it.
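
One way to perform the reboots in that order is the standard hmcshutdown command from the restricted shell, which requests an immediate restart; confirm this is acceptable for your maintenance window:

hmcshutdown -r -t now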

If E3325009 continues to report, the most likely cause is that the HMC has a reference to an HMC name or IP address that is no longer valid or reachable.

Some examples include:

  • The order in which the HMCs were rebooted left one HMC without the refreshed network information. Reboot the HMC reporting the E3325009 one more time.
  • An older HMC model was replaced with a new model and given the same hostname and IP address.
  • An HMC name or IP address was updated in DNS, but the changes have not been made on the HMC itself.
  • Port 9920 is blocked in the HMC firewall or in the customer's firewall between the HMCs.

FURTHER TROUBLESHOOTING STEPS if Steps 1 and 2 do not mitigate the reporting of E3325009:

1. Verify that all the IP addresses defined on the HMC are correct and in use:

hmc01>lshmc -n -Fhostname:ipaddr
hmc01:10.1.1.10,11.9.9.9,10.0.0.32

2. If an IP address was decommissioned or is not supposed to be in use by the HMC, remove it from the configuration in the "Change Network Settings" menu under HMC Management.  Continue with the additional steps for further network validation before rebooting both HMCs.

Example: hostname is hmc01 and IP address is 10.1.1.10
         hostname is hmc02 and IP address is 10.1.1.20

2a.  If DNS is not in use, ensure a unique hostname (not localhost) is set, and skip to step 3.
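
If the hostname does need to be changed, one hedged example is shown below; the hostname value is illustrative and the exact chhmc options should be confirmed for your HMC level:

chhmc -c network -s modify -h hmc01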

2b.  If DNS is in use, verify the DNS setup is identical on both HMCs.

From hmc01:  

hmc01>lshmc -n -Fhostname:domain:nameserver:dns:domainsuffix
hmc01:boulder.ibm.com:10.0.130.50,10.0.128.50:enabled:boulder.ibm.com

From hmc02: 

hmc02>lshmc -n -Fhostname:domain:nameserver:dns:domainsuffix
hmc02:boulder.ibm.com:10.0.130.50,10.0.128.50:enabled:boulder.ibm.com

2c. Check that the forward and reverse lookups match on both HMCs for each other.

From hmc01:  

hmc01>host hmc02
hmc02.boulder.ibm.com has address 10.1.1.20

hmc01>host 10.1.1.20
20.1.1.10.in-addr.arpa domain name pointer hmc02.boulder.ibm.com.

From hmc02: 

hmc02>host hmc01
hmc01.boulder.ibm.com has address 10.1.1.10

hmc02>host 10.1.1.10
10.1.1.10.in-addr.arpa domain name pointer hmca01.boulder.ibm.com.

The reverse lookup shows hmca01, not hmc01, which is not a match. The DNS records need to be fixed so that the forward and reverse lookups match.

3. Verify that port 9920 is open in the firewall for all interfaces on both HMCs with no restriction on the allowed host:

hmc01>lshmc --firewall | grep 9920
interface=eth1,application=FCS,"ports=9920:tcp,udp:9900",allowedhost=0.0.0.0/0.0.0.0
interface=eth2,application=FCS,"ports=9920:tcp,udp:9900",allowedhost=0.0.0.0/0.0.0.0
interface=eth0,application=FCS,"ports=9920:tcp,udp:9900",allowedhost=0.0.0.0/0.0.0.0
interface=eth1,application=FCS,"ports=9920:tcp6,udp6:9900",allowedhost=::/::
interface=eth2,application=FCS,"ports=9920:tcp6,udp6:9900",allowedhost=::/::
interface=eth0,application=FCS,"ports=9920:tcp6,udp6:9900",allowedhost=::/::

Repeat the lshmc --firewall command from hmc02 and verify the same results.

4. Verify 9920 is connected between the HMCs and shows as "ESTABLISHED":

hmc01>cat $(ls -tr /dump/HMCMonitor/netstat* | tail -1) | grep 9920 | grep 10.1.1.20

tcp6       0      0 10.1.1.10:9920          10.1.1.20:37976         ESTABLISHED 7340/java

hmc02>cat $(ls -tr /dump/HMCMonitor/netstat* | tail -1) | grep 9920 | grep 10.1.1.10

tcp6       0      0 10.1.1.20:39998         10.1.1.10:9920        ESTABLISHED 7952/java

5. If port 9920 shows any status other than ESTABLISHED, or is not present, verify with the network administrator that no firewall is blocking port 9920 between the HMCs.

6. Verify each HMC knows who is primary for each managed system after a reboot:

From hmc01:

hmc01>lsprimhmc -m 9009-42A-78XXXXX
is_primary=1,primary_hmc_mtms=VM-Vdxx-xxx/xxxxxxx,primary_hmc_ipaddr=10.1.1.10,primary_hmc_hostname=hmc01

From hmc02:

hmc02>lsprimhmc -m 9009-42A-78XXXXX
is_primary=0,primary_hmc_mtms=VM-Vdxx-xxx/xxxxxxx,primary_hmc_ipaddr=10.1.1.10,primary_hmc_hostname=hmc01

If the output shows good values and good communication, E3325009 may report one more time for one or more managed systems around the time the HMC rebooted, due to timing, but it will stop reporting on the daily schedule. If the output shows blank values from either HMC, the HMCs are still not communicating with each other and E3325009 will continue reporting every day. A pedbg -c -q 4 collection from both HMCs is necessary for further debugging of the problem. Contact IBM hardware support for additional assistance.

Blank value example:

hmc02>lsprimhmc -m 9009-42A-78XXXXX
is_primary=0,primary_hmc_mtms=,primary_hmc_ipaddr=,primary_hmc_hostname=
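
The pedbg collection referenced above can be gathered from the restricted shell of each HMC; the prompts are illustrative:

hmc01>pedbg -c -q 4
hmc02>pedbg -c -q 4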

     

Internal Use Only

How to verify the customer performed the *MOST IMPORTANT* Step 2: open the taskRecorder0.trm from the call home data. Verify the customer clicked "Change Network Settings" and completed the task by clicking the "OK" button. Verify the customer rebooted after making the change. The time of the change should be later than the last E3325009 reported on a scheduled basis. E3325009 can report during the reboot if arbitration takes too long, but shortly afterward the managed system should show an established primary HMC MTMS in a PA_DOM_PRM entry, not a null primary.
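One hedged way to check this in the extracted call home data, assuming the task names appear as plain text in the recorder file ("Shutdown or Restart" is the GUI task name for the reboot):
grep -iE "Change Network Settings|Shutdown or Restart" taskRecorder0.trm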
Additional information: even after the fix is installed, E3325009 can continue to report. In one of the iqyylogs for one of the HMCs, the details of why the two HMCs are not communicating will be in the following entries:
Label:  PA   SRC: E3321037

Scroll to the bottom of the E332 entry and look for the fullyResolveMachineId section, which includes a parsing of what FCS has for the HMC ethX entries. When E3321037 is logged, there is a mismatch between the machineId entry and the fullyResolved entry. This is why Step 2, performed on both HMCs in the action plan, is the MOST IMPORTANT step, especially once all name resolution is resolved or when no network issue is identified yet E3325009 continues to report. The mismatch comes from cached network settings, which are refreshed by clicking OK to save the network settings and then rebooting.
Step 1) Locate the beginning of the stanza of entries by searching on "fullyResolveMachineId"
2020-06-04@10:21:04.542771 00004493:000083E8 XFCSSVRT: -> fullyResolveMachineId(/10.177.181.91)
Step 2) Locate the "machineId" entry, and look for each "NIC Address[x]" entry, taking note of the number [0-5] and the value of each:
2020-06-04@10:21:04.542809 00004493:000083E8 XFCSCONF: machineId is [MachineId: Name: hmc21857, Group: PHMC, MTMS: 7063_CR1-130XXXX, NIC Address[0]: 10.177.181.91 (ac1f6b0fa944) (false), NIC Address[1]: fe80:0:0:0:ae1f:6bff:fe0f:a944 (ac1f6b0fa944) (false), NIC Address[2]: 10.0.255.1 (ac1f6b0fa945) (true), NIC Address[3]: fe80:0:0:0:ae1f:6bff:fe0f:a945 (ac1f6b0fa945) (true), NIC Address[4]: 0.0.0.0 (ac1f6b0fa946) (true), NIC Address[5]: 0.0.0.0 (ac1f6b0fa947) (true)]
Step 3) Locate the "fullyResolved" entry, and look for each "NIC Address[x]" entry, taking note of the number [0-3] and the value of each:
2020-06-04@10:21:04.543110 00004493:000083E8 XFCSCONF: fullyResolved is [MachineId: Name: hmc21857, Group: PHMC, MTMS: 7063_CR1-130XXXX, NIC Address[0]: 10.177.181.91 (ac1f6b0fa944) (false), NIC Address[1]: 10.0.255.1 (ac1f6b0fa945) (true), NIC Address[2]: 0.0.0.0 (ac1f6b0fa946) (true), NIC Address[3]: 0.0.0.0 (ac1f6b0fa947) (true)] 
Step 4) Compare the number and type of entries. Every ip (macaddress) entry must match a corresponding ip (macaddress) entry. If there is a difference in the number of entries, or in an entry's IP address/value, that is the problem causing E3325009.
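As a hedged shortcut, if the formatted log has been saved as a plain text file (the iqyylog.log file name here is an assumption about how the data was extracted), the relevant lines can be pulled out together for comparison:
grep -E "fullyResolveMachineId|machineId is|fullyResolved is" iqyylog.log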
 
Bad example:
Key (machineId entry = fullyResolved entry):
MATCH:
  NIC Address[0]: 10.177.181.91 (ac1f6b0fa944) = NIC Address[0]: 10.177.181.91 (ac1f6b0fa944)
  NIC Address[2]: 10.0.255.1 (ac1f6b0fa945) = NIC Address[1]: 10.0.255.1 (ac1f6b0fa945)
  NIC Address[4]: 0.0.0.0 (ac1f6b0fa946) = NIC Address[2]: 0.0.0.0 (ac1f6b0fa946)
  NIC Address[5]: 0.0.0.0 (ac1f6b0fa947) = NIC Address[3]: 0.0.0.0 (ac1f6b0fa947)
NOT A MATCH:
  NIC Address[1]: fe80:0:0:0:ae1f:6bff:fe0f:a944 (ac1f6b0fa944) /= anything
  NIC Address[3]: fe80:0:0:0:ae1f:6bff:fe0f:a945 (ac1f6b0fa945) /= anything
MISSING:
  NIC Address[4] and NIC Address[5] are missing because ipv6 is not present in the fullyResolved entry
The PROBLEM CAUSING E3325009 in this example is that ipv6 is listed in the machineId entry, but NOT in the fullyResolved entry. Step 1 (the ipv6 disablement) in the Resolving The Problem section addresses this cause.
2020-06-04@10:21:04.542771 00004493:000083E8 XFCSSVRT: -> fullyResolveMachineId(/10.177.181.91)
2020-06-04@10:21:04.542809 00004493:000083E8 XFCSCONF: machineId is [MachineId: Name: hmc21857, Group: PHMC, MTMS: 7063_CR1-130XXXX, NIC Address[0]: 10.177.181.91 (ac1f6b0fa944) (false), NIC Address[1]: fe80:0:0:0:ae1f:6bff:fe0f:a944 (ac1f6b0fa944) (false), NIC Address[2]: 10.0.255.1 (ac1f6b0fa945) (true), NIC Address[3]: fe80:0:0:0:ae1f:6bff:fe0f:a945 (ac1f6b0fa945) (true), NIC Address[4]: 0.0.0.0 (ac1f6b0fa946) (true), NIC Address[5]: 0.0.0.0 (ac1f6b0fa947) (true)]
2020-06-04@10:21:04.542822 00004493:000083E8 XFCSCONF: machineId.getName() is hmc21857
2020-06-04@10:21:04.542836 00004493:000083E8 XFCSCONF: machineId.getGroup() is PHMC
2020-06-04@10:21:04.543067 00004493:000083E8 XFCSCONF: machineId.getVersion() is 1.0
2020-06-04@10:21:04.543074 00004493:000083E8 XFCSCONF: machineId.getType() is 1
2020-06-04@10:21:04.543081 00004493:000083E8 XFCSCONF: machineId.getFcsVersion() is 1
2020-06-04@10:21:04.543110 00004493:000083E8 XFCSCONF: fullyResolved is [MachineId: Name: hmc21857, Group: PHMC, MTMS: 7063_CR1-130XXXX, NIC Address[0]: 10.177.181.91 (ac1f6b0fa944) (false), NIC Address[1]: 10.0.255.1 (ac1f6b0fa945) (true), NIC Address[2]: 0.0.0.0 (ac1f6b0fa946) (true), NIC Address[3]: 0.0.0.0 (ac1f6b0fa947) (true)]   <-- NIC Address[4] and NIC Address[5] do not exist here
2020-06-04@10:21:04.543121 00004493:000083E8 XFCSCONF: fullyResolved.getName() is hmc21857
2020-06-04@10:21:04.543127 00004493:000083E8 XFCSCONF: fullyResolved.getGroup() is PHMC
2020-06-04@10:21:04.543132 00004493:000083E8 XFCSCONF: fullyResolved.getVersion() is 1.0
2020-06-04@10:21:04.543142 00004493:000083E8 XFCSCONF: fullyResolved.getType() is 1
2020-06-04@10:21:04.543148 00004493:000083E8 XFCSCONF: fullyResolved.getFcsVersion() is 1
2020-06-04@10:21:04.543159 00004493:000083E8 XFCSCONF: Address /10.177.181.91 was site/link local, and didn't fully resolve to the requested machineId
Another example due to name resolution problems:

The code attempts to fully resolve the hostname and IP address of the HMC MachineId. In this example, hmc-beta is the actual hostname for IP address 10.254.120.2, but when the HMC was originally set up the hostname was set to hmc8-beta. This is not a match: because the MachineId did not fully resolve to exactly the same values, "unable to connect" is returned.
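
A hedged way to spot this kind of mismatch, using the names from this example: compare the hostname configured on the HMC with what DNS returns for its IP address. If the reverse lookup returns hmc-beta while the configured hostname is hmc8-beta, the MachineId cannot fully resolve and the connection fails. Both commands are already used earlier in this document:

lshmc -n -Fhostname
host 10.254.120.2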


[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SGGSNP","label":"Hardware Management Console V9"},"ARM Category":[{"code":"a8m0z000000bowEAAQ","label":"Hardware Management Console"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB57","label":"Power"}},{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"7063-CR1","label":"Hardware Management Console (7063-CR1)"},"ARM Category":[{"code":"a8m0z000000GnS5AAK","label":"HMC"}],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"9.1.0","Line of Business":{"code":"LOB57","label":"Power"}},{"Business Unit":{"code":"BU054","label":"Systems w\/TPS"},"Product":{"code":"HWQQQP_164","label":"Hardware Management Console 7042"},"ARM Category":[{"code":"a8m0z000000GnS5AAK","label":"HMC"}],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"9.1.0","Line of Business":{"code":"LOB08","label":"Cognitive Systems"}}]

Document Information

Modified date:
30 October 2023

UID

ibm11073300