Thursday, April 18, 2024

Exadata: Diagnostics using sundiag

 


 

  • SunDiag is a system exerciser; its primary purpose is to "stress" Sun hardware devices and record any errors that result from this testing.
  • SunDiag allows multiprocessing systems to run any number of tests on any or all processors.
  • By nature, this program also tests system configurations. SunDiag automatically probes for hardware devices when it is started. If the probe does not find a hardware device that is connected to the system under test, that indicates a problem.

To display the usage help for sundiag.sh:

 

[root@ec117 MegaCli]# /opt/oracle.SupportTools/sundiag.sh -h

Oracle Exadata Database Machine - Diagnostics Collection Tool

Version: 21.2.11.0.0.220414.1

By default sundiag will collect OSWatcher/ExaWatcher, Cell Metrics and traces,
if there was an alert in the last 7 days. If there is more than one alert, latest
alert is chosen to set the time range for data collection.
Time range is 8hrs prior to and 1hr after the latest alert, for the total of 9 hrs
e.g: latest alert timestamp =  2014-03-29T01:20:04-05:00
      echo  Time range = 2014-03-28_16:00:00 and 2014-03-29_01:00:00
User can specify time ranges (as explained below) or ignore_alerts, which takes
precedence over default behavior of checking for alerts

Usage: /opt/oracle.SupportTools/sundiag.sh [ignore_alerts | ilom | snapshot] [osw <time ranges>]
   osw      - This argument when used expects value of one or more comma separated
              time ranges. OSWatcher/ExaWatcher, cell metrics and traces will be gathered
              in those time ranges.
              The format for time range(s) is <from>-<to>,<from>-<to> and so on without spaces
              where <from> and <to> format is <date>_<time>
              <date> and <time> format should be any valid format that can be recognized by
              'date' command. The command 'date -d <date>' or 'date -d <time>' should be valid
              e.g: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00
              Note: Total time range should not exceed 9 hrs. Only the time ranges that
              fall within this limit are considered for the collection of above data
   ignore_alerts - Skip alert check.
   ilom     - User level ILOM data gathering option via ipmitool, in place of
              separately using root login to get ILOM snapshot over the network.
   nvme_dump - Dump cell nvme diag memory and debug binary data for vendor fault analysis.
   pmem_snapshot - Generate PMEM snapshot as part of generating PMEM info.
              This action is time consuming and will only be done when explicitely selected.
   snapshot - Collects node ILOM snapshot- requires host root password for ILOM
              to send snapshot data over the network.

For gathering ILOM  data alongside sundiag:

Usage:

# /opt/oracle.SupportTools/sundiag.sh snapshot

This runs an ILOM snapshot and includes it in a date-stamped file, /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1). Collecting a snapshot requires the host (not ILOM) 'root' password, which is used to transfer the snapshot over the network into the host's /tmp directory. This is the preferred method of ILOM data collection.

If entering the host 'root' password is a concern, run "/opt/oracle.SupportTools/sundiag.sh ilom" instead, which uses IPMI to gather user-level ILOM output. This is usually sufficient, but a full ILOM snapshot can include lower-level ILOM data for troubleshooting ILOM problems and system faults that the user-level data does not provide.

Upload this file to the Service Request.

For gathering sundiag data across a whole rack:

For versions of sundiag.sh that generate a unique filename for each node (v1.4 and later), run the following from DB01:

1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file with the list of rack hostnames is located)

2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this can take several minutes while each node runs sundiag.sh>

3. Verify there is output in /tmp or /var/log/exadatatmp/ on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag* '   (v1.4-1.5.1)

[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /var/log/exadatatmp/sundiag* '   (v12.1.2.2.0_150917)

4. Make a temporary directory to collect the files for bundling:
[root@exadb01 onecommand]# mkdir dbm01_sundiags_date

It is recommended to replace "date" with the date in YYMMDD format (year, month, day), which helps on SRs where multiple days of analysis may be required.

5. Copy the generated sundiag files from the nodes to the temporary directory, using the command that matches the output location (/tmp on v1.4-1.5.1, /var/log/exadatatmp on v12.1.2.2.0_150917):

[root@exadb01 onecommand]# for H in `cat all_group`; do  scp -p $H:/tmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done

[root@exadb01 onecommand]# for H in `cat all_group`; do  scp -p $H:/var/log/exadatatmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done

6. Bundle them into a single file for upload to Oracle:

[root@exadb01 onecommand]# tar jcvf exa_rack_sundiag_date.tar.bz2 dbm01_sundiags_date
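
Steps 4 to 6 above can be wrapped in a small script. The following is only a sketch: the function names staging_dir_name and collect_rack_sundiags are hypothetical, the rack label (dbm01) and the bundle name simply follow the examples above, and passwordless root SSH between the nodes is assumed.

```shell
# Build the staging directory name with a YYMMDD stamp, e.g. dbm01_sundiags_240418.
# The optional second argument overrides today's date.
staging_dir_name() {
    printf '%s_sundiags_%s\n' "$1" "${2:-$(date +%y%m%d)}"
}

# collect_rack_sundiags GROUP_FILE REMOTE_DIR RACK
# Copies sundiag tarballs from every host listed in GROUP_FILE into a
# date-stamped staging directory and bundles that directory for upload.
# Runs in a subshell so its variables do not leak into the caller.
collect_rack_sundiags() (
    group_file="$1"; remote_dir="$2"; rack="$3"
    stamp="$(date +%y%m%d)"
    stage="$(staging_dir_name "$rack" "$stamp")"
    mkdir -p "$stage" || exit 1
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        # </dev/null keeps scp from consuming the rest of the host list.
        scp -p "$host:$remote_dir/sundiag*.tar.bz2" "$stage"/ </dev/null \
            || echo "WARN: nothing copied from $host" >&2
    done < "$group_file"
    tar -cjvf "exa_rack_sundiag_${stamp}.tar.bz2" "$stage"
)

# Example (from the directory that holds all_group):
#   collect_rack_sundiags all_group /var/log/exadatatmp dbm01
```

Hosts that return no file only produce a warning, so one unreachable node does not stop the collection from the rest of the rack.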


For older versions of sundiag.sh (prior to v1.4), the generated filename does not include the hostname.


When gathering sundiag.sh outputs across a whole rack using dcli, the outputs may end up with the same tarball name and would overwrite each other when copied or extracted into the same directory.  To avoid this, use the following from DB01:


1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file is with the list of the rack hostnames)

2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to about 2 minutes>

3. Verify there is output in /tmp on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag* '

4. Sort them by hostname into directories, as most will have the same filename with the same date stamp:
[root@exadb01 onecommand]# for H in `cat all_group`; do mkdir -p /root/rack-sundiag/$H ; scp -p $H:/tmp/sundiag*.tar.bz2 /root/rack-sundiag/$H ; done

5. [root@exadb01 onecommand]# cd /root/rack-sundiag

6. [root@exadb01 rack-sundiag]# ls exa*
exacel01:
sundiag_2011_05_24_10_11.tar.bz2

exacel02:
sundiag_2011_05_24_10_11.tar.bz2
...
exadb08:
sundiag_2011_05_24_10_11.tar.bz2

7. [root@exadb01 rack-sundiag]# tar jcvf exa_rack_sundiag_oracle.tar.bz2 exa*
exacel01/
exacel01/sundiag_2011_05_24_10_11.tar.bz2
exacel02/
exacel02/sundiag_2011_05_24_10_11.tar.bz2
...
exadb08/
exadb08/sundiag_2011_05_24_10_11.tar.bz2

8. [root@exadb01 rack-sundiag]# ls -l exa_rack_sundiag_oracle.tar.bz2
-rw-r--r-- 1 root root 3636112 May 24 10:21 exa_rack_sundiag_oracle.tar.bz2

Upload this file to the Service Request.
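
To see why the per-host directories matter, it can help to check a collection tree for tarballs that share a filename. The sketch below uses a hypothetical helper, duplicate_tarball_names; it only inspects filenames under the directory it is given and does not touch the files.

```shell
# duplicate_tarball_names DIR
# Prints every sundiag tarball filename that occurs more than once under DIR.
# Any name printed here would collide if the files were copied into a
# single flat directory.
duplicate_tarball_names() {
    find "$1" -type f -name 'sundiag*.tar.bz2' \
        | sed 's|.*/||' \
        | sort \
        | uniq -d
}

# Example:
#   duplicate_tarball_names /root/rack-sundiag
```

With the pre-v1.4 naming shown above, every node would report the same sundiag_2011_05_24_10_11.tar.bz2, which is exactly the case the hostname directories protect against.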

 

For Sun Oracle Exadata Environments:


sundiag.sh is included in the Exadata base image in /opt/oracle.SupportTools.

For image versions 12.1.2.2.0 or later, use the copy already installed at /opt/oracle.SupportTools/sundiag.sh.

For systems on any image version prior to 12.1.2.2.0, update the existing sundiag.sh to the v12.1.2.2.0_150917 version.

Note: as of the 12.1.2.2.0 image, the version numbering scheme changed. The prior release was v1.5.1; later versions carry the image release number with a date suffix.
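
The version rule above can be checked in a script. This is a sketch under assumptions: needs_sundiag_update is a hypothetical helper, it relies on GNU sort -V for version ordering, and the example assumes the image version string is obtained with "imageinfo -ver" on the node.

```shell
# needs_sundiag_update IMAGE_VERSION
# Succeeds (exit 0) when IMAGE_VERSION sorts strictly below 12.1.2.2.0,
# which is the case where sundiag.sh should be updated manually.
needs_sundiag_update() (
    version="$1"
    threshold="12.1.2.2.0"
    # sort -V orders dotted version strings numerically, field by field.
    lowest="$(printf '%s\n%s\n' "$version" "$threshold" | sort -V | head -n 1)"
    [ "$lowest" = "$version" ] && [ "$version" != "$threshold" ]
)

# Example (the imageinfo call is an assumption about the environment):
#   if needs_sundiag_update "$(imageinfo -ver)"; then
#       echo "update sundiag.sh"
#   fi
```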


Updating sundiag.sh on older image systems.  IMPORTANT: THIS IS ONLY REQUIRED FOR IMAGES OLDER THAN 12.1.2.2.0.

1. Download file sundiag.zip attached to this Note and copy to the first compute node under /tmp.


2. Using dcli, copy the file to all the nodes and unzip it:


      # dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group -f /tmp/sundiag.zip -d /tmp
      # dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;unzip sundiag.zip;ls -l sundiag_12.1.2.2.0_150917.sh;md5sum sundiag_12.1.2.2.0_150917.sh"


The output should look like this for every node listed in the all_group file:

nodedb03: Archive: sundiag.zip
nodedb03: inflating: sundiag_12.1.2.2.0_150917.sh
nodedb03: -r-xr-xr-x 1 root root 54919 Sep 17 19:49 sundiag_12.1.2.2.0_150917.sh
nodedb03: 0e6fa48b54d7881b9fc8a252a9b068aa sundiag_12.1.2.2.0_150917.sh
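
On a full rack it is easy to miss one node with a different checksum. The following sketch filters dcli output for mismatches. check_md5_lines is a hypothetical helper, and it assumes each input line has the form "<host>: <md5> <file>", as in the sample output above.

```shell
# check_md5_lines EXPECTED_MD5
# Reads dcli-prefixed md5sum lines ("<host>: <md5>  <file>") on stdin,
# prints a MISMATCH line for each host whose checksum differs, and
# exits non-zero if any mismatch was seen.
check_md5_lines() (
    expected="$1"
    bad=0
    while read -r host sum file; do
        [ -n "$sum" ] || continue
        if [ "$sum" != "$expected" ]; then
            echo "MISMATCH ${host%:} $sum"
            bad=1
        fi
    done
    exit "$bad"
)

# Example (run on the first compute node):
#   dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group \
#       "md5sum /tmp/sundiag_12.1.2.2.0_150917.sh" \
#       | check_md5_lines 0e6fa48b54d7881b9fc8a252a9b068aa
```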

3. Back up the existing sundiag.sh, then move the new version into the default location:


     # dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /opt/oracle.SupportTools;mv sundiag.sh sundiag.sh.orig"


     # dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;mv sundiag_12.1.2.2.0_150917.sh /opt/oracle.SupportTools/sundiag.sh;md5sum /opt/oracle.SupportTools/sundiag.sh;ls -l /opt/oracle.SupportTools/*sundiag*"


4. Remove the temporary files:


    # dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;rm -fr sundiag.zip;rm -fr sundiag_12.1.2.2.0_150917.sh"



Running sundiag.sh creates a date-stamped file, /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1).
Upload this file to the Service Request.

Version 12.1.2.2.0_150917 of sundiag.sh creates the following files and directories:

asr
cell
disk
etc_configs
etc_sysconfig_net
fru-print_ipmitool.out
ilom
imagehistory-all.out
imageinfo-all.out
messages
mrdiag
net
osw
RackMasterSN
raid
SerialNumbers
stderr.txt
sysconfig
var_log_cellos
.version_sundiag
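
Before uploading, the tarball listing can be compared against the names above. This is a sketch with assumptions: missing_sundiag_entries is a hypothetical helper, and it assumes the archive holds a single top-level directory with the listed entries directly beneath it, which should be confirmed against a real tarball.

```shell
# missing_sundiag_entries NAME...
# Reads a tar listing (one path per line) on stdin and prints each NAME
# that does not appear as the second path component, i.e. directly under
# the archive's top-level directory.
missing_sundiag_entries() (
    listing="$(cat)"
    for name in "$@"; do
        if ! printf '%s\n' "$listing" \
            | awk -F/ -v n="$name" '$2 == n { found = 1 } END { exit !found }'
        then
            echo "$name"
        fi
    done
)

# Example (the path is a placeholder):
#   tar -tjf /var/log/exadatatmp/sundiag_<...>.tar.bz2 \
#       | missing_sundiag_entries asr cell disk messages raid sysconfig
```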



For gathering OS Watcher data alongside sundiag:

Usage:

# /opt/oracle.SupportTools/sundiag.sh osw <time ranges>

By default, sundiag collects OSWatcher/ExaWatcher, cell metrics and traces if there was an alert in the last 7 days. If there is more than one alert, the latest alert sets the time range for data collection: 8 hrs prior to and 1 hr after the latest alert, for a total of 9 hrs. The example given in the help text:
      latest alert timestamp = 2014-03-29T01:20:04-05:00
      Time range = 2014-03-28_16:00:00 and 2014-03-29_01:00:00

You can also specify time ranges, which take precedence over the default behavior of checking for alerts. The osw argument expects one or more comma-separated time ranges. OSWatcher/ExaWatcher, cell metrics and traces are gathered for those time ranges.

The format for time ranges is <from>-<to>,<from>-<to> and so on, without spaces, where <from> and <to> have the format <date>_<time>.
<date> and <time> can be in any format recognized by the 'date' command. The command 'date -d <date>' or 'date -d <time>' should be valid.

For Example: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00

Note: Total time range should not exceed 9 hrs. Only the time ranges that fall within this limit are considered for the collection of above data. This is to limit the amount of data being gathered to be appropriate for the problem being analysed.
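
A time-range argument can be sanity-checked before it is passed to sundiag.sh. The sketch below is not part of sundiag.sh: to_epoch, osw_total_seconds and osw_spec_ok are hypothetical helpers, GNU date is required, dates are assumed to use slashes (as in the example above) so that "-" only separates <from> and <to>, and times are interpreted as UTC to keep the arithmetic simple.

```shell
# Convert "<date>_<time>" to epoch seconds with GNU date (UTC).
to_epoch() {
    date -u -d "${1%%_*} ${1#*_}" +%s
}

# osw_total_seconds SPEC
# Prints the total number of seconds covered by a comma-separated list of
# <from>-<to> ranges. Fails if a timestamp does not parse or a range is
# empty or reversed.
osw_total_seconds() (
    total=0
    IFS=','
    for range in $1; do
        from="${range%%-*}"
        to="${range#*-}"
        start="$(to_epoch "$from")" || exit 1
        end="$(to_epoch "$to")" || exit 1
        [ "$end" -gt "$start" ] || { echo "bad range: $range" >&2; exit 1; }
        total=$(( total + end - start ))
    done
    echo "$total"
)

# osw_spec_ok SPEC
# Succeeds only if SPEC parses and covers no more than 9 hours in total.
osw_spec_ok() (
    total="$(osw_total_seconds "$1")" || exit 1
    [ "$total" -le $(( 9 * 3600 )) ]
)

# Example:
#   spec=2014/03/31_15:00:00-2014/03/31_18:00:00
#   osw_spec_ok "$spec" && /opt/oracle.SupportTools/sundiag.sh osw "$spec"
```

Note that sundiag.sh itself does not reject an over-long argument; according to the help text it only considers the ranges that fall within the 9 hr limit. The check here simply flags the problem earlier.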

Running this creates a date-stamped file, /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1), which includes the OS Watcher archive logs. These logs may be very large.

Upload this file to the Service Request.



For Sun Oracle Exadata Environments
On each Exadata compute node and storage cell, Oracle delivers a utility called sundiag.sh. By default, the sundiag.sh script is installed in /opt/oracle.SupportTools. When logging Oracle Service Requests, it is common for Oracle Support to request the output of the sundiag.sh utility.

When complete, the output of the sundiag.sh utility is stored as a date-stamped, bzip2-compressed tar file in /var/log/exadatatmp (in /tmp on v1.4-1.5.1). On both compute servers and storage cells, sundiag.sh collects the following:

  • dmesg, which contains kernel-level diagnostics from the kernel ring buffer
  • fdisk -l, which lists all disk partitions
  • lspci, which lists all PCI devices on the system
  • lsscsi, which lists all SCSI devices on the system
  • MegaCli64, which provides MegaRAID controller diagnostics
  • ipmitool sel elist, which queries the ILOM for its System Event Log entries
  • /var/log/messages
  • MegaSAS.log, which provides information about your SAS disks

When launched from an Exadata Storage Server, sundiag.sh also collects the following:

  • cellcli list cell detail
  • cellcli list celldisk detail
  • cellcli list lun detail
  • cellcli list physicaldisk detail
  • all physical disks not in a normal state
  • cellcli list griddisk detail
  • cellcli list flashcache detail
  • cellcli list alerthistory
  • storage cell alert.log
  • ms-odl.log
  • ms-odl.trc files
  • Info about PCI flash modules
  • FDOMs, by using the /usr/bin/flash_dom -l command
  • /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/scripts_aura.sh, which provides details about your disk adapters
  • Additional information about your disk devices from the /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/get_disk_devices.pl script


Generating a sundiag report in my lab environment:

[root@ec117 oracle.SupportTools]# /opt/oracle.SupportTools/sundiag.sh

 Oracle Exadata Database Machine - Diagnostics Collection Tool 

Gathering Linux information 

Skipping collection of OSWatcher/ExaWatcher logs, Cell Metrics and Traces

Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM over the network and run Snapshot separately if necessary.

/var/log/exadatatmp/sundiag_ec117__2024_04_18_02_06

Gathering dbms information

Generating diagnostics tarball and removing temp directory

==============================================================================

Done. The report files are bzip2 compressed in /var/log/exadatatmp/sundiag_ec117__2024_04_18_02_06.tar.bz2

==============================================================================
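
When sundiag.sh is run from a script, the tarball path can be picked out of the final "Done." line. This is a sketch: sundiag_report_path is a hypothetical helper, and the message wording is taken from the version 21.2.11 run shown above, so other versions may need a different pattern.

```shell
# sundiag_report_path
# Reads sundiag.sh output on stdin and prints the tarball path from the
# "Done. The report files are bzip2 compressed in <path>" line, if present.
sundiag_report_path() {
    sed -n 's/^Done\. The report files are bzip2 compressed in \(.*\.tar\.bz2\)[[:space:]]*$/\1/p'
}

# Example:
#   report="$(/opt/oracle.SupportTools/sundiag.sh 2>&1 | sundiag_report_path)"
#   ls -l "$report"
```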

 



 
