Exadata: Diagnostics using sundiag
- SunDiag is a system exerciser; its primary purpose is to "stress" Sun hardware devices and record any errors that result from this testing.
- SunDiag allows multiprocessing systems to run any number of tests on any or all processors.
- By nature, this program also tests system configurations. SunDiag automatically probes for hardware devices when it is started. If the probe does not find a hardware device that is connected to the system under test, that indicates a problem.
Help and usage for sundiag.sh:
[root@ec117 MegaCli]# /opt/oracle.SupportTools/sundiag.sh -h
Oracle Exadata Database Machine - Diagnostics Collection Tool
Version: 21.2.11.0.0.220414.1
By default sundiag will collect OSWatcher/ExaWatcher, Cell Metrics and traces,
if there was an alert in the last 7 days. If there is more than one alert, latest
alert is chosen to set the time range for data collection.
Time range is 8hrs prior to and 1hr after the latest alert, for the total of 9 hrs
e.g: latest alert timestamp = 2014-03-29T01:20:04-05:00
echo Time range = 2014-03-28_16:00:00 and 2014-03-29_01:00:00
User can specify time ranges (as explained below) or ignore_alerts, which takes
precedence over default behavior of checking for alerts
Usage: /opt/oracle.SupportTools/sundiag.sh [ignore_alerts | ilom | snapshot] [osw <time ranges>]
  osw           - This argument when used expects value of one or more comma separated
                  time ranges. OSWatcher/ExaWatcher, cell metrics and traces will be gathered
                  in those time ranges.
                  The format for time range(s) is <from>-<to>,<from>-<to> and so on without spaces
                  where <from> and <to> format is <date>_<time>
                  <date> and <time> format should be any valid format that can be recognized by
                  'date' command. The command 'date -d <date>' or 'date -d <time>' should be valid
                  e.g: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00
                  Note: Total time range should not exceed 9 hrs. Only the time ranges that
                  fall within this limit are considered for the collection of above data
  ignore_alerts - Skip alert check.
  ilom          - User level ILOM data gathering option via ipmitool, in place of
                  separately using root login to get ILOM snapshot over the network.
  nvme_dump     - Dump cell nvme diag memory and debug binary data for vendor fault analysis.
  pmem_snapshot - Generate PMEM snapshot as part of generating PMEM info.
                  This action is time consuming and will only be done when explicitely selected.
  snapshot      - Collects node ILOM snapshot- requires host root password for ILOM
                  to send snapshot data over the network.
For gathering ILOM data alongside sundiag:
Usage:
# /opt/oracle.SupportTools/sundiag.sh snapshot
Execution will create a date stamped tar.bz2 file
in /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2
(in /tmp on v1.4-1.5.1) which includes running an ILOM snapshot. In order to
collect a snapshot, the host (not ILOM) 'root' password is required to
facilitate network transfer of the snapshot into the /tmp directory. This
is the preferred method of ILOM data collection.
If there are concerns about entering the host 'root' password, an alternative
is to run "/opt/oracle.SupportTools/sundiag.sh ilom", which uses IPMI to gather
user-level ILOM outputs. This is usually sufficient, but an ILOM snapshot
provides more underlying ILOM output for troubleshooting issues with ILOM and
system faults than the user-level data does.
Upload this file to the Service Request.
For gathering sundiag data across a whole rack:
For versions of sundiag.sh that generate a unique filename on each node
(v1.4 and later), use the following from DB01:
1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file with the list of the rack hostnames is located)
2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to several minutes while each node runs sundiag.sh>
3. Verify there is output in /tmp or /var/log/exadatatmp/ on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag*' (v1.4-1.5.1)
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /var/log/exadatatmp/sundiag*' (v12.1.2.2.0_150917)
4. Make a temporary directory to collect the files for zipping:
[root@exadb01 onecommand]# mkdir dbm01_sundiags_date
It is recommended that the date use the format YYMMDD (year, month, day) for
SRs where multiple days of analysis may be required.
5. Copy the generated sundiag files from the nodes to the temporary
directory (/tmp on v1.4-1.5.1, /var/log/exadatatmp on v12.1.2.2.0_150917):
[root@exadb01 onecommand]# for H in `cat all_group`; do scp -p $H:/tmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done
[root@exadb01 onecommand]# for H in `cat all_group`; do scp -p $H:/var/log/exadatatmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done
6. Bundle them into a single file for upload to Oracle:
[root@exadb01 onecommand]# tar jcvf exa_rack_sundiag_date.tar.bz2 dbm01_sundiags_date
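Steps 4 to 6 above can be reviewed as a dry run before anything is copied. The following is only a sketch: plan_rack_collection is a hypothetical helper name, not part of sundiag.sh or dcli, and it simply prints the mkdir, scp and tar commands for a given group file, rack name and remote directory, using a YYMMDD stamp as recommended in step 4.

```shell
#!/bin/sh
# Sketch: print (not run) the commands for steps 4-6 so they can be reviewed.
# plan_rack_collection is a hypothetical helper, not an Oracle-supplied tool.

# plan_rack_collection <group_file> <rack_name> <remote_dir> [date_stamp]
plan_rack_collection() {
    group_file=$1
    rack=$2
    remote_dir=$3
    stamp=${4:-$(date +%y%m%d)}        # YYMMDD; defaults to today
    dest="${rack}_sundiags_${stamp}"

    echo "mkdir -p $dest"
    while IFS= read -r host; do
        [ -n "$host" ] || continue     # skip blank lines in the group file
        echo "scp -p $host:$remote_dir/sundiag*.tar.bz2 $dest"
    done < "$group_file"
    echo "tar jcvf exa_rack_sundiag_${stamp}.tar.bz2 $dest"
}
```

For example, plan_rack_collection all_group dbm01 /var/log/exadatatmp prints one scp line per host in all_group; the output can be inspected and then piped to sh once it looks right.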
For gathering sundiag.sh outputs on older versions of sundiag.sh (prior to
v1.4), where the generated filename does not include the hostname:
When gathering sundiag.sh outputs across a whole rack using dcli, the outputs
may end up with the same tarball name and will overwrite each other upon
unzipping. To avoid this, use the following from DB01:
1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file with the list of the rack hostnames is located)
2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to about 2 minutes>
3. Verify there is output in /tmp on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag*'
4. Sort them by hostname into directories, as they will likely mostly have the
same filename with the same date stamp:
[root@exadb01 onecommand]# for H in `cat all_group`; do mkdir -p /root/rack-sundiag/$H ; scp -p $H:/tmp/sundiag*.tar.bz2 /root/rack-sundiag/$H ; done
5. [root@exadb01 onecommand]# cd /root/rack-sundiag
6. [root@exadb01 rack-sundiag]# ls exa*
exacel01:
sundiag_2011_05_24_10_11.tar.bz2
exacel02:
sundiag_2011_05_24_10_11.tar.bz2
...
exadb08:
sundiag_2011_05_24_10_11.tar.bz2
7. [root@exadb01 rack-sundiag]# tar jcvf exa_rack_sundiag_oracle.tar.bz2 exa*
exacel01/
exacel01/sundiag_2011_05_24_10_11.tar.bz2
exacel02/
exacel02/sundiag_2011_05_24_10_11.tar.bz2
...
exadb08/
exadb08/sundiag_2011_05_24_10_11.tar.bz2
8. [root@exadb01 rack-sundiag]# ls -l exa_rack_sundiag_oracle.tar.bz2
-rw-r--r-- 1 root root 3636112 May 24 10:21 exa_rack_sundiag_oracle.tar.bz2
Upload this file to the Service Request.
For Sun Oracle Exadata Environments:
sundiag.sh is included in the Exadata base image in /opt/oracle.SupportTools:
For image versions 12.1.2.2.0 or later, use the sundiag.sh already
included in /opt/oracle.SupportTools/sundiag.sh.
For systems using any image version prior to 12.1.2.2.0, update the
existing sundiag.sh to the latest v12.1.2.2.0_150917 version.
Note: as of the 12.1.2.2.0 image, the version number scheme changed from v1.5.1
(the prior release) to version numbers matching the image release with date.
Updating sundiag.sh to the latest version on older image systems:
IMPORTANT: THIS IS ONLY REQUIRED FOR OLDER IMAGES BEFORE 12.1.2.2.0.
1. Download file sundiag.zip attached to this Note and copy it to the first compute node under /tmp.
2. Using dcli, copy the file to all the nodes and unzip it:
# dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group -f /tmp/sundiag.zip -d /tmp
# dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;unzip sundiag.zip;ls -l sundiag_12.1.2.2.0_150917.sh;md5sum sundiag_12.1.2.2.0_150917.sh"
Output should look like this for all the nodes referenced in file all_group:
nodedb03: Archive: sundiag.zip
nodedb03: inflating: sundiag_12.1.2.2.0_150917.sh
nodedb03: -r-xr-xr-x 1 root root 54919 Sep 17 19:49 sundiag_12.1.2.2.0_150917.sh
nodedb03: 0e6fa48b54d7881b9fc8a252a9b068aa sundiag_12.1.2.2.0_150917.sh
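On a large rack it is easy to miss one node with a different checksum in that output. A minimal sketch of an automated check follows; check_md5 is a hypothetical helper, and it assumes every md5sum line has the form "<node>: <32 hex characters> <filename>", as in the sample output above.

```shell
#!/bin/sh
# Sketch: scan dcli output on stdin for md5sum lines and report nodes whose
# checksum differs from the expected value. check_md5 is a hypothetical helper.

# check_md5 <expected_md5>
# Prints "MISMATCH <node> <md5>" for each differing node. Exit status is
# non-zero if any mismatch is found or if no md5sum line was seen at all.
check_md5() {
    awk -v want="$1" '
        $2 ~ /^[0-9a-f]+$/ && length($2) == 32 {
            seen++
            node = $1
            sub(/:$/, "", node)          # drop the trailing colon dcli adds
            if ($2 != want) { bad++; print "MISMATCH", node, $2 }
        }
        END { exit ((bad > 0 || seen == 0) ? 1 : 0) }
    '
}
```

A possible use is to pipe the dcli command from step 2 into check_md5 0e6fa48b54d7881b9fc8a252a9b068aa and proceed to step 3 only if it prints nothing and exits with status 0.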
3. Copy the new version of sundiag.sh to the default location:
# dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /opt/oracle.SupportTools;mv sundiag.sh sundiag.sh.orig"
# dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;mv sundiag_12.1.2.2.0_150917.sh /opt/oracle.SupportTools/sundiag.sh;md5sum /opt/oracle.SupportTools/sundiag.sh;ls -l /opt/oracle.SupportTools/*sundiag*"
4. Remove temporary files:
# dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;rm -fr sundiag.zip;rm -fr sundiag_12.1.2.2.0_150917.sh"
Execution will create a date stamped tar.bz2 file in
/var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2
(in /tmp on v1.4-1.5.1).
Upload this file to the Service Request.
This is the list of files created by version 12.1.2.2.0_150917 of sundiag.sh:
asr
cell
disk
etc_configs
etc_sysconfig_net
fru-print_ipmitool.out
ilom
imagehistory-all.out
imageinfo-all.out
messages
mrdiag
net
osw
RackMasterSN
raid
SerialNumbers
stderr.txt
sysconfig
var_log_cellos
.version_sundiag
For gathering OS Watcher data alongside sundiag:
Usage:
# /opt/oracle.SupportTools/sundiag.sh osw <time ranges>
By default sundiag will collect OSWatcher/ExaWatcher, Cell Metrics
and traces if there was an alert in the last 7 days. If there is more than one
alert, the latest alert is chosen to set the time range for data collection.
The time range is 8 hrs prior to and 1 hr after the latest alert, for a total of
9 hrs. The example given in the help text is:
latest alert timestamp = 2014-03-29T01:20:04-05:00
Time range = 2014-03-28_16:00:00 and 2014-03-29_01:00:00
The user can also specify time ranges, which take precedence over the
default behavior of checking for alerts. The osw argument expects a
value of one or more comma-separated time ranges. OSWatcher/ExaWatcher, cell
metrics and traces will be gathered in those time ranges.
The format for time range(s) is
<from>-<to>,<from>-<to> and so on, without spaces,
where the <from> and <to> format is <date>_<time>.
The <date> and <time> format can be any valid format that is
recognized by the 'date' command. The command 'date -d <date>' or 'date -d
<time>' should be valid.
For example: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00
Note: The total time range should not exceed 9 hrs. Only the time
ranges that fall within this limit are considered for the collection of the above
data. This limits the amount of data gathered to what is appropriate for
the problem being analysed.
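Because ranges beyond the 9-hour limit are dropped rather than rejected, it can help to pre-check the argument before running sundiag.sh. The sketch below is not part of sundiag.sh; check_osw_ranges and to_epoch are hypothetical helper names. It assumes GNU date and the slash-separated date form used in the example (YYYY/MM/DD_HH:MM:SS), so that "-" only ever separates <from> and <to>.

```shell
#!/bin/sh
# Sketch: pre-check an "osw" time-range argument. Hypothetical helpers only;
# assumes GNU date and dates written as YYYY/MM/DD_HH:MM:SS.

MAX_SECONDS=$((9 * 3600))   # 9-hour total limit stated in the help text

# to_epoch <date>_<time>: print seconds since the epoch (UTC), or fail if
# 'date -d' cannot parse the value.
to_epoch() {
    date -u -d "$(printf '%s' "$1" | tr '_' ' ')" +%s 2>/dev/null
}

# check_osw_ranges "<from>-<to>[,<from>-<to>...]"
# Prints the total span in seconds. Returns non-zero if a range cannot be
# parsed, ends before it starts, or the total exceeds 9 hours.
check_osw_ranges() {
    [ -n "$1" ] || return 1
    total=0
    old_ifs=$IFS
    IFS=','
    for range in $1; do
        IFS=$old_ifs
        from=${range%%-*}               # text before the first "-"
        to=${range#*-}                  # text after the first "-"
        start=$(to_epoch "$from") || return 1
        end=$(to_epoch "$to") || return 1
        [ "$end" -gt "$start" ] || return 1
        total=$((total + end - start))
    done
    IFS=$old_ifs
    echo "$total"
    [ "$total" -le "$MAX_SECONDS" ]
}
```

With the example range above, check_osw_ranges 2014/03/31_15:00:00-2014/03/31_18:00:00 prints 10800 (3 hours) and succeeds, so the same string can then be passed to sundiag.sh osw.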
Execution will create a date stamped tar.bz2 file in
/var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2
(in /tmp on v1.4-1.5.1) including OS Watcher archive logs. These logs may be
very large.
Upload this file to the Service Request.
For Sun Oracle Exadata Environments
On each Exadata compute node and storage cell, Oracle delivers a utility
called sundiag.sh. By default the sundiag.sh script is installed in
/opt/oracle.SupportTools. When logging Oracle Service Requests, it is common for
Oracle Support to request the output of the sundiag.sh utility.
When complete, the output of the sundiag.sh utility is stored as a
date-stamped, bzip2-compressed tar file in /tmp (v1.4-1.5.1) or
/var/log/exadatatmp (v12.1.2.2.0 and later). On both compute servers
and storage cells, sundiag.sh generates the following output:
- dmesg, which contains kernel-level diagnostics from the kernel ring buffer
- fdisk -l, which contains a list of all disk partitions
- lspci, which contains a list of all PCI devices on the system
- lsscsi, which contains a list of all SCSI devices on the system
- MegaCli64, which provides MegaRAID controller diagnostics
- ipmitool sel elist, which queries the ILOM interface for the System Event Log (SEL) entries
- /var/log/messages
- MegaSAS.log, which provides information about your SAS disks
When launched from an Exadata Storage Server, sundiag.sh also collects the output of the following:
- cellcli list cell detail
- cellcli list celldisk detail
- cellcli list lun detail
- cellcli list physicaldisk detail
- all physical disks not in a normal state
- cellcli list griddisk detail
- cellcli list flashcache detail
- cellcli list alerthistory
- storage cell alert.log
- ms-odl.log
- ms-odl.trc files
- Info about PCI flash modules
- FDOMs, by using the /usr/bin/flash_dom -l command
- /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/scripts_aura.sh, which provides details about your disk adapters
- Additional information about your disk devices from the /opt/oracle/cell/cellsrv/deploy/scripts/unix/hwadapter/diskadp/get_disk_devices.pl script
Generating a sundiag report in my lab environment
[root@ec117 oracle.SupportTools]# /opt/oracle.SupportTools/sundiag.sh
Gathering Linux information
Skipping collection of OSWatcher/ExaWatcher logs, Cell Metrics and Traces
Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM over the network and run Snapshot separately if necessary.
/var/log/exadatatmp/sundiag_ec117__2024_04_18_02_06
Gathering dbms information
Generating diagnostics tarball and removing temp directory
==============================================================================
Done. The report files are bzip2 compressed in
/var/log/exadatatmp/sundiag_ec117__2024_04_18_02_06.tar.bz2
==============================================================================
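Note that the serial number field is empty in this lab run, which is why the filename contains a double underscore. When sorting tarballs from many nodes, the fields of sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 can be split out with a small helper. This is only a sketch: parse_sundiag_name is a hypothetical name, and it assumes the date/time part is always YYYY_MM_DD_HH_MM as in the output above, and that neither the hostname nor the serial number contains an underscore.

```shell
#!/bin/sh
# Sketch: split a sundiag tarball name into hostname, serial and timestamp.
# parse_sundiag_name is a hypothetical helper; it assumes the timestamp is
# YYYY_MM_DD_HH_MM and that hostname and serial contain no underscores.

# parse_sundiag_name <path>
# Prints "host=<h> serial=<s> timestamp=<t>"; returns non-zero if no match.
parse_sundiag_name() {
    base=${1##*/}                 # strip any leading directory
    base=${base%.tar.bz2}         # strip the archive suffix
    parsed=$(printf '%s\n' "$base" | sed -n -E \
        's/^sundiag_(.*)_([^_]*)_([0-9]{4}(_[0-9]{2}){4})$/host=\1 serial=\2 timestamp=\3/p')
    [ -n "$parsed" ] || return 1
    printf '%s\n' "$parsed"
}
```

For the file generated above, parse_sundiag_name /var/log/exadatatmp/sundiag_ec117__2024_04_18_02_06.tar.bz2 prints host=ec117 serial= timestamp=2024_04_18_02_06.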