
Storage Area Network Quarterly Report - Q2 2009

Report Period: April 2009 - June 2009

General Service Availability

There were two breaks in service availability during this period.

23.04.2009 - Volume failure during addition of new disks in Array 3 - see Appendix A

  • Exchange: 45 minutes intermittent connection and cluster failovers, plus one hour shutdown
  • ASM VMWare: One hour shutdown
  • Student email: One hour shutdown
  • Goya/Constable/Picasso: 45 minutes intermittent connection

12.06.2009 - ESM failure caused storage controller reboot - see Appendix B

  • Sapphire/Greco: 30 minutes unavailable during service hours
  • Foo/Baz/Sirius: 7.5 hours unavailable during service hours

Due to the complexity of the system, the exact definition of the SLA target has yet to be specified. It is likely to be a combined SLA for all connected systems, with the overall target being that a percentage of systems meet their individual SLA.

The basic availability figures of the affected systems (based on a 24/7 service clock) are as follows:

System                   Downtime (mins)   Availability
Exchange                 105               99.92%
VMWare¹                  60                99.95%
Student email            60                99.95%
Goya/Constable/Picasso   45                99.97%
Sapphire/Greco           30                99.98%
Foo/Baz/Sirius²          450               99.65%

¹ Two VMWare systems (one production, one not) suffered lost data
² It is likely that this downtime would not have happened had these systems been running multi-path storage drivers
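
The availability figures above follow directly from the downtime minutes on a 24/7 clock: availability = (period - downtime) / period. As a rough cross-check, and as one possible (purely illustrative) reading of the proposed combined SLA, the Python sketch below recomputes the percentages for the 91-day quarter; the 99.9% individual target and the variable names are assumptions, not agreed figures.

    # Availability cross-check for the Q2 figures above - a sketch, not the
    # official SLA calculation (the SLA definition is still to be agreed).
    PERIOD_MINUTES = 91 * 24 * 60  # April-June 2009 on a 24/7 service clock

    downtime_minutes = {
        "Exchange": 105,
        "VMWare": 60,
        "Student email": 60,
        "Goya/Constable/Picasso": 45,
        "Sapphire/Greco": 30,
        "Foo/Baz/Sirius": 450,
    }

    availability = {
        system: 100.0 * (PERIOD_MINUTES - mins) / PERIOD_MINUTES
        for system, mins in downtime_minutes.items()
    }
    for system, pct in availability.items():
        print(f"{system}: {pct:.2f}%")

    # One possible reading of the proposed combined SLA: the percentage of
    # systems meeting an individual target (99.9% here is illustrative only).
    INDIVIDUAL_TARGET = 99.9
    met = sum(1 for pct in availability.values() if pct >= INDIVIDUAL_TARGET)
    print(f"{met} of {len(availability)} systems met {INDIVIDUAL_TARGET}%")

Rounding convention and the exact period length shift the final digit: straight rounding over 91 days gives 99.66% for Foo/Baz/Sirius rather than the 99.65% in the table.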

For details on these two outages, see Appendices at the end of this page.

SAN Storage details

Summary

  TOTAL Array1 Array2 Array3 Array4
Raw capacity total 188.08TB 68.29TB 46.03TB 23.42TB 50.35TB
Raw capacity allocated 167.44TB 65.75TB 34.25TB 23.42TB 44.04TB
Raw capacity unused 20.64TB 2.55TB 11.79TB 0TB 6.32TB
           
Volume groups allocated 121.69TB 54.68TB 26.48TB 17.26TB 33.89TB
Parity/Hot Spare allocated 35.15TB 11.07TB 7.77TB 6.17TB 10.15TB
           
Disks used 564 199 107 114 144
Disks spare 67 7 33 0 27
           
Storage Partitions licensed   16 64 16 16
Storage Partitions used   12 8 6 12
Storage Partitions available   4 56 10 4
           
Tray spaces 40 0 0 19 21
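
The derived rows in this summary follow from the others: raw capacity unused is raw total minus raw allocated, raw allocated equals volume groups plus parity/hot spare, and partitions available are licensed minus used. A minimal Python cross-check using only the published per-array figures (which are already rounded to 0.01TB, so differences of up to 0.01TB against the published rows are expected):

    # Cross-check of the derived rows in the SAN storage summary table above.
    arrays = {
        #          raw total, raw alloc, vol groups, parity/spare, partitions licensed, used
        "Array1": (68.29, 65.75, 54.68, 11.07, 16, 12),
        "Array2": (46.03, 34.25, 26.48, 7.77, 64, 8),
        "Array3": (23.42, 23.42, 17.26, 6.17, 16, 6),
        "Array4": (50.35, 44.04, 33.89, 10.15, 16, 12),
    }

    for name, (total, alloc, vg, parity, licensed, used) in arrays.items():
        unused = total - alloc             # "Raw capacity unused" row
        alloc_check = vg + parity          # should match "Raw capacity allocated"
        partitions_free = licensed - used  # "Storage Partitions available" row
        print(f"{name}: unused {unused:.2f}TB, "
              f"volume groups + parity {alloc_check:.2f}TB (published {alloc:.2f}TB), "
              f"partitions available {partitions_free}")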

 

Departmental / group usage summary

SAN Usage April-June 2009

 

SAN controller drive loop utilisation

 

 

                      Drive channels                     Tray spaces left
                      A1/A2   A3/A4   B1/B2   B3/B4      Loop 1   Loop 2
Array 1                                                  1        0
   CSM trays          1       3       3       1
   Old trays          5       4       4       5
Array 2                                                  4        0
   CSM trays          1       3       3       1
   Old trays          2       4       4       2
Array 3                                                  4        0
   CSM trays          3       0       0       3
   Old trays          0       8       8       0
Array 4                                                  4        0
   CSM trays          3       0       0       3
   Old trays          0       8       8       0

 

Storage Network (fabric) details

Fibre Channel port utilisation

Site                 Fabric A             Fabric B
                     In use   Available   In use   Available
UH Machine Room      23       24          22       26
CSC Machine Room     6        18          0        24
Bio Sciences         0        14          0        0
Maths (Backup)       0        14          0        0
WBS Machine Room     8        16          0        24
ITS Machine Room     8        14          8        16
Westwood Test Lab    -        -           -        -
Total                45       100         30       90
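
The Total row is simply the column sums. For rough capacity planning, the same figures give a per-fabric port utilisation, assuming "Available" counts free ports so that total ports = in use + available; a minimal Python sketch:

    # Per-fabric port totals and rough utilisation from the table above.
    # Assumes "Available" means free ports, i.e. total = in use + available.
    sites = {
        # site:              (A in use, A available, B in use, B available)
        "UH Machine Room":   (23, 24, 22, 26),
        "CSC Machine Room":  (6, 18, 0, 24),
        "Bio Sciences":      (0, 14, 0, 0),
        "Maths (Backup)":    (0, 14, 0, 0),
        "WBS Machine Room":  (8, 16, 0, 24),
        "ITS Machine Room":  (8, 14, 8, 16),
    }

    for fabric, (use_col, avail_col) in {"Fabric A": (0, 1), "Fabric B": (2, 3)}.items():
        in_use = sum(row[use_col] for row in sites.values())
        free = sum(row[avail_col] for row in sites.values())
        total = in_use + free
        print(f"{fabric}: {in_use} of {total} ports in use ({100.0 * in_use / total:.0f}%)")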

 

Inter-switch link (ISL) utilisation

Statistics not yet available

 

General SAN storage service

Errors / faults / Incidents

Fault                         Fix      Service Impact  Callout  Count  Fix time
Mirror desynchronisation      Manual   None            No       3      <2h
Read error                    Auto     None            No       2      n/a
Volume communication failure  Auto     None            No       3      ~120 sec
Drive failure                 Replace  None            Yes      1      6 days¹
Controller reset              Auto     Extensive²      Yes      2      Variable

¹ The drive failure happened either side of a Major Incident. There was no impact to service from the drive failure, as Hot Spare replacement was automatic.
² Both controller resets were due to the faults described in the appendices. Impact was variable, ranging from 30 minutes' downtime to loss of data, depending on the system.


Service requests

Request Count
New volumes from existing storage 19
Volumes decommissioned 2
Volumes extended 0
Volumes reconfigured 2
Quotation for new storage 2

 


Significant changes this quarter

Two new disk trays added to Arrays 1 and 2. Implementation of these trays was significantly delayed, due in part to the fault described in Appendix A.

Sun 5320 storage gateway installed and undergoing initial testing, specifically iSCSI tests.

Nexsan SATAbeast storage unit evaluation completed. The units are considered a valid option for providing more economical bulk storage, and are intended to be deployed in specific roles within the general storage service.


Significant changes planned for the future

Move of UOW03 (Exchange) storage node to new University House switches

  • this is now on hold
  • changes to the Exchange infrastructure require more extensive planning and change control work, which may not be justified in this case, considering the proposed changes in the email services.

Move of WBS storage node to CSC Machine Room

  • requires approximately one day downtime for some systems (mainly CSC)
  • requires consultation with and assistance from Sun
  • due to problems encountered this quarter, expected date now Summer 2009

Upgrade of storage controllers to latest firmware

  • requires adequate backup of all data
  • involves significant downtime for all attached systems.
  • approximately 2 hours per array
  • likely split over two days
  • expected date Summer 2009

 

 

Appendix A

Fault
23.04.2009 - Volume failure during addition of new disks in Array 3

Cause
This has been identified by Sun and LSI as resulting from a firmware fault on the tray hardware. It is possible for the current firmware, in conjunction with array software version 6.60, to cause newly inserted disks to replace disks in established volume groups. Where two or more disks are replaced in this way, the volume will fail, and data may be lost.

Impact
The fault resulted in three main impacts.

  • Loss of data from one IT Services VMWare volume group. This was largely supporting test and pre-production systems.
  • Intermittent loss of disk connectivity for Exchange and three Netware file servers. It took approximately 45 minutes for all machines to return to a stable state.
  • One hour voluntary shutdown for several systems while the controller was rebooted and attempts were made to revive the VMWare volume group. This affected Microsoft Exchange, ASM VMWare systems and student email. Other systems were left running and monitored throughout.

Total downtime

  • Exchange: 45 minutes intermittent connection and cluster failovers, one hour shutdown
  • ASM VMWare: One hour shutdown
  • Student email: One hour shutdown
  • Goya/Constable/Picasso: 45 minutes intermittent connection

Resolution
It is sometimes possible, in conjunction with Sun support, to revive failed disks and recover the data. This was not possible in this case. The fault can be prevented by updating the storage tray firmware to a minimum of 98C4 for the CSM200 enclosures and 9681 for the FLA300 enclosures.
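
As a small illustration of how the required levels could be tracked, the Python sketch below flags trays that are not at a known-good firmware level for their enclosure type. The tray inventory shown is invented purely for illustration, and real firmware comparison should follow the vendor's own versioning rules; this simplification only checks membership of a known-good list.

    # Flag storage trays not at a known-good firmware level (sketch only).
    # The minimum levels are those stated by Sun/LSI above; the inventory
    # below is hypothetical, purely for illustration.
    KNOWN_GOOD = {
        "CSM200": {"98C4"},   # add newer levels here as they are qualified
        "FLA300": {"9681"},
    }

    trays = [
        # (array, tray, enclosure type, reported firmware) - invented examples
        ("Array 3", "Tray 1", "CSM200", "98C1"),
        ("Array 3", "Tray 2", "FLA300", "9681"),
        ("Array 4", "Tray 1", "CSM200", "98C4"),
    ]

    for array, tray, model, firmware in trays:
        if firmware not in KNOWN_GOOD.get(model, set()):
            print(f"{array} {tray} ({model}): firmware {firmware} needs updating")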

Notes
This fault has had a significant impact on the storage service as a whole: until the firmware is updated, any similar changes on Arrays 3 and 4 must include downtime, and all other changes require increased change control effort to prepare and authorise them.

Appendix B

Fault
12.06.2009 - ESM failure caused storage controller reboot

Cause
The root cause of this problem has been identified by Sun and LSI as out-of-date ESM firmware, which needs to be updated.

Impact
The fault resulted in three main impacts.

  • Reboot of the controller at 04:43 caused loss of connection for servers SAPPHIRE and GRECO. These were restored to service by 09:25.
  • Reboot of the controller (and a related switch of fabric path) caused loss of connection for CSC servers FOO, BAZ and SIRIUS, because these servers do not have a multi-path driver installed. This is a known issue with their configuration. The servers were without connection until the disk fault described below was rectified and all volume groups were manually swapped back to their original fabric path at 16:30.
  • At the controller reboot, the affected storage tray reported four failed disks, affecting two CSC volume groups. These were restored to service, with the assistance of Sun support, by 16:30.

Total downtime

  • Sapphire/Greco: 30 minutes unavailable during service hours
  • Foo/Baz/Sirius: 7.5 hours unavailable during service hours

Resolution
The controller reboot was unexpected and should not have happened as a result of the ESM failure. The newer ESM firmware contains a feature called "Bad Zone Recovery" that may have avoided some or all of the faults seen with this incident.
A multi-path driver installed in the Linux servers should have prevented the loss of connectivity for server BAZ. There would still have been some impact on FOO and SIRIUS; however, this should have been restricted to the volumes directly affected by the disk failures and their associated file systems, including any larger file system built by combining more than one such volume.

Notes
There is now a community-approved multi-path driver available for Linux; testing of it is to be scheduled in conjunction with the CSC system admins. Provided this is successful, it should prevent future loss of connection caused by path failover.