NZ TRANSPORT AGENCY WAKA KOTAHI
Released under the Official Information Act 1982

To / From: S9(2)(a)
Date: 26 June 2018
Subject: NIEMS Production Deployment

Purpose

This paper documents the production deployment of NIEMS on Friday 8 June 2018 and its subsequent rollback. It summarises the actions that were completed and identifies lessons learned.

Background

It is a 17/18 SPE milestone to launch the National Incident and Event Management System (NIEMS) in Wellington and Christchurch.

Project Progress 17/18

Following a project reset, a high-level roadmap out to 30 June 2018 was developed with the TOCs and agreed by the NIEMS PSG in November 2017, then endorsed by the TGG in December 2017. See Appendix 1. This roadmap was designed to deliver a first iteration towards the provision of consistent processes, capability and tools to manage incidents and planned events nationally, and is considered the first feature set to be delivered under the Transport Operating System.

A more detailed roadmap was refined in early April 2018 in conjunction with the TOCs and shared with the NIEMS PSG. This was further refined in May 2018. See Appendices 2 and 3.

Progress against the roadmap was reported through the monthly Status Report to the NIEMS PSG and the OS portfolio report to the TGG, as well as in quarterly SLT reporting.

The project was tracking to deliver against the 17/18 SPE milestone until May 2018, when recurring delays in the change process increased the risk to delivery timelines.

Production Deployment Planning

As part of preparing to move ILS production to the cloud, a security assurance report on the Google Cloud Platform was completed and reviewed by the NIEMS PSG in February 2018. Go-live dates were discussed and agreed with the WTOC Manager. The date was delayed several times due to a combination of Fujitsu readiness for servers and extreme weather.
The programme was initially targeting a go-live date of 10 May 2018, which was finally firmed up as 8 June 2018 for deployment. The TOC Operational Checklist Acceptance Criteria template was completed to prepare for deployment of production into the cloud. This includes sign-off for relevant functions as follows:
• Design
• Security
• Privacy
• Test Planning and Reporting
• Early Life Support
• Support Operate Phase
• Release Notes
• Disaster Recovery
• Business Sign Off
• Project Management

Training was not required as this release did not change the operator functionality.

It was agreed that the Government Chief Digital Office (GCDO) cloud assessment process for the Google Cloud Platform would be progressed in parallel with this process.

The NZTA Fujitsu Change Management Request for Change Form was submitted to Fujitsu, from which a Fujitsu work request was generated and included in their weekly change plan.

The approval to migrate the ILS database and enable ILS production to be deployed in the cloud was given as an exception to the historical change process, but was supported by the completion of these standard change forms.

Deployment

Friday 8 June, 1-5 AM

The Project Manager, 2 Developers and the shift Team Leader completed the change in conjunction with the Fujitsu DBA on shift. An implementation test exit report was completed documenting the process. See Appendix 4.

The process was completed at 4:30 AM, and an email notification advising the business of the successful data migration was sent at 4:42 AM.

Post Deployment

Four service desk requests were logged mid-to-late afternoon on 8 June after the deployment. These related to user login issues and duplicate or triplicate log entries.

The user login issues related to the ITS active directory and were all resolved by 4:30 PM on 8 June. Duplicate or triplicate log entries were occurring intermittently when populated from TREIS into NIEMS.
The duplication issue was confirmed resolved by 7:00 PM on 8 June.

Further intermittent problems presented as saving issues over the course of the weekend, and a new support ticket was raised at 7:30 PM on Sunday 10 June. The ticket was mis-categorised by Fujitsu as a P3, i.e. fix in business hours. It was not escalated to a P2 until 8:00 AM on Monday 11 June, after a discussion with Fujitsu.

Five more issues were raised on 11 June for problems presenting as:
• Details not saving or saving in the wrong event
• Event initiator unable to reopen event in TREIS
• Overall system slowness

These have been categorised as symptoms of the same underlying problem.

Early analysis suggested that the problems may have been due to database configuration settings. As these are adjustable, some changes were made to try to resolve the issue. The problems were intermittent in nature and the configuration changes did not resolve the underlying problem.

As part of the problem investigation, the developer tried reducing the number of servers, but this caused the SCATS team to lose access to ILS for a time. With a severe weather warning in place for the next 24 hours, alternative options were considered to resolve the issues.

The decision was taken at 3:00 PM, in conjunction with WTOC, to implement a temporary rollback to allow time to find a solution to the problem whilst providing a stable solution for the operators. Open tickets were manually re-entered into the old ILS.

Problem Resolution

The team completed an investigation into the user problems to identify and resolve the underlying cause of poor response times, page submit errors and system performance.

The cause was determined to be that the original ILS code used a database access library (Hibernate) at one version while the newer NIEMS release used a later version of the library, which had some subtle changes in the management of connections to the database.
As a result, when usage reached a certain volume level, connections would start to become unavailable, leading to the intermittent faults.

On Monday 11 June, the team were able to reproduce the errors, first on the staging server and later on local development systems. Code changes were made to resolve the problem. Performance load and stress tests were run and passed. Performance tests for the cloud deployment were also reproduced locally at WTOC and passed.

The following changes have been made to NIEMS:

Change
• Update the ILS code to correctly use the database library
• Improve the efficiency of the old ILS code by optimising access to the database to reduce the number of re[text illegible]

[Roadmap figure; legible labels include: UI workshop, Service Design, Incident management plans, priority SOPs]

Appendix 2 Updated NIEMS Roadmap - 6 April 2018

Project roadmap: National Incident and Event Management System (NIEMS), Draft NIEMS Road Map

[Roadmap figure; items are grouped under Completed, In Progress and To Do; item labels are not legible]

Appendix 4

Implementation Steps:

      Details                                                          Implementer(s)   Date/Time
1     Create full database backup from ILS SQL Server Database PROD    Fujitsu DBA      08/06/2018 01:01
2     Zip the backup and encrypt [text illegible]