Vault Authentication Service Outage - Summary
May 15 2024,9:30am PDT
Outage Retrospective and Next Steps (15 May 2024)
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.
The Incident Report identified six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
The detailed work to size and scope the specific actions is now underway, with a target of 7 August for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other, similar code inefficiencies. Changes resulting from this review will be made available once fully tested. More awareness and oversight have also been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
For further details, please refer to the Incident Report.
May 31 2024,2:22pm PDT
Outage Retrospective (31 May 2024)
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
Recall from the last update (15 May) that the Incident Report (IR) was made available on 7 May. The IR identified six areas of Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
To date, there are 12 groupings of Development efforts applied across the 6 CAPAs. Each grouping (or "epic") encompasses a range of discrete efforts. Many efforts have been completed, many are in progress, and others have been or are being scoped and assigned.
To illustrate, one epic exists to reduce Vault Auth calls by caching operational metadata information. This particular epic pertains to the CAPA: VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE. There are 100 discrete work efforts within this epic. As of this update, 43 of the items have been addressed.
Many of the other epics are of a similar scale and composition.
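As a purely illustrative sketch of the kind of change the caching epic describes, the following shows how a time-limited cache placed in front of a remote lookup can reduce repeated calls. It does not represent Vault's actual implementation: the class name `TTLCache`, the function `fake_auth_metadata_lookup`, the key `"vault-123"`, and the 60-second lifetime are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Caches the result of an expensive lookup for a fixed number of seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # the remote call we want to make less often
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock          # injectable clock, useful for testing
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.remote_calls = 0        # counts how often the remote call was made

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve from the cache while the stored value is still fresh.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise call the remote service once and remember the answer.
        self.remote_calls += 1
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


def fake_auth_metadata_lookup(key: str) -> dict:
    # Hypothetical stand-in for a remote metadata request to an auth service.
    return {"key": key, "source": "auth-service"}


cache = TTLCache(fake_auth_metadata_lookup, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("vault-123")
print(cache.remote_calls)  # 1: the other 999 reads were served from the cache
```

The trade-off in any such design is staleness: a longer lifetime saves more remote calls but delays the point at which a metadata change becomes visible to callers.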
The work done to categorize, size, assign, and complete these items puts us in a good position to meet the 7 August target for determining completion dates for all discrete efforts within each CAPA.
This work is over and above the awareness and oversight given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
August 12 2024,1:28pm PDT
Outage Retrospective and Next Steps (12 August 2024)
The following is an update regarding actions taken by Vault teams to address the root causes of the Vault Auth outages that occurred on 22, 23, and 24 April 2024.
Recall that an Incident Report was provided on 7 May 2024 that detailed the changes implemented up to that time, particularly on the evening of 24 April 2024, that led to Vault Authentication Services (“Vault Auth”) stability. (The Incident Report is accessible via your Veeva account team members.)
The Incident Report further outlined six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
Each CAPA represents a significant effort on its own. In the months since the outage events, Vault teams have engaged in concentrated efforts to complete the CAPAs. The initial target of 7 August 2024 for determining completion dates for the six CAPAs has been met. Beyond sizing and scoping the remaining items, much work has already gone into addressing each CAPA. Because this work began at the time of the event and has continued unabated since, Vault Auth pressure has been significantly reduced and Vault Auth stability has been attained.
Remaining efforts now mostly entail complex architectural modifications requiring multiple releases to implement. The aim of these efforts is to allow for continued Vault platform growth and stability for years to come.
What follows is a summary of each CAPA with its anticipated release completion timeframe.
VAULT AUTHENTICATION SERVICES ARCHITECTURE
[DEV-714916] Enhancements to Vault Auth architecture
Target release for completion: 25R1
A key aspect of this CAPA is to relieve Vault Auth of every responsibility except user authentication handling, a significant architectural change. This and other efforts represented by this CAPA are divided among 4 separate development threads. They encompass, in total, 45 individual work efforts, 22 of which have been closed. The remaining items span multiple releases given their complexity and scope.
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
[DEV-714918] Enhancements to Vault application code and Vault infrastructure to reduce Vault Auth calls
Target release for completion: 24R3.3
This CAPA comprises 8 major efforts. Combined, these efforts specify 326 individual work efforts, of which 284 are completed. Remaining efforts targeted for the 24R3.3 Limited Release will materialize in the 25R1 General Release.
CONTINUED INVESTIGATION
[DEV-714923] A comprehensive investigation continues into the outages
Target release for completion: 25R1
18 of the 20 tasks associated with this effort have been completed. The CAPA remains open to allow for the remaining two items to be accomplished in a deliberative manner.
TEST ENHANCEMENTS
[DEV-714924] Expand test processes to identify excessive Vault Auth accesses earlier in the development cycle
RESOLVED
This CAPA consisted of 3 discrete efforts that stress tested Vault Auth and led to scripts that can be reapplied in ongoing performance testing.
MONITORING AND LOGGING
[DEV-714926] Improve Vault monitoring tools to support more granular analysis when events occur
Target release for completion: 25R1
There are a total of 9 tickets constituting this CAPA, with 4 completed to date. Remaining items are architecturally complex and will take time to implement.
VAULT CAPACITY PLANNING
[DEV-714928] Ensure that expansion of Vault PODs and Vault instances accounts for impacts to Vault Auth
RESOLVED
This effort enhances the documentation and tracking used in Vault POD and Vault capacity planning so that these processes account for impacts to Vault Auth.
In summary, Veeva continues to harden Vault Auth while simultaneously reducing its scope of activities. The actions taken to date have led to gains in stability. Remaining actions build on this momentum and improve the scalability and reliability of Vault Auth to accommodate future expansion.
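For readers who want the open-CAPA figures from this update side by side, the short tally below computes completion percentages from the counts stated above. It is an illustrative calculation only: the variable and function names are hypothetical, the two RESOLVED CAPAs are omitted because no open counts apply to them, and the combined figure is a simple count of items (work efforts, tasks, and tickets of differing size), not a measure of remaining effort.

```python
from typing import Dict, Tuple

# (completed, total) counts as stated in the 12 August 2024 update.
open_capas: Dict[str, Tuple[int, int]] = {
    "VAULT AUTHENTICATION SERVICES ARCHITECTURE": (22, 45),
    "VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE": (284, 326),
    "CONTINUED INVESTIGATION": (18, 20),
    "MONITORING AND LOGGING": (4, 9),
}


def percent_complete(completed: int, total: int) -> float:
    """Return the completed share as a percentage rounded to one decimal."""
    return round(100.0 * completed / total, 1)


for name, (done, total) in open_capas.items():
    print(f"{name}: {done}/{total} ({percent_complete(done, total)}%)")

# Simple item count across the four open CAPAs (items are not equal in size).
total_done = sum(done for done, _ in open_capas.values())
total_items = sum(total for _, total in open_capas.values())
print(f"Combined: {total_done}/{total_items} "
      f"({percent_complete(total_done, total_items)}%)")
```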
© 2024 Veeva Trust Site