Vault Authentication Service Outage - Summary
May 15 2024,9:30am PDT
Outage Retrospective and Next Steps
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.
The Incident Report identified six Corrective and Preventative Actions (CAPAs), tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
The detailed work to size and scope the specific actions is now underway, with a target of 7 August for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other similar code inefficiencies. Any resulting changes will be made available once fully tested. In addition, more awareness and oversight have been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
For further details, please refer to the Incident Report.
May 31 2024,2:22pm PDT
Outage Retrospective (31 May 2024)
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
Recall from the last update (15 May) that the Incident Report (IR) was made available on 7 May. The IR identified six Corrective and Preventative Actions (CAPAs), tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
To date, 12 groupings of Development efforts have been applied across the six CAPAs. Each grouping (or "epic") encompasses a range of discrete efforts. Many efforts have been addressed, many are in progress, and others have been or are being scoped and assigned.
To illustrate, one epic exists to reduce Vault Auth calls by caching operational metadata information. This particular epic pertains to the CAPA: VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE. There are 100 discrete work efforts within this epic. As of this update, 43 of the items have been addressed.
Many of the other epics are of a similar scale and composition.
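As a rough illustration of the caching idea behind the epic described above, the following is a minimal Python sketch and not Veeva's implementation. It assumes a time-to-live cache placed in front of a metadata lookup so that repeated requests for the same key do not each reach the Auth service. Every name here (TTLCache, fetch_metadata_from_auth, the 60-second lifetime, the returned fields) is hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Hypothetical time-to-live cache placed in front of a remote lookup."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # function that performs the remote call
        self._ttl = ttl_seconds      # how long a cached entry stays valid
        self._clock = clock          # injectable clock, useful for testing
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no remote call is made
        value = self._fetch(key)     # missing or expired: call the service once
        self._entries[key] = (now, value)
        return value


# Counter standing in for load on the (hypothetical) Auth service.
call_counter = {"auth": 0}


def fetch_metadata_from_auth(vault_id: str) -> dict:
    """Stand-in for a remote metadata lookup; the fields are invented."""
    call_counter["auth"] += 1
    return {"vault_id": vault_id, "domain": "example"}


cache = TTLCache(fetch_metadata_from_auth, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("vault-1")

print(call_counter["auth"])  # remote calls made for 1000 lookups
```

In this sketch, 1000 lookups of the same key within the lifetime result in a single remote call. The trade-off is staleness: a cached entry can lag behind the source by up to the chosen lifetime, so the lifetime has to suit how often the metadata changes.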
The work done to categorize, size, assign, and complete these items puts us in a good position to meet the 7 August target for determining the completion dates of the discrete efforts under each CAPA.
This work is over and above the awareness and oversight given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
August 12 2024,1:28pm PDT
Outage Retrospective and Next Steps (12 August 2024)
The following is an update regarding actions taken by Vault teams to address the root causes of the Vault Auth outages that occurred on 22, 23, and 24 April 2024.
Recall that an Incident Report was provided on 7 May 2024 that detailed the changes implemented up to that time, particularly on the evening of 24 April 2024, that led to Vault Authentication Services (“Vault Auth”) stability. (The Incident Report is accessible via your Veeva account team members.)
The Incident Report further outlined six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
Each CAPA represents a significant effort on its own. In the months since the outage events, Vault teams have engaged in concentrated efforts to complete the CAPAs. The initial target of 7 August 2024 for determining completion dates for the six CAPAs has been met. Beyond sizing and scoping the remaining items, much work has gone into addressing each CAPA. As a consequence of this work, which began at the time of the event and has continued unabated since, Vault Auth pressure has been significantly reduced and Vault Auth stability has been attained.
Remaining efforts now mostly entail complex architectural modifications requiring multiple releases to implement. The aim of these efforts is to allow for continued Vault platform growth and stability for years to come.
What follows is a summary of each CAPA with its anticipated release completion timeframe.
VAULT AUTHENTICATION SERVICES ARCHITECTURE
[DEV-714916] Enhancements to Vault Auth architecture
Target release for completion: 25R1
A key aspect of this CAPA is to relieve Vault Auth of all responsibilities other than user authentication handling, a significant architectural change. This and other efforts represented by this CAPA are divided among 4 separate development threads. They encompass, in total, 45 individual work efforts, 22 of which have been closed. The remaining items span multiple releases given their complexity and scope.
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
[DEV-714918] Enhancements to Vault application code and Vault infrastructure to reduce Vault Auth calls
Target release for completion: 24R3.3
This CAPA comprises 8 major efforts. Combined, they encompass 326 individual work efforts, of which 284 are completed. Remaining efforts are targeted for the 24R3.3 Limited Release and will become generally available in the 25R1 General Release.
CONTINUED INVESTIGATION
[DEV-714923] A comprehensive investigation continues into the outages
Target release for completion: 25R1
18 of the 20 tasks associated with this effort have been completed. The CAPA remains open to allow for the remaining two items to be accomplished in a deliberative manner.
TEST ENHANCEMENTS
[DEV-714924] Expand test processes to identify excessive Vault Auth accesses earlier in the development cycle
RESOLVED
This CAPA consisted of 3 discrete efforts that stress tested Vault Auth and led to scripts that can be reapplied in ongoing performance testing.
MONITORING AND LOGGING
[DEV-714926] Improve Vault monitoring tools to support more granular analysis when events occur
Target release for completion: 25R1
There are a total of 9 tickets constituting this CAPA, with 4 completed to date. Remaining items are architecturally complex and will take time to implement.
VAULT CAPACITY PLANNING
[DEV-714928] Ensure that expansion of Vault PODs and Vault instances accounts for impacts to Vault Auth
RESOLVED
This effort introduced enhancements to the documentation and tracking used in Vault POD and Vault capacity planning, so that the planning processes incorporate additional context on impacts to Vault Auth.
In summary, Veeva continues to harden Vault Auth while simultaneously reducing its scope of activities. The actions taken to date have led to gains in stability. Remaining actions build on this momentum and improve the scalability and reliability of Vault Auth to accommodate future expansion.
March 04 2025,1:53pm PST
Late-April 2024 Outage Retrospective and Update
The following is an update regarding the Vault Authentication Services (“Vault Auth”, “Auth”) outage that occurred in late April 2024.
In the months following the event, hundreds of discrete Development work efforts have been implemented under the auspices of CAPAs detailed in the outage Incident Report. These fixes provide long-term stabilization and are in addition to the efforts implemented at the time of the event. Overall, Vault Authentication Services has proved stable and reliable in the time since the April 2024 event.
The outage Incident Report outlined six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS that address the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
A summary of each CAPA follows.
VAULT AUTHENTICATION SERVICES ARCHITECTURE
[DEV-714916] Enhancements to Vault Auth architecture
Status: In process
Due Date: 9 May 2025
A key goal of this CAPA is to reduce Vault Auth’s scope of responsibilities. Only large-scale items remain in this multi-release effort. Remaining actions are anticipated to be completed with the 25R1 General Release, at which time the CAPA will be moved to Pending Effectiveness Check.
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
[DEV-714918] Enhancements to Vault application code and Vault infrastructure to reduce Vault Auth calls
Status: Awaiting dependency
Due Date: 9 May 2025
Code changes are complete and will be available in the upcoming 25R1 General Release. The bulk of the efforts focus on reducing Auth overhead. When 25R1 is released in April, the CAPA will be moved to Pending Effectiveness Check.
CONTINUED INVESTIGATION
[DEV-714923] A comprehensive investigation continues into the outages
Status: Closed
After a thorough investigation, Vault Development teams determined that Auth database query latency increased due to SSL lock contention under heavy load. The causative factors noted in the Incident Report, when combined with this finding, entirely explain the nature of the incident. The SSL libraries that manifested this behavior under load are deeply rooted in the Vault platform and are scheduled for upgrade in the 25R2 General Release timeframe. Work efforts noted elsewhere have significantly diminished Vault Auth loads and have mitigated the risk of a recurrence.
TEST ENHANCEMENTS
[DEV-714924] Expand test processes to identify excessive Vault Auth accesses earlier in the development cycle
Status: Pending Effectiveness Check
This effort is complete and its effectiveness will be evaluated by Veeva Q&C at the end of the Effectiveness Check period.
MONITORING AND LOGGING
[DEV-714926] Improve Vault monitoring tools to support more granular analysis when events occur
Status: In process
Due Date: 9 May 2025
This effort consists of smaller completed efforts and larger, multi-release efforts that are nearing completion. Remaining actions are anticipated to be completed with the 25R1 General Release, at which time the CAPA will be closed.
VAULT CAPACITY PLANNING
[DEV-714928] Ensure that expansion of Vault PODs and Vault instances accounts for impacts to Vault Auth
Status: Pending Effectiveness Check
This effort is complete and effectiveness will be evaluated by Veeva Q&C at the end of the Effectiveness Check period.
In summary, Vault Authentication Services has demonstrated a track record of stability and performance in the months following the April 2024 outage. Many hundreds of code fixes have been applied and remaining work efforts that span multiple releases are drawing to a close. Effectiveness Checks exist to confirm the reliability and performance from a long-term perspective. In all, these actions have led to a more durable and scalable authentication framework.
A final post in the summer 2025 timeframe will conclude the topic.
© 2024 Veeva Trust Site