QualityOne Station Manager
Veeva Vault PODs in the Asia Pacific and US West Coast Regions are currently experiencing intermittent degraded performance. Veeva engineering teams are working to return the service to normal as quickly as possible.
Affected Veeva Vault PODs are now fully available with all services operating normally.
What: CP QualityOne Station Manager (iOS) 24R2.0
- Visit the About the 24R2 Release page to review the schedule and other release details.
- Read about What's New in 24R2 release notes for in-depth information about the new features.
- Reference the Latest Announcements for release-specific information.
Scheduled Maintenance Closed.
Vault CDMS 24R1.0.33 HF is scheduled for production release on 07/25/2024 to CDMS PODs:
VV1-22, VV1-1090, VV2-2092, VV1-1127, VV1-1139, VV1-1148, VV1-1159, VV1-1161, VV1-1174, VV1-1175, VV1-1193, VV2-2142, VV3-3121, VV1-1194
The CDMS Utility services will not be available from 6:00 PM to 8:30 PM PST.
Scheduled Maintenance closed.
Outage Retrospective and Next Steps
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.
The Incident Report identifies six areas of Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
The detailed work to size and scope the specific actions is now underway, with a target of August 7th for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other, similar code inefficiencies. Changes resulting from this review will be available once fully tested. More awareness and oversight have also been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
For further details, please refer to the Incident Report.
Outage Retrospective (31 May 2024)
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
Recall from the last update (15 May) that the Incident Report (IR) was made available on 7 May. The IR identified six areas of Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
To date there are 12 groupings of Development efforts applied across the 6 CAPAs. Each grouping (or "epic") encompasses a range of discrete efforts. Many efforts have been addressed. Many are currently being worked. Others have been or are being scoped and assigned.
To illustrate, one epic exists to reduce Vault Auth calls by caching operational metadata information. This particular epic pertains to the CAPA: VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE. There are 100 discrete work efforts within this epic. As of this update, 43 of the items have been addressed.
Many of the other epics are of a similar scale and composition.
The work done to categorize, size, assign, and work these items puts us in a good position to meet the 7 August due date to assess the ultimate completion dates of all discrete efforts representing a given CAPA.
This work is over and above the awareness and oversight given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
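The caching epic described above can be pictured with a minimal sketch. This is an illustration only, not Veeva's implementation: the class name MetadataCache, the stand-in function fake_auth_lookup, the key "domain-settings", and the 60-second time-to-live are all assumptions made for the example. It shows how holding a recently fetched value locally lets repeated lookups skip the call to the authentication service.

```python
import time
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Small time-to-live cache placed in front of an expensive lookup (hypothetical)."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch            # stand-in for a call to the auth service
        self._ttl = ttl_seconds        # how long a cached value stays valid
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.backend_calls = 0         # counts calls that reached the backend

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # served locally, no backend call
        self.backend_calls += 1
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


def fake_auth_lookup(key: str) -> dict:
    # Assumption: simulates a remote metadata lookup; not a real Veeva API.
    return {"key": key, "loaded": True}


cache = MetadataCache(fake_auth_lookup, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("domain-settings")
print(cache.backend_calls)  # 1
```

In this sketch, 1,000 lookups of the same key produce a single backend call. The trade-off is staleness: a cached value can lag the source by up to the time-to-live.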
Outage Retrospective and Next Steps (12 August 2024)
The following is an update regarding actions taken by Vault teams to address the root causes of the Vault Auth outages that occurred on 22, 23, and 24 April 2024.
Recall that an Incident Report was provided on 7 May 2024 that detailed the changes implemented to that time, particularly on the evening of 24 April 2024, that led to Vault Authentication Services (“Vault Auth”) stability. (The Incident Report is accessible via your Veeva account team members.)
The Incident Report further outlined six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
Each CAPA represents a significant effort on its own. In the months since the outage events, Vault teams have engaged in concentrated efforts to complete the CAPAs. The initial target of 7 August 2024 for determining completion dates for the six CAPAs has been met. Beyond sizing and scoping the remaining items, much work has already gone into addressing each CAPA. As a consequence of work that began at the time of the event and has continued unabated since, Vault Auth pressure has been significantly reduced and Vault Auth stability has been attained.
Remaining efforts now mostly entail complex architectural modifications requiring multiple releases to implement. The aim of these efforts is to allow for continued Vault platform growth and stability for years to come.
What follows is a summary of each CAPA with its anticipated release completion timeframe.
VAULT AUTHENTICATION SERVICES ARCHITECTURE
[DEV-714916] Enhancements to Vault Auth architecture
Target release for completion: 25R1
A key aspect of this CAPA is to relieve Vault Auth of all responsibilities other than user authentication handling, which is a significant architectural change. This and other efforts represented by this CAPA are divided among 4 separate development threads. They encompass, in total, 45 individual work efforts, 22 of which have been closed. The remaining items span multiple releases given their complexity and scope.
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
[DEV-714918] Enhancements to Vault application code and Vault infrastructure to reduce Vault Auth calls
Target release for completion: 24R3.3
This CAPA comprises 8 major efforts. Combined, they cover 326 individual work efforts, of which 284 are completed. Remaining efforts targeted for the 24R3.3 Limited Release will materialize in the 25R1 General Release.
CONTINUED INVESTIGATION
[DEV-714923] A comprehensive investigation continues into the outages
Target release for completion: 25R1
18 of the 20 tasks associated with this effort have been completed. The CAPA remains open to allow for the remaining two items to be accomplished in a deliberative manner.
TEST ENHANCEMENTS
[DEV-714924] Expand test processes to identify excessive Vault Auth accesses earlier in the development cycle
RESOLVED
This CAPA consisted of 3 discrete efforts that stress tested Vault Auth and led to scripts that can be reapplied in ongoing performance testing.
MONITORING AND LOGGING
[DEV-714926] Improve Vault monitoring tools to support more granular analysis when events occur
Target release for completion: 25R1
There are a total of 9 tickets constituting this CAPA, with 4 completed to date. Remaining items are architecturally complex and will take time to implement.
VAULT CAPACITY PLANNING
[DEV-714928] Ensure that expansion of Vault PODs and Vault instances accounts for impacts to Vault Auth
RESOLVED
This effort introduces enhancements to the documentation and tracking of Vault POD and Vault capacity planning, adding context to these processes so that expansion accounts for impacts to Vault Auth.
In summary, Veeva continues to harden Vault Auth while simultaneously reducing its scope of activities. The actions taken to date have led to gains in stability. Remaining actions build on this momentum and improve the scalability and reliability of Vault Auth to accommodate future expansion.
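The completion counts reported above for the four open CAPAs can be tallied as follows. This is a rough sketch: the per-CAPA counts come from this update, but the combined total is an assumption, since it treats "work efforts", "tasks", and "tickets" as comparable units.

```python
# (completed, total) counts as reported in the 12 August 2024 update.
open_capas = {
    "Vault Authentication Services Architecture": (22, 45),
    "Vault Application Code and Vault Infrastructure": (284, 326),
    "Continued Investigation": (18, 20),
    "Monitoring and Logging": (4, 9),
}


def percent_complete(done: int, total: int) -> float:
    """Share of items completed, as a percentage rounded to one decimal."""
    return round(100 * done / total, 1)


for name, (done, total) in open_capas.items():
    print(f"{name}: {done}/{total} ({percent_complete(done, total)}%)")

# Assumption: items are summed across CAPAs as if they were equal in size.
total_done = sum(done for done, _ in open_capas.values())
total_items = sum(total for _, total in open_capas.values())
print(f"Combined: {total_done}/{total_items} "
      f"({percent_complete(total_done, total_items)}%)")
```

Item counts say nothing about the size of each item. The update notes that the remaining items are the architecturally complex ones.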
What: CP QualityOne Audit Checklist Mobile, QualityOne Station Manager (iOS) and QualityOne Audit Checklist Mobile (Android) 24R1.0
- Visit the About the 24R1 Release page to review the schedule and other release details.
- Read about What's New in 24R1 release notes for in-depth information about the new features.
- Reference the Latest Announcements for release-specific information.
Scheduled Maintenance Closed.
Vault Outage - Authentication Service
2024.04.25
Outage and Current Status
The recent issues with load on the authentication cluster resulted in the following outages:
Post Release 24R1: Monday, April 22, Tuesday, April 23, Wednesday, April 24
Post Infrastructure upgrade: Monday, March 25
As of April 25th, we have applied code fixes that appear to have stabilized the system. We will have full confidence once the service has run for a week or more without downtime. There was no security issue related to customer data, no loss of customer data, and the outage was not a result of a security attack by an outside party.
The authentication cluster (many virtual computers) is a central point of failure because it controls security access. Security of course is critically important. If the whole authentication cluster is down, all Vaults are down. The authentication cluster is designed to be redundant and highly available and historically has been until this recent outage.
In each outage case, the authentication cluster became overloaded and unable to process requests. The Veeva Engineering and Operations teams restored service by stopping the service, clearing the traffic and doing rolling, gradual restarts of the global Vault Points of Delivery (PODs). The restoration of service resulted in outages varying from 1-3 hours depending on the POD.
We take our commitment to delivering a reliable and performant service very seriously. With these outages, we recognize that we're falling short on that commitment. Veeva's entire leadership team is aware of the impact this has on your businesses and we are mobilizing all appropriate resources to both determine root cause and minimize the risk of future disruption.
Background: Vault Authentication Cluster
The Vault authentication cluster is a multi-region service (login.veevavault.com) that provides authentication to all Vaults around the world. Over 200 Vault PODs are located in 4 major Amazon regions (APAC, Europe, US East and US West). The PODs run the applications for all end users and API access. Authentication services verify user credentials before passing users along to their respective Vaults.
To optimize performance and provide redundancy, Vault’s authentication cluster comprises four server groupings: a primary (controlling) server located in US East that supports both reading from and writing to the authentication service, and 3 secondary servers that are read-only and located in each of the other regions (APAC, Europe and US West).
The read-only servers are synchronized with the primary server and can act as fail-overs should hardware issues take the primary server offline. Having regionally accessible read-only authentication provides faster user experience when Vaults need to confirm active, valid sessions. The primary server needs to support read/write activities such as adding new users and processing all logins (where it not only validates the user’s credentials, but also writes login information to Vault’s login audit trail).
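The read/write split described above can be sketched as a simple routing rule. This is an illustration under stated assumptions, not the actual Vault Auth routing logic: the region labels, the operation names, and the route function are all hypothetical.

```python
# Assumption: region labels are illustrative, not real endpoint names.
PRIMARY_REGION = "us-east"                          # read/write primary
SECONDARY_REGIONS = {"apac", "europe", "us-west"}   # read-only secondaries

# Assumption: operation names are hypothetical labels for this sketch.
WRITE_OPERATIONS = {"login", "create_user"}   # logins also write the audit trail
READ_OPERATIONS = {"validate_session"}        # confirms an active, valid session


def route(operation: str, caller_region: str) -> str:
    """Return the region whose auth server should handle the operation."""
    if operation in WRITE_OPERATIONS:
        # Anything that writes must go to the primary, wherever the caller is.
        return PRIMARY_REGION
    if operation in READ_OPERATIONS:
        # Reads stay in-region when a read-only secondary is available there.
        if caller_region in SECONDARY_REGIONS:
            return caller_region
        return PRIMARY_REGION
    raise ValueError(f"unknown operation: {operation}")


print(route("validate_session", "europe"))  # europe
print(route("login", "europe"))             # us-east
```

Under this model, session checks are served in-region, while every login lands on the primary regardless of where the user is located.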
Outage Investigation, Findings & Corrective Actions
Over the last several days, Vault has experienced abnormal usage patterns that have bypassed our standard monitoring and service protection features. This has come with an obvious increase of traffic (amount of data sent and received), major spikes in connection usage and an inability to recover after the surge of connections without operational intervention.
It appears at this time that the abnormal usage was not caused by an increase in user traffic, but rather software bugs in the Veeva code that were causing too many calls to the authentication cluster in certain usage patterns. We have fixed two of the more significant bugs and will fix more over time. These code changes make the code more efficient, but they do not compromise the security in any way.
Next Steps
We will continue with our hypercare monitoring until the system has proven to be stable for many days. We will continue to identify and fix any appropriate software bugs that are contributing to the issue. And we will develop a longer term plan to address the root causes of how the software bugs were introduced in the first place. We will also formalize and share all findings in the Incident Report that includes root cause analysis. This report takes approximately 10 business days from the time we identify the root cause to being available to all customers.
FAQ
Did you consider failing over to your backup Authentication or rolling back the release?
Failover is not appropriate when the issue is traffic related. If we fail over to new hardware, we simply bring the problem to the new hardware. As for rolling back, we saw this same pattern, in hindsight, in March, so we did not feel it was directly attributable to the latest release.
Is it related to VeevaID?
No, it is not related to VeevaID. VeevaID has dedicated connection pools, and the volume of VeevaID requests is more easily measured and not yet significant.
Is this related to a DDoS attack? What are the cybersecurity and data integrity implications?
We believe this is not a cyber attack because the majority of the traffic is coming from authenticated processing or internal processing. There is not a high volume of direct access from anything other than expected sources. There is no impact whatsoever on data integrity, and no data loss, stemming from this issue.
Were Vault Sandboxes taken offline and should we expect that going forward?
We had plans in place to take the sandboxes offline if the service issues continued on Thursday. However, the remediations we put in place were sufficient to restore the service without impact to the sandboxes. We do not have any future plans to take the sandboxes offline, nor do we see any such need at this time.
We have updated our sub-processors to add Absorb Software Inc, Adstra LLC, Cadent LLC, and Throtle Inc. Absorb is used within Veeva CDMS to provide online training for end-users. Adstra LLC, Cadent LLC, and Throtle Inc. are used within Veeva Crossix for data linkage and data distribution. You may find our full list of sub-processors at trust.veeva.com.
What: CP QualityOne Mobile, QualityOne Audit Checklist Mobile, QualityOne Station Manager (iOS) and QualityOne Audit Checklist Android 23R3.0
- Visit the About the 23R3 Release page to review the schedule and other release details.
- Read about What's New in 23R3 release notes for in-depth information about the new features.
- Reference the Latest Announcements for release-specific information.
Scheduled Maintenance Closed.
The Consumer Products QualityOne Station Manager iOS and Audit Checklist Android apps are scheduled to be upgraded to the latest general release.
When: Tuesday, August 22 at 10:00 am PT
What: CP QualityOne Station Manager and Audit Checklist 23R2.0
Estimated Duration: 60 minutes
Learn More: Read the 23R2 Release Notes for details about the new features.
Scheduled Maintenance Closed.
The Veeva QualityOne Station Manager app is scheduled to be upgraded to the latest general release.
When: Tuesday, May 02 at 10:00 am PT
What: Veeva QualityOne Station Manager 23R1.0
Estimated Duration: 60 minutes
For more details, please read the 23R1 Release Notes to learn more about the new features.
For updates regarding system availability, please refer to the Veeva Trust Site.
If you have further questions, please submit a ticket with Veeva Product Support.
Scheduled Maintenance closed.