Veeva Vault PODs in the Asia Pacific and US West Coast Regions are currently experiencing intermittent degraded performance. Veeva engineering teams are working to return the service to normal as quickly as possible.
Affected Veeva Vault PODs are now fully available with all services operating normally.
Vault CDMS 24R1.0.33 HF is scheduled for production release on 07/25/2024 to CDMS PODs:
VV1-22, VV1-1090, VV2-2092, VV1-1127, VV1-1139, VV1-1148, VV1-1159, VV1-1161, VV1-1174, VV1-1175, VV1-1193, VV2-2142, VV3-3121, VV1-1194
The CDMS Utility services will not be available from 6:00 PM to 8:30 PM PST.
Scheduled Maintenance closed.
Outage Retrospective and Next Steps
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.
The Incident Report identified six areas of Corrective and Preventative Actions (CAPAs), tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
The detailed work to size and scope the specific actions is now underway, with a target of August 7th for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other, similar code inefficiencies. The resulting changes will be available once fully tested. In addition, more awareness and oversight have been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
For further details, please refer to the Incident Report.
Outage Retrospective (31 May 2024)
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
Recall from the last update (15 May) that the Incident Report (IR) was made available on 7 May. The IR identified six areas of Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
To date there are 12 groupings of Development efforts applied across the 6 CAPAs. Each grouping (or "epic") encompasses a range of discrete efforts. Many efforts have been addressed, many are currently in progress, and others have been or are being scoped and assigned.
To illustrate, one epic exists to reduce Vault Auth calls by caching operational metadata information. This particular epic pertains to the CAPA: VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE. There are 100 discrete work efforts within this epic. As of this update, 43 of the items have been addressed.
Many of the other epics are of a similar scale and composition.
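The caching epic described above can be illustrated with a minimal sketch. It assumes a simple time-to-live (TTL) cache placed in front of a remote metadata lookup; the class `TTLCache`, the function `fetch_metadata_from_auth`, and the 60-second TTL are hypothetical names and values chosen for illustration, not Vault's actual implementation.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Keeps fetched values for a fixed time so repeated lookups skip the backing service."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve from the cache while the stored entry is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise call the backing service once and remember the result.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


calls: List[str] = []


def fetch_metadata_from_auth(vault_id: str) -> dict:
    # Hypothetical stand-in for a remote call to the authentication service.
    calls.append(vault_id)
    return {"vault_id": vault_id}


cache = TTLCache(fetch_metadata_from_auth, ttl_seconds=60.0)
for _ in range(5):
    cache.get("vault-1")
print(len(calls))  # 1: only the first lookup reached the backing service
```

The trade-off in any such design is staleness: a longer TTL removes more calls but delays the point at which a metadata change becomes visible.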
The work done to categorize, size, assign, and work these items puts us in a good position to meet the 7 August due date to assess the ultimate completion dates of all discrete efforts representing a given CAPA.
This work is over and above the awareness and oversight given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
Outage Retrospective and Next Steps (12 August 2024)
The following is an update regarding actions taken by Vault teams to address the root causes of the Vault Auth outages that occurred on 22, 23, and 24 April 2024.
Recall that an Incident Report was provided on 7 May 2024 that detailed the changes implemented up to that time, particularly on the evening of 24 April 2024, that led to Vault Authentication Services (“Vault Auth”) stability. (The Incident Report is accessible via your Veeva account team members.)
The Incident Report further outlined six Corrective and Preventative Actions (CAPAs) tracked in Veeva’s QMS, covering the following areas:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
Each CAPA represents a significant effort on its own. In the months since the outage events, Vault teams have engaged in concentrated efforts to complete the CAPAs. The initial target of 7 August 2024 for determining completion dates for the six CAPAs has been met, and much work has also gone into addressing each CAPA in addition to sizing and scoping the remaining items. As a consequence of this work, which started at the time of the events and has continued unabated since, Vault Auth pressure has been significantly reduced and Vault Auth stability has been attained.
Remaining efforts now mostly entail complex architectural modifications requiring multiple releases to implement. The aim of these efforts is to allow for continued Vault platform growth and stability for years to come.
What follows is a summary of each CAPA with its anticipated release completion timeframe.
VAULT AUTHENTICATION SERVICES ARCHITECTURE
[DEV-714916] Enhancements to Vault Auth architecture
Target release for completion: 25R1
A key aspect of this CAPA is to relieve Vault Auth of all responsibilities other than user authentication handling, which is a significant architectural change. This and other efforts represented by this CAPA are divided among 4 separate development threads. They encompass, in total, 45 individual work efforts, 22 of which have been closed. The remaining items require work that spans releases given their complexity and scope.
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
[DEV-714918] Enhancements to Vault application code and Vault infrastructure to reduce Vault Auth calls
Target release for completion: 24R3.3
This CAPA comprises 8 major efforts. Combined, these efforts comprise 326 individual work efforts, of which 284 are completed. Remaining efforts targeted for the 24R3.3 Limited Release will materialize in the 25R1 General Release.
CONTINUED INVESTIGATION
[DEV-714923] A comprehensive investigation continues into the outages
Target release for completion: 25R1
18 of the 20 tasks associated with this effort have been completed. The CAPA remains open to allow for the remaining two items to be accomplished in a deliberative manner.
TEST ENHANCEMENTS
[DEV-714924] Expand test processes to identify excessive Vault Auth accesses earlier in the development cycle
RESOLVED
This CAPA consisted of 3 discrete efforts that stress tested Vault Auth and led to scripts that can be reapplied in ongoing performance testing.
MONITORING AND LOGGING
[DEV-714926] Improve Vault monitoring tools to support more granular analysis when events occur
Target release for completion: 25R1
There are a total of 9 tickets constituting this CAPA, with 4 completed to date. Remaining items are architecturally complex and will take time to implement.
VAULT CAPACITY PLANNING
[DEV-714928] Ensure that expansion of Vault PODs and Vault instances accounts for impacts to Vault Auth
RESOLVED
This effort introduces enhancements to the documentation and tracking of Vault POD and Vault capacity planning, bringing additional context into those planning processes.
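As a rough illustration of the figures reported above, the following sketch tallies the completed and total work items for the four CAPAs that remain open and state both counts. The two RESOLVED CAPAs are left out because the capacity planning summary gives no item count. The dictionary layout and the helper `percent_complete` are assumptions made for this example only.

```python
from typing import Dict, Tuple

# (completed, total) work items as stated in the CAPA summaries above.
capa_progress: Dict[str, Tuple[int, int]] = {
    "VAULT AUTHENTICATION SERVICES ARCHITECTURE": (22, 45),
    "VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE": (284, 326),
    "CONTINUED INVESTIGATION": (18, 20),
    "MONITORING AND LOGGING": (4, 9),
}


def percent_complete(done: int, total: int) -> float:
    """Return the completion percentage rounded to one decimal place."""
    return round(100.0 * done / total, 1)


for name, (done, total) in capa_progress.items():
    print(f"{name}: {done}/{total} ({percent_complete(done, total)}%)")

total_done = sum(done for done, _ in capa_progress.values())
total_items = sum(total for _, total in capa_progress.values())
print(f"Open CAPAs combined: {total_done}/{total_items} "
      f"({percent_complete(total_done, total_items)}%)")
```

Across these four CAPAs the counts come to 328 of 400 items. Item counts say nothing about the relative size of each item, so the percentages are only a coarse indicator.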
In summary, Veeva continues to harden Vault Auth while simultaneously reducing its scope of activities. The actions taken to date have led to gains in stability. Remaining actions build on this momentum and improve the scalability and reliability of Vault Auth to accommodate future expansion.
Vault Outage - Authentication Service
2024.04.25
Outage and Current Status
The recent issues with load on the authentication cluster resulted in the following outages:
Post Release 24R1: Monday, April 22, Tuesday, April 23, Wednesday, April 24
Post Infrastructure upgrade: Monday, March 25
As of April 25th, we have applied code fixes that appear to have stabilized the system. We will have full confidence once the service has run for a week or more without downtime. There was no security issue related to customer data, no loss of customer data, and the outage was not a result of a security attack by an outside party.
The authentication cluster (many virtual computers) is a central point of failure because it controls security access. Security of course is critically important. If the whole authentication cluster is down, all Vaults are down. The authentication cluster is designed to be redundant and highly available and historically has been until this recent outage.
In each outage case, the authentication cluster became overloaded and unable to process requests. The Veeva Engineering and Operations teams restored service by stopping the service, clearing the traffic and doing rolling, gradual restarts of the global Vault Points of Delivery (PODs). The restoration of service resulted in outages varying from 1-3 hours depending on the POD.
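The rolling, gradual restart described above can be sketched in general terms as follows. This is an assumption-laden illustration of the technique, not Veeva's operational tooling: the POD names, the `restart` and `is_healthy` callbacks, and the batch size are all hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def rolling_restart(pods: Sequence[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    batch_size: int) -> List[str]:
    """Restart PODs one batch at a time, halting if a batch does not come back healthy."""
    restarted: List[str] = []
    for batch in batches(pods, batch_size):
        for pod in batch:
            restart(pod)
        unhealthy = [pod for pod in batch if not is_healthy(pod)]
        if unhealthy:
            # Stop here so a failing batch does not add load to the rest of the fleet.
            raise RuntimeError(f"halting rollout; unhealthy PODs: {unhealthy}")
        restarted.extend(batch)
    return restarted


# Hypothetical usage with in-memory stand-ins for real operations.
restart_log: List[str] = []
order = rolling_restart(
    pods=["pod-a", "pod-b", "pod-c", "pod-d", "pod-e"],
    restart=restart_log.append,
    is_healthy=lambda pod: True,
    batch_size=2,
)
print(order)
```

Restarting in small batches limits how many PODs reconnect to a shared service at the same moment, which is the point of a gradual restart after an overload.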
We take our commitment to delivering a reliable and performant service very seriously. With these outages, we recognize that we're falling short on that commitment. Veeva's entire leadership team is aware of the impact this has on your businesses and we are mobilizing all appropriate resources to both determine root cause and minimize the risk of future disruption.
Background: Vault Authentication Cluster
The Vault authentication cluster is a multi-region service (login.veevavault.com) that provides authentication to all Vaults around the world. Over 200 Vault PODs are located in 4 major Amazon regions (APAC, Europe, US East and US West). The PODs run the applications for all end users and API access. Authentication services support the verification of user credentials before passing that user along to their respective Vaults.
To optimize performance and provide redundancy, Vault’s authentication cluster comprises four server groupings: a primary (controlling) server located in US East that supports both reading from and writing to the authentication service, and 3 secondary servers that are read-only and located in each of the other regions (APAC, Europe and US West).
The read-only servers are synchronized with the primary server and can act as fail-overs should hardware issues take the primary server offline. Having regionally accessible read-only authentication provides faster user experience when Vaults need to confirm active, valid sessions. The primary server needs to support read/write activities such as adding new users and processing all logins (where it not only validates the user’s credentials, but also writes login information to Vault’s login audit trail).
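The division of work between the primary and the regional read-only servers can be sketched as a simple routing rule. This is a simplified illustration based only on the description above; the operation names and the function `choose_auth_server` are hypothetical, and failover behavior is not modeled.

```python
PRIMARY_REGION = "US East"
READ_ONLY_REGIONS = {"APAC", "Europe", "US West"}

# Hypothetical operation names; logins count as writes because they
# also record an entry in the login audit trail.
WRITE_OPERATIONS = {"login", "add_user"}
READ_OPERATIONS = {"validate_session"}


def choose_auth_server(operation: str, pod_region: str) -> str:
    """Pick the server grouping that should handle an authentication request."""
    if operation in WRITE_OPERATIONS:
        # All writes go to the single read/write primary.
        return f"primary ({PRIMARY_REGION})"
    if operation in READ_OPERATIONS:
        # Reads stay in-region when a read-only server exists there.
        if pod_region in READ_ONLY_REGIONS:
            return f"read-only ({pod_region})"
        return f"primary ({PRIMARY_REGION})"
    raise ValueError(f"unknown operation: {operation}")


print(choose_auth_server("validate_session", "Europe"))
print(choose_auth_server("login", "Europe"))
print(choose_auth_server("validate_session", "US East"))
```

Under this rule, session checks are served close to the POD, while every login converges on the primary regardless of region.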
Outage Investigation, Findings & Corrective Actions
Over the last several days, Vault has experienced abnormal usage patterns that have bypassed our standard monitoring and service protection features. This has come with an obvious increase of traffic (amount of data sent and received), major spikes in connection usage and an inability to recover after the surge of connections without operational intervention.
It appears at this time that the abnormal usage was not caused by an increase in user traffic, but rather software bugs in the Veeva code that were causing too many calls to the authentication cluster in certain usage patterns. We have fixed two of the more significant bugs and will fix more over time. These code changes make the code more efficient, but they do not compromise the security in any way.
Next Steps
We will continue with our hypercare monitoring until the system has proven to be stable for many days. We will continue to identify and fix any appropriate software bugs that are contributing to the issue. And we will develop a longer term plan to address the root causes of how the software bugs were introduced in the first place. We will also formalize and share all findings in the Incident Report that includes root cause analysis. This report takes approximately 10 business days from the time we identify the root cause to being available to all customers.
FAQ
Did you consider failing over to your backup Authentication or rolling back the release?
Failover is not appropriate when the issue is traffic related. If we fail over to new hardware, we simply bring the problem to the new hardware. Given that we saw this pattern, in hindsight, in March, we did not feel it was directly attributable to the latest release.
Is it related to VeevaID?
No, it is not related to VeevaID. VeevaID has dedicated connection pools, and the volume of VeevaID requests is more easily measured and not yet significant.
Is this related to a DDoS attack? What are the cybersecurity and data integrity implications?
We believe this is not a cyber attack because the majority of the traffic we are seeing comes from authenticated processing or internal processing. There is not a high volume of direct access from anything other than expected sources. There is no impact whatsoever on data integrity, and no data loss, stemming from this issue.
Were Vault Sandboxes taken offline and should we expect that going forward?
We had plans in place to take the sandboxes offline if the service issues continued on Thursday. However, the remediations we put in place were sufficient to restore the service without impact to the sandboxes. We do not have any future plans to take the sandboxes offline, nor do we see any such need at this time.
We have updated our sub-processors to add Absorb Software Inc, Adstra LLC, Cadent LLC, and Throtle Inc. Absorb is used within Veeva CDMS to provide online training for end-users. Adstra LLC, Cadent LLC, and Throtle Inc. are used within Veeva Crossix for data linkage and data distribution. You may find our full list of sub-processors at trust.veeva.com.
Veeva CRM server maintenance is required for MCCP on CRM-30. During this time, Veeva CRM MCCP functionality will be unavailable for ORGs connected to CRM-30 POD for up to 60 minutes.
Scheduled Maintenance closed.
The trust portal https://trust.veeva.com will be upgraded on November 14th 2022 at 9am pacific time. There will be no impact to customers. No downtime is expected.
We are seeing errors with Video Transcoding. We are working to resolve the issue.
The problem has been resolved.
We are investigating the root cause.
All Chinese customers who are using China Link cannot access their Vaults. This affects both Production and Sandbox environments.
We are working to restore normal service!
Our circuit is down from China to Tokyo.
China Telecom network team is still working on the investigation!
Veeva Technology Team is investigating the issue with China Telecom and will provide updates as they become available.
China Telecom has identified a problem in Tokyo and is working with their local provider on the resolution. Further updates will be provided as they become available.
China Telecom is still working with their local provider in Tokyo on the restoration of the service. Further updates will be provided as they become available.
The issue has been resolved. The China Link is UP now.
From AWS Technical support:
2:11 AM PDT We are investigating an increase in DNS lookup failures from EC2 instances in a single Availability Zone in the US-EAST-1 Region.
This might cause intermittent issues for Products hosted in US-EAST-1
From AWS Support:
3:44 AM PDT We are implementing a mitigation to the increased DNS resolution errors from EC2 instances in a single Availability Zone in the US-EAST-1 Region, and are starting to see recovery.
We have been informed by AWS Support that this incident has now been resolved.
From AWS Support:
Between 1:22 AM and 4:13 AM PDT, customers experienced an increase in DNS resolution errors from EC2 instances in a single Availability Zone of the US-EAST-1 Region.
The issue has been resolved and all DNS queries are being answered normally.
© 2024 Veeva Trust Site