Vault Authentication Service Outage - Summary
April 25 2024, 12:00pm PDT
Vault Outage - Authentication Service
2024.04.25
Outage and Current Status
The recent issues with load on the authentication cluster resulted in the following outages:
Post Release 24R1: Monday, April 22; Tuesday, April 23; Wednesday, April 24
Post Infrastructure upgrade: Monday, March 25
As of April 25th, we have applied code fixes that appear to have stabilized the system. We will have full confidence once the service has run for a week or more without downtime. There was no security issue related to customer data, no loss of customer data, and the outage was not a result of a security attack by an outside party.
The authentication cluster (many virtual computers) is a central point of failure because it controls security access, and security is critically important. If the whole authentication cluster is down, all Vaults are down. The authentication cluster is designed to be redundant and highly available, and historically it has been, until these recent outages.
In each outage, the authentication cluster became overloaded and unable to process requests. The Veeva Engineering and Operations teams restored service by stopping the service, clearing the traffic, and performing rolling, gradual restarts of the global Vault Points of Delivery (PODs). The outages lasted from 1 to 3 hours, depending on the POD.
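The rolling, gradual restart described above can be pictured as restarting PODs in small batches rather than all at once. The following is a minimal sketch of that general idea only, not Veeva's tooling: the function name, the POD names, and the batch size are all hypothetical.

```python
from typing import Iterator, List


def restart_batches(pods: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield PODs in consecutive batches so they can be restarted gradually.

    Hypothetical helper: the source does not say how PODs were grouped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pods), batch_size):
        yield pods[start:start + batch_size]


# Example with made-up POD names: five PODs restarted two at a time.
for batch in restart_batches(["pod-a", "pod-b", "pod-c", "pod-d", "pod-e"], 2):
    print("restart", batch)
```

In a scheme like this, each batch would be confirmed healthy before the next one starts, so the returning traffic reaches the authentication cluster in steps instead of as a single surge.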
We take our commitment to delivering a reliable and performant service very seriously. With these outages, we recognize that we're falling short on that commitment. Veeva's entire leadership team is aware of the impact this has on your businesses and we are mobilizing all appropriate resources to both determine root cause and minimize the risk of future disruption.
Background: Vault Authentication Cluster
The Vault authentication cluster is a multi-region service (login.veevavault.com) that provides authentication to all Vaults around the world. There are over 200 Vault PODs located in 4 major Amazon regions (APAC, Europe, US East and US West). The PODs run the applications for all end users and API access. Authentication services verify user credentials before passing users along to their respective Vaults.
To optimize performance and provide redundancy, Vault’s authentication cluster comprises four server groupings: a primary (controlling) server located in US East that supports both reading from and writing to the authentication service, and 3 secondary servers that are read-only and located in each of the other regions: APAC, Europe and US West.
The read-only servers are synchronized with the primary server and can act as fail-overs should hardware issues take the primary server offline. Having regionally accessible read-only authentication provides faster user experience when Vaults need to confirm active, valid sessions. The primary server needs to support read/write activities such as adding new users and processing all logins (where it not only validates the user’s credentials, but also writes login information to Vault’s login audit trail).
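The read/write split described above can be sketched as a small routing function. This is an illustrative sketch only, not Veeva's implementation: the server labels, region keys, and operation names are hypothetical. Only the rule itself comes from the description above: session checks go to the regional read-only server, while logins and user changes go to the primary in US East.

```python
# Hypothetical labels; the source names the regions but not the servers.
PRIMARY = "us-east-primary"
READ_ONLY = {
    "apac": "apac-read-only",
    "europe": "europe-read-only",
    "us-west": "us-west-read-only",
}

# Operations that write. Logins count as writes because they also
# record an entry in the login audit trail.
WRITE_OPERATIONS = {"login", "add_user"}


def route(operation: str, region: str) -> str:
    """Return the server grouping that should handle an auth request."""
    if operation in WRITE_OPERATIONS:
        return PRIMARY
    # Reads such as session validation stay in-region when a read-only
    # server exists there; otherwise they fall back to the primary.
    return READ_ONLY.get(region, PRIMARY)
```

Under these assumptions, `route("validate_session", "europe")` stays in Europe, while `route("login", "europe")` goes to the primary. That is why every login, from any region, adds load to the single read/write server.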
Outage Investigation, Findings & Corrective Actions
Over the last several days, Vault has experienced abnormal usage patterns that have bypassed our standard monitoring and service protection features. This has come with an obvious increase of traffic (amount of data sent and received), major spikes in connection usage and an inability to recover after the surge of connections without operational intervention.
It appears at this time that the abnormal usage was not caused by an increase in user traffic, but rather by software bugs in the Veeva code that caused too many calls to the authentication cluster in certain usage patterns. We have fixed two of the more significant bugs and will fix more over time. These code changes make the code more efficient, and they do not compromise security in any way.
Next Steps
We will continue with our hypercare monitoring until the system has proven to be stable for many days. We will continue to identify and fix any software bugs that are contributing to the issue, and we will develop a longer-term plan to address the root causes of how the software bugs were introduced in the first place. We will also formalize and share all findings in an Incident Report that includes root cause analysis. The report becomes available to all customers approximately 10 business days after we identify the root cause.
FAQ
Did you consider failing over to your backup Authentication or rolling back the release?
Failover is not appropriate when the issue is traffic related. If we fail over to new hardware, we will simply bring the problem to the new hardware. Given that, in hindsight, we saw this same pattern in March, we did not believe it was directly attributable to the latest release.
Is it related to VeevaID?
No, it is not related to VeevaID. VeevaID has dedicated connection pools, and the volume of VeevaID requests is more easily measured and not yet significant.
Is this related to a DDoS attack? What are the cybersecurity and data integrity implications?
We believe that this is not a cyber attack because the majority of traffic is coming from authenticated processing or internal processing. There is not a high volume of direct access from anything other than expected sources. This issue has had no impact whatsoever on data integrity and has caused no data loss.
Were Vault Sandboxes taken offline and should we expect that going forward?
We had plans in place to take the sandboxes offline if the service issues continued on Thursday. However, the remediations we put in place were sufficient to restore the service without impact to the sandboxes. We do not have any future plans to take the sandboxes offline, nor do we see any such need at this time.
May 15 2024, 9:30am PDT
Outage Retrospective and Next Steps
The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024.
The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.
Six areas of Corrective and Preventative Actions (CAPAs) are tracked in Veeva’s QMS:
VAULT AUTHENTICATION SERVICES ARCHITECTURE
VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE
CONTINUED INVESTIGATION
TEST ENHANCEMENTS
MONITORING AND LOGGING
VAULT CAPACITY PLANNING
The detailed work to size and scope the specific actions is now underway, with a target of August 7th for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other, similar code inefficiencies. Any resulting changes will be made available once fully tested. In addition, more awareness and oversight have been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.
For further details, please refer to the Incident Report.
© 2024 Veeva Trust Site