Vault Authentication Service Outage - Summary

Informational
May 15 2024,9:32am PDT

Status: closed
Date: April 25 2024,12:00pm PDT
Affected Components:
Veeva CRM Veeva MultiChannel & Integrations Vault CRM Veeva Nitro Consumer Products Mobile Veeva Network Veeva CDMS Veeva Vault Veeva Align Vault CRM Align Veeva OpenData MyVeeva (ePRO, eConsent) Veeva SiteVault Veeva China SFA Veeva Crossix Veeva Compass Veeva Link Veeva Mobile Veeva RTSM Veeva Sub-processors update Veeva Support Locations MC-01 NITRO-US QualityOne OD-NA MyVeeva-US VDM1 MC-20 VA-US NITRO-EU QualityOne Audit Checklists OD-EU MyVeeva-EU Vault-US PODs SiteVault-US VCRM-US PROD CDMS-US MC-30 NITRO-AP QualityOne Station Manager VA-US-2 VDM2 MyVeeva-AP Vault-EU PODs SiteVault-EU VCRMA-US SBX CDMS-EU VDM3 ENGAGE-01 VA-US-3 Vault-AP PODs SiteVault-AP VCRM-EU PROD CDMS-AP WC-36 VDM5 VA-US-SBX VCRM-AP PROD VCRMA-EU PROD VA-EU VDM6 VDM20 VA-EU-2 VA-EU-3 VDM21 VA-EU-4 VDM22 VA-EU-5 VDM30 VA-EU-6 VDM40 SANDBOX VA-EU-SBX SANDBOX2 VA-AP SANDBOX3 VA-AP-SBX SANDBOX20 SANDBOX21 SANDBOX30 CRM-20 CRM-30 CRM-03 SMAR-CN VV2-5001 VV3-6001 VV1-1189 VV2-2145 CRM-04 VV1-1 VV2-10 VV3-33 cdb-1 cdb-101 VV1-1082 VV3-3122 VV2-2097 VV2-5000 VV1-4000 VV3-6000 VV1-1187 cdb-201 CRM-05 VV1-2 VV2-27 VV3-34 cdb-3 VV1-1110 VV2-2129 VV1-4001 VV2-2092 VV3-3121 CRM-06 VV1-3 VV2-28 VV3-3048 cdb-5 VV1-1111 VV2-2142 VV1-1191 CRM-08 VV1-4 VV2-29 VV3-3057 VV1-1160 cdb-7 CRM-10 VV1-5 VV2-30 VV3-3064 cdb-9 VV1-6 VV2-31 VV3-3086 cdb-11 VV1-7 VV2-32 VV3-3096 cdb-13 VV1-8 VV2-35 VV3-3099 cdb-15 VV1-9 VV2-41 VV3-3120 VV1-11 VV2-44 cdb-1002 VV3-3121 VV1-12 VV2-2047 VV1-13 VV2-2050 VV3-3123 VV1-17 VV1-14 VV2-2056 VV3-3124 VV1-22 VV1-15 VV2-2060 VV3-3125 VV1-37 VV1-16 VV2-2063 VV3-3127 VV1-1090 VV1-18 VV2-2070 VV3-3128 VV1-1127 VV1-19 VV2-2072 VV3-3129 VV1-1134 VV1-20 VV2-2075 VV1-1139 VV1-21 VV2-2080 VV1-1148 VV1-23 VV2-2083 VV1-1158 VV3-3131 VV1-24 VV2-2085 VV1-1159 VV1-25 VV2-2087 VV1-1161 VV1-26 VV2-2091 VV1-1174 VV1-38 VV1-1175 VV2-2092 VV1-39 VV2-2095 VV1-1193 VV1-40 VV1-42 VV2-2098 VV1-43 VV2-2120 VV1-1045 VV2-2121 VV1-1046 VV2-2122 VV1-1049 VV2-2124 VV1-1051 VV2-2125 VV1-1052 VV2-2126 VV1-1053 
VV2-2127 VV1-1054 VV2-2128 VV1-1055 VV1-1058 VV2-2130 VV1-1061 VV2-2131 VV1-1062 VV2-2132 VV1-1065 VV2-2133 VV1-1066 VV2-2134 VV1-1067 VV2-2135 VV1-1068 VV2-2136 VV1-1069 VV2-2137 VV1-1073 VV2-2138 VV1-1074 VV2-2139 VV1-1076 VV2-2140 VV1-1077 VV2-2141 VV1-1078 VV2-2142 VV1-1079 VV2-2143 VV1-1081 VV1-1084 VV1-1088 VV1-1089 VV1-1094 VV1-1120 VV1-1121 VV1-1122 VV1-1124 VV1-1126 VV1-1128 VV1-1129 VV1-1130 VV1-1131 VV1-1132 VV1-1133 VV1-1135 VV1-1136 VV1-1137 VV1-1138 VV1-1140 VV1-1142 VV1-1144 VV1-1145 VV1-1146 VV1-1149 VV1-1150 VV1-1151 VV1-1152 VV1-1153 VV1-1154 VV1-1155 VV1-1156 VV1-1157 VV1-1163 VV1-1164 VV1-1165 VV1-1166 VV1-1168 VV1-1169 VV1-1170 VV1-1171 VV1-1172 VV1-1173 VV1-1176 VV1-1178 VV1-1180 VV1-1181 VV1-1183 VV1-1184 VV1-1185 VV1-1186 VV1-1190 VV1-1192 VV1-1193
Update

April 25 2024,12:00pm PDT

Vault Outage - Authentication Service

2024.04.25 

Outage and Current Status

The recent issues with load on the authentication cluster resulted in the following outages:

  • Post Release 24R1: Monday, April 22; Tuesday, April 23; Wednesday, April 24

  • Post Infrastructure upgrade: Monday, March 25


As of April 25th, we have applied code fixes that appear to have stabilized the system.  We will have full confidence once the service has run for a week or more without downtime. There was no security issue related to customer data, no loss of customer data, and the outage was not a result of a security attack by an outside party.

The authentication cluster (many virtual computers) is a central point of failure because it controls security access. Security, of course, is critically important. If the whole authentication cluster is down, all Vaults are down. The authentication cluster is designed to be redundant and highly available, and historically it has been until these recent outages.

In each outage, the authentication cluster became overloaded and unable to process requests. The Veeva Engineering and Operations teams restored service by stopping the service, clearing the traffic and performing rolling, gradual restarts of the global Vault Points of Delivery (PODs). Restoring service this way resulted in outages of 1 to 3 hours, depending on the POD.
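A rolling, gradual restart can be pictured as bringing PODs back in small groups with a pause between groups, so the recovering authentication cluster is not hit by every POD at once. The sketch below is only an illustration of that idea, not Veeva's actual procedure: the function names, the batch size and the pause length are all assumptions, and the POD names are simply taken from the affected components list. It also shows why outage duration varied by POD, since later groups come back later.

```python
from typing import Iterator, List, Sequence, Tuple


def rolling_batches(pods: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield PODs in fixed-size groups so only one group restarts at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pods), batch_size):
        yield list(pods[start:start + batch_size])


def plan_restart(
    pods: Sequence[str], batch_size: int, pause_minutes: int
) -> List[Tuple[int, List[str]]]:
    """Return (start_offset_minutes, batch) pairs for a gradual restart.

    batch_size and pause_minutes are hypothetical tuning values.
    """
    return [
        (index * pause_minutes, batch)
        for index, batch in enumerate(rolling_batches(pods, batch_size))
    ]
```

With five PODs, a batch size of 2 and a 10-minute pause, the last POD in this plan starts 20 minutes after the first group.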

We take our commitment to delivering a reliable and performant service very seriously. With these outages, we recognize that we're falling short on that commitment. Veeva's entire leadership team is aware of the impact this has on your businesses and we are mobilizing all appropriate resources to both determine root cause and minimize the risk of future disruption.

Background: Vault Authentication Cluster

The Vault authentication cluster is a multi-region service (login.veevavault.com) that provides authentication to all Vaults around the world. Over 200 Vault PODs are located in 4 major Amazon regions (APAC, Europe, US East and US West). The PODs run the applications for all end users and API access. Authentication services verify user credentials before passing each user along to their respective Vaults.


To optimize performance and provide redundancy, Vault’s authentication cluster comprises four server groupings: a primary (controlling) server located in US East that supports both reading and writing to the authentication service, and 3 read-only secondary servers, one in each of the other regions (APAC, Europe and US West).

The read-only servers are synchronized with the primary server and can act as failovers should hardware issues take the primary server offline. Having regionally accessible read-only authentication provides a faster user experience when Vaults need to confirm active, valid sessions. The primary server needs to support read/write activities such as adding new users and processing all logins (where it not only validates the user’s credentials, but also writes login information to Vault’s login audit trail).
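The split between the primary and the read-only servers can be summarized as a routing rule: write activities go to the primary in US East, while session checks can be answered in the POD's own region. The following is a minimal sketch of that rule as described above; the `Operation` names and the `route` function are hypothetical and do not represent Veeva's implementation.

```python
from enum import Enum


class Operation(Enum):
    VALIDATE_SESSION = "validate_session"  # read-only check of an active session
    LOGIN = "login"                        # also writes to the login audit trail
    ADD_USER = "add_user"                  # writes a new user record


PRIMARY_REGION = "US East"
READ_ONLY_REGIONS = {"APAC", "Europe", "US West"}
WRITE_OPERATIONS = {Operation.LOGIN, Operation.ADD_USER}


def route(operation: Operation, pod_region: str) -> str:
    """Return the region whose authentication server should handle the request."""
    if pod_region != PRIMARY_REGION and pod_region not in READ_ONLY_REGIONS:
        raise ValueError(f"unknown region: {pod_region}")
    if operation in WRITE_OPERATIONS:
        # Only the primary supports writes, so every login lands here.
        return PRIMARY_REGION
    # Reads stay in-region for a faster response.
    return pod_region
```

Under this rule, every login worldwide reaches the primary server, which is why load on it matters for all regions.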

Outage Investigation, Findings & Corrective Actions

Over the last several days, Vault has experienced abnormal usage patterns that have bypassed our standard monitoring and service protection features. These patterns have come with an obvious increase in traffic (amount of data sent and received), major spikes in connection usage and an inability to recover after the surge of connections without operational intervention.

It appears at this time that the abnormal usage was not caused by an increase in user traffic, but rather by software bugs in the Veeva code that were causing too many calls to the authentication cluster in certain usage patterns. We have fixed two of the more significant bugs and will fix more over time. These code changes make the code more efficient, and they do not compromise security in any way.
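This summary does not describe the specific bugs or fixes. Purely as a hypothetical illustration of how a code inefficiency can multiply calls to an authentication service without any increase in user traffic, the sketch below compares validating a session once per item with validating it once per request. `CountingAuthClient` and both processing functions are invented for this example.

```python
from typing import List


class CountingAuthClient:
    """Hypothetical stand-in for an auth service that counts the calls it receives."""

    def __init__(self) -> None:
        self.calls = 0

    def validate(self, session_id: str) -> bool:
        self.calls += 1
        return session_id.startswith("valid-")


def process_per_item(auth: CountingAuthClient, session_id: str, items: List[str]) -> List[str]:
    """Inefficient: re-validates the same session for every item."""
    return [item.upper() for item in items if auth.validate(session_id)]


def process_once(auth: CountingAuthClient, session_id: str, items: List[str]) -> List[str]:
    """Validates the session once per request, then processes all items."""
    if not auth.validate(session_id):
        return []
    return [item.upper() for item in items]
```

Both functions return the same result and still validate the session before doing any work, but for a request with 100 items the first makes 100 validation calls and the second makes 1.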

Next Steps

We will continue with our hypercare monitoring until the system has proven to be stable for many days. We will continue to identify and fix any software bugs that are contributing to the issue, and we will develop a longer-term plan to address the root causes of how the software bugs were introduced in the first place. We will also formalize and share all findings in the Incident Report, which includes root cause analysis. This report is made available to all customers approximately 10 business days after we identify the root cause.


FAQ

Did you consider failing over to your backup Authentication or rolling back the release?

Failover is not appropriate when the issue is traffic related. If we fail over to new hardware, we simply bring the problem to the new hardware. Because, in hindsight, we had already seen this pattern in March, before Release 24R1, we did not feel it was directly attributable to the latest release.


Is it related to VeevaID?

No, it is not related to VeevaID. VeevaID has dedicated connection pools, and the volume of VeevaID requests is more easily measured and not yet significant.


Is this related to a DDoS attack? What are the cybersecurity and data integrity implications?

We believe this is not a cyberattack because the majority of the traffic is coming from authenticated processing or internal processing. There is not a high volume of direct access from anything other than expected sources. This issue has no impact whatsoever on data integrity and has caused no data loss.


Were Vault Sandboxes taken offline and should we expect that going forward?

We had plans in place to take the sandboxes offline if the service issues continued on Thursday.  However, the remediations we put in place were sufficient to restore the service without impact to the sandboxes.  We do not have any future plans to take the sandboxes offline, nor do we see any such need at this time.


Incident Summary

May 15 2024,9:30am PDT

Outage Retrospective and Next Steps


The following is an update regarding the Vault outages that occurred on 22, 23, and 24 April 2024. 


The Incident Report was made available on May 7 and details the changes implemented, particularly on the evening of 24 April, that led to Vault Auth stability. The Incident Report is available via Support or your account team.


Six Corrective and Preventative Actions (CAPAs) are tracked in Veeva’s QMS, covering the following areas:


  • VAULT AUTHENTICATION SERVICES ARCHITECTURE

  • VAULT APPLICATION CODE AND VAULT INFRASTRUCTURE

  • CONTINUED INVESTIGATION

  • TEST ENHANCEMENTS

  • MONITORING AND LOGGING

  • VAULT CAPACITY PLANNING


The detailed work to size and scope the specific actions is now underway, with a target of August 7th for a definitive set of plans. In addition to the formal CAPA process, Vault engineering is conducting a complete code review to identify any other, similar code inefficiencies. Any resulting changes will be made available once fully tested. In addition, more awareness and oversight have been given to Vault application design and coding practices pertaining to the use of Vault’s Auth service.


For further details, please refer to the Incident Report.