A universal identity platform for customers, employees and partners.
Homepage: https://auth0.com
Status Page: https://status.auth0.com
Action | Date | Description |
---|---|---|
Resolved | 2017-01-23 22:08:00 | Indexing continues to function normally with no backlog. |
Monitoring | 2017-01-23 20:56:00 | We have cleared the backlog; search indexing is now running normally. |
Update | 2017-01-23 20:10:00 | We have doubled indexing capacity, and the backlog is now being cleared in the US region. |
Identified | 2017-01-23 19:45:00 | Our service is currently experiencing an abnormally high API call rate, which is causing a 15-30 minute delay in indexing new or updated user search information. We are adding capacity and monitoring. Authentication services are not affected and user search is functioning. |
Action | Date | Description |
---|---|---|
Resolved | 2017-01-03 17:16:00 | This incident has been resolved. |
Monitoring | 2017-01-03 14:38:00 | The cause of the issue was a temporary loss of connectivity between our primary and secondary sites. This caused rapid failover and failback between the primary and secondary sites, which resulted in the transient outage. |
Investigating | 2017-01-03 14:01:00 | A temporary outage occurred in the EU region for three minutes. All services are currently operating normally. We are investigating root cause. |
Action | Date | Description |
---|---|---|
Resolved | 2016-12-21 19:48:00 | Team has restored service to Guardian MFA across all regions. |
Identified | 2016-12-21 19:33:00 | Access to Guardian MFA has been restored for US customers. Team continues to investigate EU/AU regions. |
Investigating | 2016-12-21 19:27:00 | The Guardian MFA service is currently unable to connect to our backend databases. Our response team is reverting code commits. |
Action | Date | Description |
---|---|---|
Resolved | 2016-12-15 03:34:00 | All services are running normally. |
Update | 2016-12-15 02:57:00 | Database services have been restored and we have switched back to the EU database. Extensions and cron jobs are restored. |
Monitoring | 2016-12-15 02:51:00 | Rules execution is now using the US webtask cluster and is responding. Extensions and cron jobs are not functioning. |
Update | 2016-12-15 02:39:00 | AWS is reporting issues with DNS, which is affecting our database provider's ability to provision. AWS status updates: 5:50 PM PST - "We are investigating DNS resolution issues from some instances in the EU-WEST-1 Region." 6:29 PM PST - "We have identified the root cause of the DNS resolution issues in the EU-WEST-1 Region and continue working towards resolution." We are attempting to shift rules to another region. |
Update | 2016-12-15 02:03:00 | There is an issue with the database cluster. We are switching to a new cluster. |
Identified | 2016-12-15 01:45:00 | The rules cluster is not responding. Provisioning a new cluster. |
Investigating | 2016-12-15 01:33:00 | Logins with rules are currently failing in the EU; investigating. |
Action | Date | Description |
---|---|---|
Resolved | 2016-12-03 16:13:00 | We have completed adding capacity in US production. Search indexing is now operating normally. |
Identified | 2016-12-03 15:56:00 | The delay between when users are updated and when those updates become searchable is currently several minutes in the US production environment. The delay is caused by a demand surge; we are adding search indexing capacity. |
Action | Date | Description |
---|---|---|
Postmortem | 2016-11-19 18:23:00 | On Nov 16, 2016, from approximately 14:03 UTC to 14:53 UTC, Auth0 had an outage for a subset of subscribers in Australia (AU) due to an expired certificate. During off hours, larger subscribers were being moved to a separate PRODUCTION cluster in EU and AU as part of the remediation for an 8-minute outage on Nov 14th in PREVIEW environments. The AU cluster was set up and reporting healthy by the internal Auth0 probe, and by a third-party external probe (Pingdom), at 13:58 UTC. At 14:03 UTC subscribers were moved to the PRODUCTION environment. All internal and external probes continued to report healthy. At 14:45 UTC a customer filed an urgent ticket reporting that they were down due to an expired Auth0 certificate. At 14:53 UTC the certificate issue was resolved by temporarily moving a node from PREVIEW into rotation. I am truly sorry for the impact to the operations of our subscribers as a result of this outage. Let me take this opportunity to share what occurred, what we did wrong, what we learned, and how we will prevent this from happening in the future. ## What Happened Auth0 was moving a subset of subscribers from the PREVIEW environment to a new PRODUCTION environment. From our experience moving subscribers in the US and EU, we believed the move was simple and safe. The certificate was not correctly configured when the PRODUCTION environment was brought online. Internal probes do not check the TLS/SSL certificate, so they were all returning healthy responses. The external Pingdom probe, which runs over HTTPS, also did not alert on the certificate issue; we later discovered this is a limitation of Pingdom. From the perspective of the Auth0 service we had no indication of a failure. The first indication came when a subscriber filed a ticket 40 minutes after the switch. Because the problem and its scope were recognized late and fixed quickly, the fix was completed before a status page incident was opened. Auth0 monitors certificate expirations through other means: a scheduled job runs once a day and uses https://www.ssllabs.com/ to check certificate expiration, and we also have calendar notifications. We had not yet set up that scheduled job for this new environment. ## What We're Doing About It * Adding a certificate expiration check to the internal probe, which will prevent an instance from going into rotation without a valid certificate (a sketch of this kind of check follows this table). * Automating certificate deployment in the production environment. (In progress) * Adding a step to the new-cluster installation procedure to run the SSL scheduled job and ensure the cluster passes before putting it online. ## Summary I realize how important it is that your users are able to connect to their applications, and how Auth0 is a core part of that ability. I take our commitment to reliability and transparency very seriously and regret letting you down. Thank you for your understanding and your continued support of Auth0. Chris Keyser, DevOps Owner |
Resolved | 2016-11-16 14:53:00 | An outage occurred when moving a subset of subscribers in the Australia region from the PREVIEW environment to the PRODUCTION environment during off hours on Nov 16 2016 at 14:03 UTC. The outage occurred due to an undetected expired certificate. External monitoring (Pingdom) failed to report the issue due to a limitation in their capabilities. The issue was first detected at 14:45 UTC and was fully resolved by 14:53 UTC. Please see the detailed post-mortem for more information. |
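To illustrate the first remediation item in the postmortem above, here is a minimal, hypothetical sketch of a certificate-expiry check that a health probe could run, using Python's standard `ssl` module. The hostname, threshold, and function names are assumptions for the example and are not Auth0's actual probe code.

```python
# Hypothetical sketch of a TLS certificate-expiry health check (not Auth0's actual probe).
import socket
import ssl
import time

def days_until_cert_expiry(hostname: str, port: int = 443) -> float:
    """Connect over TLS and return how many days remain before the served certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

def certificate_probe(hostname: str, min_days: float = 14.0) -> bool:
    """Report unhealthy when the certificate is invalid, expired, or close to expiring."""
    try:
        return days_until_cert_expiry(hostname) >= min_days
    except (ssl.SSLError, OSError):
        # An expired or otherwise invalid certificate (or an unreachable host)
        # fails the probe, keeping the instance out of rotation.
        return False

if __name__ == "__main__":
    # Placeholder hostname for illustration only.
    print(certificate_probe("example.invalid"))
```

A check of this shape fails before the expiry threshold is even evaluated when the handshake hits an expired certificate, which is what would keep a misconfigured node from being put into rotation.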
Action | Date | Description |
---|---|---|
Postmortem | 2016-11-17 22:25:00 | On November 14th at 21:11 UTC, Auth0 experienced an 8-minute outage of the authentication API for US customers in the PREVIEW environment, for most customers in Europe, and for all customers in Australia. The PRODUCTION environment in the US region was unaffected. The outage was caused by an update to the Auth0 authentication code that contained an error in the caching strategy. I am truly sorry for the impact to the daily operations of our subscribers as a result of this outage. Downtime of any length is unacceptable to us. Let me take this opportunity to share what occurred, what we did wrong, what we learned, and how we will prevent this from happening in the future. ## What Happened An engineer applied security updates to the continuous integration (CI) server that manages build, test, and deploy within the Auth0 environment. During the security update, the QA server had paused deployment due to test failures caused by a bug in the authentication pipeline. After applying the security updates and restarting the CI service, the engineer conferred with two additional engineers and tested the updates by manually running the post-QA deployment job. This action effectively overrode the QA pause. Shortly thereafter, operations in all CI environments began to fail, and the change was immediately rolled back. The PREVIEW environment in the US and most customers in the EU and AU were affected. The engineers reviewed the build process and scripts for the post-QA deployment job, but missed that the job deploys the last QA build rather than the latest successful QA build; they therefore thought it was a safe operation to perform. The job instead deployed the last QA build, which was failing tests. Normally the post-QA process is triggered automatically only after a successful QA build, and therefore works correctly. The deploy process is not typically invoked manually, and the weakness in the post-QA job logic that allowed a manual push of a failing QA build had not been identified. In the European and Australian regions, subscribers are deployed in a single environment that uses the CI update process, with a single PRODUCTION monitoring probe for each region. Last week we started to execute on a plan to move some subscribers into a delayed deployment environment in Europe and Australia as well. The overall event lasted a total of 13 minutes, and within each region there was a maximum of 8 minutes of outage. Because the deploys run in sequence, the deploys for each region are not exactly aligned in time, which is why the overall event was longer than the maximum outage in any region. ## What We're Doing About It * We have updated the post-QA job to always pull the latest build that passed QA (a sketch of this selection logic follows this table). (Done) * We have updated the post-QA job to require an explicit override. (Done) * We are staggering deployments between PREVIEW environments in time. (In Progress) * We have accelerated moving larger subscribers in AU and EU into the PRODUCTION environment. (In Progress) * We have added probes for EU and AU PREVIEW and PRODUCTION environments. (Done) * We are reviewing the rollback process to identify how to speed up time to recovery after a rollback is initiated. ## Summary We realize how important it is that your users are able to connect to their applications, and how Auth0 is a core part of that ability. We take our commitment to reliability and transparency very seriously and regret letting you down. Thank you for your understanding and your continued support of Auth0. Chris Keyser, DevOps Owner |
Resolved | 2016-11-14 21:36:00 | Services are operating normally. Our customers were impacted in US PREVIEW, EU, and AU environments. |
Monitoring | 2016-11-14 21:20:00 | The issue has been resolved, and we are monitoring status. |
Update | 2016-11-14 21:14:00 | The issue is identified and we are reverting the change. |
Investigating | 2016-11-14 21:07:00 | We are currently seeing issues with logins in the US PREVIEW environment, EU, and AU. We are investigating. |
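To illustrate the post-QA job changes listed in the postmortem above (pulling the latest build that passed QA, and requiring an explicit override), here is a minimal, hypothetical Python sketch. The `Build` structure and function names are assumptions for the example, not Auth0's actual CI code.

```python
# Hypothetical sketch of post-QA deploy selection logic (not Auth0's actual CI code).
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Build:
    number: int       # monotonically increasing build number
    qa_passed: bool   # result of the QA test stage
    artifact: str     # artifact that would be deployed

def latest_successful_qa_build(builds: Sequence[Build]) -> Optional[Build]:
    """Return the most recent build that passed QA, not simply the last build."""
    passing = [b for b in builds if b.qa_passed]
    return max(passing, key=lambda b: b.number) if passing else None

def deploy_post_qa(builds: Sequence[Build], override: bool = False) -> str:
    """Select the deploy artifact, refusing to proceed past a failing newest build unless overridden."""
    candidate = latest_successful_qa_build(builds)
    if candidate is None:
        raise RuntimeError("no build has passed QA; nothing to deploy")
    newest = max(builds, key=lambda b: b.number)
    if not newest.qa_passed and not override:
        # The newest build failed QA; a manual deploy at this point must be explicit.
        raise RuntimeError(
            f"build {newest.number} failed QA; pass override=True to deploy build {candidate.number}"
        )
    return candidate.artifact
```

Selecting the highest-numbered build among those that passed QA, rather than simply taking the newest build, is what prevents a manually triggered deploy from shipping a build whose QA stage failed.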
Action | Date | Description |
---|---|---|
Resolved | 2016-11-07 22:41:00 | Resolving the incident. Things have been working normally for 40 minutes. |
Monitoring | 2016-11-07 22:00:00 | The root cause of the slow DB response times has been fixed. Monitoring logins. |
Identified | 2016-11-07 21:55:00 | Some database queries are slower than expected. We are working on a fix. |
Investigating | 2016-11-07 21:50:00 | Our monitoring system has alerted us to a few new failures. |
Monitoring | 2016-11-07 21:35:00 | Things are looking normal again; we are looking for the issue's root cause. |
Investigating | 2016-11-07 21:30:00 | We are currently investigating this issue. |
Action | Date | Description |
---|---|---|
Resolved | 2016-10-21 22:32:00 | We will continue to monitor. |
Monitoring | 2016-10-21 20:22:00 | We have updated our website and docs to avoid the failing requests. Both components are back up. |
Investigating | 2016-10-21 19:21:00 | We are currently investigating this issue. Runtime, APIs, and Dashboard are not affected. |
Action | Date | Description |
---|---|---|
Resolved | 2016-10-21 18:36:00 | Delivery has stabilized. We will continue to monitor. |
Monitoring | 2016-10-21 17:58:00 | Delivery for the rest of our customers continues to look normal. |
Identified | 2016-10-21 17:50:00 | Some of our customers using custom SMTP servers are affected by this. Most emails are being delivered correctly. |
Investigating | 2016-10-21 17:38:00 | We are currently investigating this issue. |