StatusBacon

Auth0

A universal identity platform for customers, employees and partners.

Homepage: https://auth0.com
Status Page: https://status.auth0.com


Open Incidents

No open incidents

Scheduled Incidents

No scheduled incidents

Previous Incidents

Delay in processing search indexing

2017-01-23 19:46:00
Action Date Description
Resolved 2017-01-23 22:08:00 Indexing continues to function normally with no backlog.
Monitoring 2017-01-23 20:56:00 We have cleared the backlog, search indexing is now running normally.
Update 2017-01-23 20:10:00 We have doubled indexing capacity, and the backlog is now being cleared in the US region.
Identified 2017-01-23 19:45:00 Our service is currently experiencing an abnormally high API call rate, which is causing a delay of 15-30 minutes in the indexing of new or updated user search information. We are adding capacity and monitoring. Authentication services are not affected and user search is functioning.

Temporary outage in EU for authentication

2017-01-03 14:01:35
Action Date Description
Resolved 2017-01-03 17:16:00 This incident has been resolved.
Monitoring 2017-01-03 14:38:00 The cause of the issue was a temporary loss of connectivity between our primary and secondary sites. This triggered rapid failover and failback between the two sites, which caused the transient outage.
Investigating 2017-01-03 14:01:00 A temporary outage occurred in the EU region for three minutes. All services are currently operating normally. We are investigating root cause.

Auth0 MFA (Guardian) is unresponsive

2016-12-21 19:27:30
Action Date Description
Resolved 2016-12-21 19:48:00 Team has restored service to Guardian MFA across all regions.
Identified 2016-12-21 19:33:00 Access to Guardian MFA has been restored for US customers. Team continues to investigate EU/AU regions.
Investigating 2016-12-21 19:27:00 Guardian MFA service is currently unable to connect to our backend databases. Our response team is currently reverting code commits.

Log in with rules failing in EU preview

2016-12-15 01:36:39
Action Date Description
Resolved 2016-12-15 03:34:00 All services are running normally.
Update 2016-12-15 02:57:00 Database services have been restored and we have switched back to the EU database. Extensions and CRON jobs are restored.
Monitoring 2016-12-15 02:51:00 Rules execution is now using the US webtask cluster and is responding. Extensions and cron jobs are still not functioning.
Update 2016-12-15 02:39:00 AWS is reporting issues with DNS, which is affecting our database provider's ability to provision. From the AWS status page: "5:50 PM PST We are investigating DNS resolution issues from some instances in the EU-WEST-1 Region." "6:29 PM PST We have identified the root cause of the DNS resolution issues in the EU-WEST-1 Region and continue working towards resolution." We are attempting to shift rules to another region.
Update 2016-12-15 02:03:00 There is an issue with the database cluster. We are switching to a new cluster.
Identified 2016-12-15 01:45:00 The rules cluster is not responding. Provisioning a new cluster.
Investigating 2016-12-15 01:33:00 Logins with rules are currently failing in the EU, investigating.

User indexing delay due to high load

2016-12-04 12:12:08
Action Date Description
Resolved 2016-12-03 16:13:00 We have completed adding capacity in US production. Search indexing is now operating normally.
Identified 2016-12-03 15:56:00 The time from which users are updated to when the updates are searchable is delayed by several minutes in the US production environment. The delay is caused by a demand surge. We are adding search indexing capacity.

Outage in Australia for a Subset of Subscribers

2016-11-19 18:23:35
Action Date Description
Postmortem 2016-11-19 18:23:00 On Nov 16 2016, from approximately 14:03 UTC to 14:53 UTC, Auth0 had an outage for a subset of subscribers in Australia (AU) due to an expired certificate. During off hours, larger subscribers were being moved to a separate PRODUCTION cluster in EU and AU as part of the remediation for an 8-minute outage on Nov 14th in PREVIEW environments. The AU cluster was set up and reporting healthy by the internal Auth0 probe, and by a third-party external probe (Pingdom), at 13:58 UTC. At 14:03 UTC subscribers were moved to the PRODUCTION environment. All internal and external probes continued to report healthy. At 14:45 UTC a customer filed an urgent ticket that they were down due to an expired Auth0 certificate. At 14:53 UTC the certificate issue was resolved by temporarily moving a node from PREVIEW into rotation. I am truly sorry for the impact to the operations of our subscribers as a result of this outage. Let me take this opportunity to share what occurred, what we did wrong, what we learned, and how we will prevent this from happening in the future.

## What Happened

Auth0 was moving a subset of subscribers from the PREVIEW environment to a new PRODUCTION environment. From our experience moving subscribers in the US and EU, we believed the move was simple and safe. The certificate was not correctly configured when the PRODUCTION environment was brought online. Internal probes do not check the TLS/SSL certificate, so they were all returning healthy responses. The external Pingdom probe, which runs over HTTPS, also did not alert on the certificate issue; we later discovered this is a limitation of Pingdom. From the perspective of the Auth0 service we had no indication of a failure. The first indication came when a subscriber filed a ticket 40 minutes after the switch. Because the problem and its scope were realized late and fixed quickly, the fix was completed before a status page incident was opened.

Auth0 monitors certificate expirations through other means: a scheduled job runs once a day and uses https://www.ssllabs.com/ to check certificate expiration, and we also have calendar notifications. We had not yet set up that scheduled job for this new environment.

## What we're doing about it

* Adding a certificate expiration check to the internal probe. This will prevent an instance from going into rotation without a valid certificate.
* Automating certificate deployment in the production environment. (In progress)
* Adding a step to the new-cluster installation procedure to run the SSL scheduled job and ensure the cluster passes before putting it online.

## Summary

I realize how important it is that your users are able to connect to their applications, and how Auth0 is a core part of that ability. I take our commitment to reliability and transparency very seriously and regret letting you down. Thank you for your understanding and your continued support of Auth0.

Chris Keyser
DevOps Owner
Resolved 2016-11-16 14:53:00 An outage occurred when moving a subset of subscribers in the Australia region from the PREVIEW environment to the PRODUCTION environment during off hours on Nov 16 2016 at 14:03 UTC. The outage occurred due to an undetected expired certificate. External monitoring (Pingdom) failed to report the issue due to a limitation in their capabilities. The issue was first detected at 14:45 UTC and was fully resolved by 14:53 UTC. Please see the detailed post-mortem for more information.
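Auth0's internal probe code is not public, but the first remediation item in the postmortem above (refusing to put a node into rotation when its TLS certificate is expired or near expiry) can be sketched with Python's standard ssl module. The function names, the 14-day threshold, and the probe structure are illustrative assumptions, not Auth0's actual implementation:

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after: str) -> datetime:
    """Parse a certificate's 'notAfter' field, e.g. 'Nov 16 14:00:00 2016 GMT'."""
    parsed = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)

def fetch_cert_not_after(host: str, port: int = 443) -> datetime:
    """Open a TLS connection to host and return its certificate's expiry time."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return parse_not_after(tls.getpeercert()["notAfter"])

def cert_probe_ok(host: str, min_days: int = 14) -> bool:
    """Rotation gate: fail the health check if the certificate is already
    expired or will expire within min_days (hypothetical threshold)."""
    remaining = fetch_cert_not_after(host) - datetime.now(timezone.utc)
    return remaining.days >= min_days
```

Running this check inside the internal probe, rather than only in a daily scheduled job, would have failed the AU cluster's health check before subscribers were moved onto it.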

Issues With Login Failures in US BETA, EU, and AU

2016-11-18 00:38:14
Action Date Description
Postmortem 2016-11-17 22:25:00 On November 14th at 21:11 UTC, Auth0 experienced an 8-minute outage of the authentication API for US customers in the PREVIEW environment, for most customers in Europe, and for all customers in Australia. The PRODUCTION environment in the US region was unaffected. The outage was caused by an update to the Auth0 authentication code that contained an error in the caching strategy. I am truly sorry for the impact to the daily operations of our subscribers as a result of this outage. Downtime of any length is unacceptable to us. Let me take this opportunity to share what occurred, what we did wrong, what we learned, and how we will prevent this from happening in the future.

## What Happened

An engineer applied security updates to the continuous integration (CI) server that manages build, test, and deploy within the Auth0 environment. During the security update, the QA server paused deployment due to test failures caused by the bug in the authentication pipeline. After applying the security updates and restarting the CI service, the engineer conferred with two additional engineers and tested the updates by manually running the post-QA deployment job. This action effectively overrode the QA pause. Shortly thereafter, operations in all CI environments began to fail, and the change was immediately rolled back. The PREVIEW environment in the US and most customers in the EU and AU were affected. The engineers reviewed the build process and scripts for the post-QA deployment job, and missed that it deployed the last QA build rather than the latest QA build that had passed its tests; they therefore thought it was a safe operation to perform. The job instead deployed the last QA build, which was failing tests. Normally the post-build process is triggered automatically only after a successful QA build, and therefore works.

The deploy process is not typically invoked manually, and the weakness in the post-deploy job logic that allowed a manual push of a failing QA build had not been identified. In the European and Australian regions, subscribers had been deployed in a single environment that uses the CI update process, with a single PRODUCTION monitoring probe for each region. Last week we had started to execute on a plan to move some subscribers into a delayed deployment environment in Europe and Australia as well. The overall event lasted a total of 13 minutes, with a maximum of 8 minutes of outage within any one region. The deploys run in sequence and are not exactly aligned in time across regions, which is why the overall event is longer than the maximum outage in any region.

## What we're doing about it

* We have updated the post-QA job to always pull the latest build that passed QA. (Done)
* We have updated the post-QA job to require an explicit override. (Done)
* We are staggering deployments between PREVIEW environments in time. (In progress)
* We have accelerated moving larger subscribers in AU and EU into the PRODUCTION environment. (In progress)
* We have added probes for the EU and AU PREVIEW and PRODUCTION environments. (Done)
* We will review the rollback process and identify how to speed up time to recovery after a rollback is initiated.

## Summary

We realize how important it is that your users are able to connect to their applications, and how Auth0 is a core part of that ability. We take our commitment to reliability and transparency very seriously and regret letting you down. Thank you for your understanding and your continued support of Auth0.

Chris Keyser
DevOps Owner
Resolved 2016-11-14 21:36:00 Services are operating normally. Our customers were impacted in US PREVIEW, EU, and AU environments.
Monitoring 2016-11-14 21:20:00 The issue has been resolved, and we are monitoring status.
Update 2016-11-14 21:14:00 The issue is identified and we are reverting the change.
Investigating 2016-11-14 21:07:00 We are currently seeing issues with logins in the US PREVIEW environment, EU, and AU. We are investigating.
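The post-QA job flaw described in this postmortem (deploying the most recent QA build rather than the most recent build that passed QA) is easy to see in a toy model. The `Build` record and both selection functions below are hypothetical, not Auth0's actual CI code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Build:
    number: int       # monotonically increasing build number
    passed_qa: bool   # whether the QA test stage succeeded

def latest_build(builds: List[Build]) -> Build:
    """The flawed selection: most recent build, regardless of QA result.
    Safe only when triggered automatically after a successful QA run."""
    return max(builds, key=lambda b: b.number)

def latest_passing_build(builds: List[Build]) -> Optional[Build]:
    """The fix: most recent build that actually passed QA, or None if
    no build has passed (so a manual run cannot push a failing build)."""
    passing = [b for b in builds if b.passed_qa]
    return max(passing, key=lambda b: b.number) if passing else None
```

With builds 1 and 2 passing and build 3 failing QA, a manual run of the flawed job ships build 3, while the fixed job ships build 2 — which is exactly the difference between this outage and a no-op.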

Small percentage of errors in login transactions

2016-11-07 22:41:18
Action Date Description
Resolved 2016-11-07 22:41:00 Resolving the incident. Things have been working normally for 40 minutes.
Monitoring 2016-11-07 22:00:00 The root cause of the slow DB response times has been fixed. We are monitoring logins.
Identified 2016-11-07 21:55:00 Some database queries are slower than expected. We are working on a fix.
Investigating 2016-11-07 21:50:00 Our monitoring system has alerted on a few new failures.
Monitoring 2016-11-07 21:35:00 Things are looking normal again; we are looking for the issue's root cause.
Investigating 2016-11-07 21:30:00 We are currently investigating this issue.

Connectivity issues with our marketing website

2016-10-24 20:19:35
Action Date Description
Resolved 2016-10-21 22:32:00 We will continue to monitor.
Monitoring 2016-10-21 20:22:00 We have updated our website and docs to avoid the failing requests. Both components are back up.
Investigating 2016-10-21 19:21:00 We are currently investigating this issue. Runtime, APIs and Dashboard are not affected.

Email delivery is delayed for some customers due to SMTP timeouts

2016-10-21 17:39:32
Action Date Description
Resolved 2016-10-21 18:36:00 Delivery has stabilized. We will continue to monitor.
Monitoring 2016-10-21 17:58:00 Delivery for the rest of our customers continues to look normal
Identified 2016-10-21 17:50:00 Some of our customers using custom SMTP servers are affected by this. Most emails are being delivered correctly.
Investigating 2016-10-21 17:38:00 We are currently investigating this issue.