server-density
Homepage: https://www.serverdensity.com/
Status Page: http://status.serverdensity.com/
Action | Date | Description |
---|---|---|
Resolved | 2016-12-30 12:27:00 | This incident is now resolved. We are considering this a new occurrence of the December 23rd incident (http://status.serverdensity.com/incidents/2wt5tffd150t), on which we are still actively working with our provider. We will provide a detailed postmortem when this work is completed. |
Monitoring | 2016-12-30 09:10:00 | We have now restored capacity on our metrics processing cluster. Some devices will show a metrics gap between 08:20 and 08:51, which will even out. We'll continue to monitor this closely. |
Identified | 2016-12-30 08:46:00 | We're experiencing a slowdown in displaying metrics graphs on dashboards. The issue is identified and we expect to recover soon. Alerting is not affected. |
Action | Date | Description |
---|---|---|
Completed | 2017-01-09 15:35:00 | This maintenance was completed without issue. |
Scheduled | 2017-01-03 12:00:00 | On January 3rd 2017 between 12:00-16:00 UTC we will be deploying an improvement to our alert notification code that may prevent open group alerts from closing automatically at PagerDuty. This update changes the way that alerts are identified to allow for more granularity of group alert notifications; however, this means that any group alerts open at the time of the deployment will not have the same id passed in the fixed message, which will prevent those alerts from being closed automatically. Affected users will need to manually close/resolve affected alerts at PagerDuty. Alerts triggered after the deployment of this update will close automatically, as expected, once the alert closes at Server Density. Change: Improvement of group alert notification identification. Affected Users: PagerDuty notification users. Expected Impact: Once the deployment is completed, any group alerts that were open at PagerDuty at the time of the deployment will not be automatically closed once the alert has been closed at Server Density. Affected users will need to manually close/resolve the alerts at PagerDuty. |
Scheduled | 2016-12-23 15:25:00 | On January 3rd 2017 between 12:00-16:00 UTC we will be deploying an improvement to our alert notification code that may prevent open group alerts from closing automatically at PagerDuty. This update changes the way that alerts are identified to allow for more granularity of group alert notifications; however, this means that any group alerts open at the time of the deployment will not have the same id passed in the fixed message, which will prevent those alerts from being closed automatically. Affected users will need to manually close/resolve affected alerts at PagerDuty. Alerts triggered after the deployment of this update will close automatically, as expected, once the alert closes at Server Density. Change: Improvement of group alert notification identification. Affected Users: PagerDuty notification users. Expected Impact: Once the deployment is completed, any group alerts that were open at PagerDuty at the time of the deployment will not be automatically closed once the alert has been closed at Server Density. Affected users will need to manually close/resolve the alerts at PagerDuty. (An illustrative sketch of the identifier matching follows this table.) |
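The identifier matching described above is the crux of auto-close behaviour: the "fixed" (resolve) event must carry the same identifier as the original trigger. Below is a minimal sketch using PagerDuty's Events API v2; the routing key, identifiers, and helper names are hypothetical, and this is not necessarily the exact integration Server Density uses.

```python
# Illustrative only: why a matching identifier is needed for auto-close.
# Assumes PagerDuty's Events API v2; the actual Server Density integration may differ.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "<integration-key>"  # hypothetical placeholder


def trigger(dedup_key: str, summary: str) -> None:
    """Open (or re-trigger) an incident identified by dedup_key."""
    requests.post(EVENTS_URL, json={
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {"summary": summary, "source": "serverdensity", "severity": "error"},
    })


def resolve(dedup_key: str) -> None:
    """Close the incident -- only matches if dedup_key equals the one used to trigger."""
    requests.post(EVENTS_URL, json={
        "routing_key": ROUTING_KEY,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    })


# If the identifier scheme changes between trigger and resolve (e.g. from
# "group-alert-123" to "group-alert-123:host-a"), the resolve call no longer
# matches the open incident, so it has to be closed manually at PagerDuty.
```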
Action | Date | Description |
---|---|---|
Resolved | 2016-12-23 11:48:00 | Our provider has confirmed the cause of the outage and does not expect a recurrence. As such we are considering this resolved and will publish our provider's root cause analysis together with our own postmortem. |
Monitoring | 2016-12-22 19:37:00 | We have now restored redundancy and have contacted our provider about the initial failure. We'll continue to monitor this closely. |
Identified | 2016-12-22 19:04:00 | We have identified the cause of the second cluster failure and recovered from it - graphs and metrics collection are functional again. We will now determine the cause of the first, initial cluster failure. |
Update | 2016-12-22 18:45:00 | Another metrics cluster is down, so metrics writes are still affected. |
Investigating | 2016-12-22 18:19:00 | We are suffering downtime in one of our metrics clusters, which is causing gaps in graphs starting at 17:41 UTC. Alert processing was not affected. |
Action | Date | Description |
---|---|---|
Resolved | 2016-12-23 11:45:00 | This incident has been resolved. |
Identified | 2016-12-16 09:55:00 | Both of our redundant nodes at the USA, New York monitoring location are unreachable. We have received the following notification from our provider: "Our efforts to revive node5 have not been successful. A spare server is currently being shipped to replace the faulty one." Until we are able to access one node again, checks from this location will not take place. We recommend setting up 3 monitoring locations per service check to avoid relying on a single location; if that is not desirable, the North Virginia location, which is close to NYC, may be used instead. https://support.serverdensity.com/hc/en-us/articles/201091476-Monitoring-node-locations-and-IP-addresses |
Action | Date | Description |
---|---|---|
Completed | 2016-12-15 11:30:00 | The scheduled maintenance has been completed. |
Scheduled | 2016-12-15 11:15:00 | Per request of our hosting provider, we are changing the IP address of one of our Manchester monitoring location nodes from 185.120.34.52 to 185.157.232.52. Customers that use this location and have whitelisted the previous address will need to update their rules. The tests issued from that location will continue to work, as only one node of the redundant pair has been changed. |
Scheduled | 2016-12-15 11:04:00 | Per request of our hosting provider, we are changing the IP address of one of our Manchester monitoring location nodes from 185.120.34.52 to 185.157.232.52. Customers that use this location and have whitelisted the previous address will need to update their rules. The tests issued from that location will continue to work, as only one node of the redundant pair has been changed. (An illustrative whitelist update sketch follows this table.) |
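For customers whitelisting monitoring node addresses, the change amounts to allowing the new address before removing the old one. A minimal sketch, assuming an iptables-based allowlist; your firewall and rule layout will almost certainly differ.

```python
# Illustrative only: swap a monitoring-node allowlist entry, assuming plain
# iptables INPUT rules. Adapt to whatever firewall actually fronts your hosts.
import subprocess

OLD_IP = "185.120.34.52"   # previous Manchester node address
NEW_IP = "185.157.232.52"  # replacement address

# Allow the new address first so checks keep passing, then drop the old rule.
subprocess.run(["iptables", "-I", "INPUT", "-s", NEW_IP, "-j", "ACCEPT"], check=True)
subprocess.run(["iptables", "-D", "INPUT", "-s", OLD_IP, "-j", "ACCEPT"], check=True)
```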
Action | Date | Description |
---|---|---|
Resolved | 2016-12-03 22:46:00 | The incident has been resolved. |
Monitoring | 2016-12-03 22:19:00 | We have enabled no data alerts and are monitoring the recovery. |
Identified | 2016-12-03 22:10:00 | We've identified the cause as one failed DB node, which has been taken out of rotation; payload processing is normalizing. We will be turning no data alerts back on in a few minutes. |
Update | 2016-12-03 21:20:00 | We're continuing to resolve the issue with our payload processing clusters. Alerts are being processed at a slower rate, and we are still keeping no data alerts disabled. |
Investigating | 2016-12-03 20:56:00 | We're looking into an issue with processing payloads. Alerts may trigger more slowly than expected, or not at all, while this incident is ongoing, and we have disabled no data alerts as a precaution. |
Action | Date | Description |
---|---|---|
Completed | 2016-11-19 13:14:00 | This maintenance is now complete. The impact was: graphs did not receive updates between 08:04 and 08:08 UTC and will show that gap until it evens out in 3 days. The UI was inaccessible for a total of 8 minutes, between 09:32 and 09:35 UTC and between 11:05 and 11:10 UTC. |
Verifying | 2016-11-19 11:33:00 | The third and last batch of this maintenance is complete. UI was inaccessible between 11:05 and 11:10 UTC. |
Update | 2016-11-19 09:49:00 | The second batch of this maintenance is complete. UI was inaccessible between 09:32 and 09:35 UTC. The next batch will start at 11:00 UTC. |
Update | 2016-11-19 08:26:00 | The first batch of this maintenance is complete. Graphs did not receive updates between 08:04 and 08:08 UTC. The next batch will start at 09:30 UTC. |
Scheduled | 2016-11-19 08:00:00 | Our infrastructure provider has been notified of a potential vulnerability affecting our Virtual Servers. The remediation will require maintenance to the hypervisor nodes and a reboot of all the virtual server instances on those nodes. Given our redundancy, historically these reboots have caused just a few minutes of downtime while the rebooted nodes recover. We're scheduling a wider maintenance window to account for all the reboots across the entire fleet; however, we don't expect more than a few minutes of downtime during this window. |
Action | Date | Description |
---|---|---|
Identified | 2016-11-13 00:22:00 | A network issue degraded payload processing; no data protection caused some false positives. |
Resolved | 2016-11-13 00:22:00 | The network issue is now solved and everything is back to normal. |
Investigating | 2016-11-12 23:38:00 | We are seeing a decrease in processed payloads; we have disabled no data alerts while investigating. |
Action | Date | Description |
---|---|---|
Postmortem | 2016-11-29 18:34:00 | On November 8, between 16:18 and 17:15 UTC, we experienced elevated latency to and from the primary server of one of our database clusters, which caused elevated response times on payload processing and alerting. We run our MongoDB clusters using a replica set configuration. This allows us to quickly fail over to a secondary database server when the primary fails. Unfortunately, this wasn't a complete failure, as the elevated latency caused the secondaries to fall behind on operations replicating from the primary. Our monitoring picked up the secondaries falling behind, which allowed us to quickly identify the root cause of the issue, but by that time, with all secondaries behind, we no longer had the option of a fast failover. We then decided to pause replication on the secondaries to alleviate the traffic, and shortly after we saw improved latency measurements. Once the network latency returned to normal, we proceeded to synchronize the secondaries and resume replication. We have been working with our infrastructure provider Softlayer for the last 2 weeks but they have yet to pinpoint the actual cause. Another update will be made once we have a root cause analysis from Softlayer. (An illustrative replication-lag check sketch follows this table.) |
Resolved | 2016-11-09 16:04:00 | We have not observed high latency in the affected cluster in the last 16h. All systems have been running normally. We're still working with our provider to get a root cause for this event and we'll share it when we get it. |
Monitoring | 2016-11-08 22:56:00 | We have now completed the synchronization of the affected database cluster secondaries and will continue to monitor the network latency that caused this issue over the next few hours. |
Update | 2016-11-08 18:22:00 | Although the measured latency is still showing high variance, we are starting to see mostly normal values. As such, device no data alerts have been enabled. Server Density is operating normally again now. We will continue to work with our provider until full recovery, as well as work on catching up the affected database cluster secondaries. |
Update | 2016-11-08 17:16:00 | We are measuring improved latency now, but it is still higher than normal. This will result in normal service, except for the device no data alerts, which we are keeping disabled until we measure normal latency again. |
Update | 2016-11-08 17:02:00 | The database cluster slowdown has been identified as caused by high network latency. We are now working with our provider to restore normal service. We have identified the following service impact: device payload processing is slower and gaps may show on graphs; alerts may be up to 5 minutes delayed; device no data alerts are disabled; API calls may occasionally return 500 errors. |
Identified | 2016-11-08 16:33:00 | We're experiencing a slowdown on one of our database clusters. We're working to restore service responsiveness. |
Investigating | 2016-11-08 16:00:00 | We're currently investigating a slowdown in device payload processing. |
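The postmortem above hinges on spotting secondaries falling behind the primary. Below is a minimal sketch of that kind of lag check, using MongoDB's replSetGetStatus command via pymongo; the hostname and threshold are illustrative, not Server Density's actual monitoring.

```python
# Illustrative replica-set lag check; hostname and threshold are made up.
from datetime import timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://db-cluster.example:27017")  # hypothetical host
status = client.admin.command("replSetGetStatus")

primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
for member in status["members"]:
    if member["stateStr"] != "SECONDARY":
        continue
    # optimeDate is the timestamp of the last operation applied on that member.
    lag = primary["optimeDate"] - member["optimeDate"]
    if lag > timedelta(seconds=60):
        print(f"{member['name']} is {lag} behind the primary")
```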
Action | Date | Description |
---|---|---|
Postmortem | 2016-11-29 11:59:00 | On November 3, between 12:25 and 12:32 UTC, during a planned maintenance operation on one of our MongoDB clusters, an election took place earlier than expected, causing one server still in maintenance to become primary. As the service clusters had not yet been configured to allow access to this server, the UI dashboards didn't load. This election was manually reverted a couple of minutes later and service was restored. We have reviewed and updated this particular maintenance procedure to prevent this from happening again. (An illustrative sketch of one such safeguard follows this table.) |
Resolved | 2016-11-03 15:52:00 | We have now forced a new election on this cluster to confirm it is working as expected. A postmortem of this event will be published in the coming days. |
Update | 2016-11-03 12:58:00 | Only those devices that received a provisioning update and a new agent key (using Puppet, Chef, or Ansible) during the election showed missing data. |
Update | 2016-11-03 12:47:00 | Some devices are showing missing data. We're continuing to investigate. |
Identified | 2016-11-03 12:32:00 | One of our MongoDB clusters was in mid-election and taking more time than usual. Graphs on dashboards are loading now and we'll soon confirm if there's any other underlying issue. |
Investigating | 2016-11-03 12:27:00 | We're investigating a slowdown in loading dashboard graphs. |
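One common safeguard against a server still under maintenance winning an election is to make it temporarily ineligible to become primary. The sketch below does this by setting the member's priority to 0 via a replica set reconfig; the hostnames are hypothetical and this is not necessarily the procedure Server Density adopted.

```python
# Illustrative: make a replica set member ineligible for primary during maintenance.
from pymongo import MongoClient

client = MongoClient("mongodb://db-cluster.example:27017")  # hypothetical host

config = client.admin.command("replSetGetConfig")["config"]
for member in config["members"]:
    if member["host"] == "node-in-maintenance.example:27017":  # hypothetical name
        member["priority"] = 0  # priority 0 means the member cannot be elected primary

config["version"] += 1  # every reconfig must increment the config version
client.admin.command("replSetReconfig", config)
```

Reverting is the same reconfig with the original priority once the maintenance completes.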