For some years now we have been led to believe that The Cloud gives us a robust solution for providing software services (including GIS), one which avoids the dangers of being dependent on individual servers, with their risks of hardware failure, loss of power or cooling, and other points-of-failure. This solution has become increasingly popular, with many organisations and services now dependent on it. In theory, the Cloud spreads the risk over thousands of individual servers, physically located in different data centres at different sites dispersed geographically across different countries and indeed continents.
Or that’s the theory. The 28th of February saw a failure which brought down the US-EAST node of Amazon’s S3 service and caused chaos across the web. Amazon Web Services (AWS) has grown from an infrastructure built to support Amazon’s own online shopping business to become the largest of the cloud-hosting companies, underpinning around 150,000 websites, services and smartphone apps around the world, used by millions of people daily. Amazon didn’t invent cloud computing, but they did commercialise it effectively and make it affordable. The US-EAST node is distributed across several large and anonymous warehouse-like buildings in Northern Virginia. Disruption affected notable GIS services such as ArcGIS Online, several OpenStreetMap providers and Autodesk’s cloud, through to well-known sites such as Netflix, Spotify, Instagram and IMDb, and even the Nest applications used by many to run their central heating and home security.
How many businesses and government applications are now dependent on maps and interactive services published through ArcGIS Online? ESRI are certainly concerned, having issued a rare global email to ArcGIS users explaining the situation and delaying service updates until they have taken “great care in insuring all services, maps and apps are working as they should”.
Amazon haven’t said exactly what went wrong; a software problem seems more likely than hardware. But the real problem seems to be that programmers have not taken the time to properly use the services which the Cloud provides to ensure reliability. Developers are supposed to spread their applications over different servers in different data centres, so that applications are resilient to localised outages. But this process is expensive and distributed programming is hard, so developers have fallen back into old, bad habits. They have relied on the Cloud only to scale the amount of processing available, but have programmed their applications for a single node. In Edinburgh, we have been promoting the use of parallel and distributed processing since the early 1990s – but such applications are still not well developed, especially within GIS.
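The multi-region discipline described above can be sketched in a few lines: try the primary region first, and fall back to replicas elsewhere when it fails. This is a minimal, vendor-neutral sketch – the region names and fetch functions are hypothetical stand-ins, not any real cloud provider’s API.

```python
# Minimal sketch of multi-region failover: try each region's endpoint in
# turn and return the first successful response. Region names and fetch
# callables are hypothetical placeholders for real service clients.

def fetch_with_failover(fetchers):
    """fetchers: ordered list of (region_name, zero-argument callable).

    Returns (region_name, result) from the first callable that succeeds;
    raises RuntimeError only if every region fails.
    """
    errors = []
    for region, fetch in fetchers:
        try:
            return region, fetch()
        except Exception as exc:  # a real client would catch specific errors
            errors.append((region, exc))
    raise RuntimeError(f"all regions failed: {errors}")

# Example: the primary region is down, so the replica serves the request.
def primary():
    raise ConnectionError("us-east unavailable")

def secondary():
    return "map tiles from eu-west replica"

region, data = fetch_with_failover([("us-east", primary),
                                    ("eu-west", secondary)])
```

An application written this way degrades gracefully when one data centre goes dark, at the cost of maintaining replicas in a second region – which is exactly the expense many developers have chosen to avoid.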
The Amazon outage lasted over four hours; in the initial stages, Amazon themselves weren’t able to use their service health dashboard because it too is hosted on AWS.
This incident will give pause for thought. Amazon need to review the dependencies between their nodes, and it is also reported that US-EAST is the most fragile component of the AWS cloud: it is old, running on ageing equipment in second-hand buildings. Reflection is also needed by developers who thought they were taking advantage of a highly distributed, resilient infrastructure but have found their businesses held hostage to a new and unexpected point-of-failure.