Update failures Jan 22nd, 2013

At some point during the night we started getting reports that European players were unable to get the latest version of the game. The update process fetches all of the updated content through a series of small HTTP requests (like loading a web page full of images). We host the origin server with all the content ourselves and use a Content Delivery Network (CDN) service to distribute these files around the world. Our CDN has many data centers, so when you download content you get it from a nearby edge server that can deliver it much faster than we can from our office.

As logs came in from frustrated players it became clear that one or more of the edge servers in Europe were malfunctioning. We aren’t sure what happened: at this point our CDN’s status logs suggest that everything was fine, but clearly something was wrong. We looked at a number of logs sent in to the support desk and they all showed the same pattern: some content loaded fine, then some content would time out and the update process would fail.

This morning we leased a VPN service that allowed us to simulate running the updater from around the world. We quickly tried updating from London, Helsinki, Luxembourg, Kiev, Singapore, Jerusalem and Warsaw. At first everything seemed to be working, but then we noticed intermittent timeouts that would cause the update process to halt and restart. This led us to discover the first problem with the system: if the content server stalled, we would give up on the update without giving the server a second chance.

The other thing we discovered was that our CDN actually has 5 to 10 redundant servers in each region. The way we had the updater set up, it would be assigned one of those servers at random and use it exclusively. If you were unlucky and got an overloaded server, you would time out, and even if you retried you would get the same slow server again.

The build that is deployed right now addresses both problems. When we fetch content, we balance our requests across all servers in the region. If a server times out, we switch to another server and retry; only after every server in the region has timed out will we fail and start the process all over.
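
For the curious, the retry behaviour described above can be sketched roughly as follows in Python. This is an illustration only, not the actual updater code: the names make_fetcher and AllServersFailed, the fetch callback and the server names are all made up for the example, and the callback is assumed to raise TimeoutError when a server stalls.

```python
import itertools
import random
from typing import Callable, Sequence


class AllServersFailed(Exception):
    """Raised when every server in the region has timed out for one file."""


def make_fetcher(servers: Sequence[str],
                 fetch: Callable[[str, str], bytes]) -> Callable[[str], bytes]:
    """Return a function that downloads one path with failover.

    `fetch(server, path)` is a hypothetical transport callback: it returns
    the file bytes, or raises TimeoutError if the server stalls.
    """
    # Start at a random server, then move the starting point forward by one
    # for every request so the load is spread across the whole region.
    start = itertools.count(random.randrange(len(servers)))

    def get(path: str) -> bytes:
        first = next(start) % len(servers)
        # Try each server in the region once, beginning with `first`.
        for offset in range(len(servers)):
            server = servers[(first + offset) % len(servers)]
            try:
                return fetch(server, path)
            except TimeoutError:
                continue  # this server stalled; give the next one a chance
        # Only reached after every server in the region has timed out.
        raise AllServersFailed(path)

    return get
```

A caller would build one fetcher per region and call it once per file. If AllServersFailed is raised, that corresponds to the case where the whole update fails and starts over.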

We will keep improving the system, and now that we have VPN sites around the world to test with, we should be able to iron out the kinks without causing you guys so much grief!

Thank you for your patience as we learn how to scale Warframe up to meet the demand.

