State of the TennoNet 2017


We have a saying at the office, "It's not my fault but it's my problem," and nowhere does this phrase get used more than in networking. If you've been playing Warframe for long enough you'll have seen some pretty epic battles we've fought to work around troublesome networking equipment -- like that time when we set up automatic proxy servers around the world to help people with difficult NAT setups. It's been quite an adventure and I'd like to take a moment to explain some of the changes we've made recently, what problems we encountered, what new problems we recently became aware of, and what we're going to do about them.

First let's go back to Christmas Eve: like the year before, our network was under attack and we were neglecting our families to try to keep the game online. Bourbon helped, but so did a new network that we'd been setting up. We weren't quite ready to deploy it but after 4 attacks in 24 hours we decided to risk it and flipped the switch. We watched anxiously that night and spent the better part of our holiday break watching for signs of trouble that never came -- the new network held!

5HfHvyD.png

The great thing about the new network is that it's faster: I could feel the interactions with our servers responding more quickly! I knew that on paper we'd cut a good amount of round-trip time off of most connections, especially for Europeans, but I was thrilled to notice that it actually felt better to me personally (this was particularly impressive during the heavy load of many, many Tenno playing on the holidays!)

Unfortunately not everyone was impressed. I began to hear a lot about the dreaded "Network Not Responding" (NNR) pop-up and began a lengthy investigation with many, many emails back and forth with our network partners.

YMVjjGB.jpg

Before I tell you what caused it let me explain what this warning means: whenever a request to our servers takes longer than 10 seconds to respond we show the NNR spinner to warn you that there's a problem. It could mean that our servers are being attacked, the hamsters have fallen off their wheels, or it's time for you to go kick your modem -- in any event we thought it would be good to let you know that there was something up so you know why things are taking so long.
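To make the rule above concrete, here is a minimal sketch of a "slow request" watchdog. The 10-second threshold comes from the post; everything else (the class name, the callback, the shortened demo threshold) is a hypothetical illustration and not how the game client is actually written.

```python
import threading
import time

NNR_THRESHOLD_SECONDS = 10.0  # from the post: warn when a request takes longer than 10 seconds


class SlowRequestWatch:
    """Hypothetical helper: calls on_slow() if the wrapped request is
    still pending after `threshold` seconds."""

    def __init__(self, on_slow, threshold=NNR_THRESHOLD_SECONDS):
        self._timer = threading.Timer(threshold, on_slow)
        self._timer.daemon = True

    def __enter__(self):
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # The request finished (or failed), so stop the timer if it has not fired yet.
        self._timer.cancel()
        return False


def demo():
    shown = []
    # Short threshold so the demo finishes quickly.
    with SlowRequestWatch(lambda: shown.append("NNR"), threshold=0.05):
        time.sleep(0.2)  # stands in for a slow server request
    # Fast request: the timer is cancelled long before it would fire.
    with SlowRequestWatch(lambda: shown.append("NNR"), threshold=5.0):
        pass
    return shown
```

Only the first (slow) request in `demo()` triggers the warning; the second one completes before its timer fires.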

The nice thing about NNR is that it only shows up for very specific services; you might be able to stay in your mission because the host for your squad might be unaffected; this gives you a chance to wait for the network to settle down before you extract to make sure we can save your progress. At least that was the idea until the new network started causing people to get this message for no reason and suddenly it wasn't particularly helpful any more.

It took about 3 weeks of diagnostics, conference calls, and all kinds of weird stress-tests before we figured out what was wrong. We even set up cloud servers around the world running tests that emulated how the game connects to us but nothing we did could reproduce the problem. It took diligent and patient help from a number of players, especially Guides of the Lotus, to get us logs and telemetry from people who had the problem.

It turns out that the new network, apart from being faster, was much more generous with how long it would let the game stay connected. To reduce server load the old network had to enforce a limit on how long it would wait for clients to reuse an idle connection; since the new network is much, much bigger it can happily leave connections open so that when the game needs to talk to the server it can skip a bunch of work and just start talking right away.

There's just one problem with that: some networks will rudely drop connections that they feel have been quiet for too long. Maybe it's because they think you've crashed or maybe they're just overloaded, but either way, they would decide to violate the TCP protocol and just forget all about the connection without telling anyone. If the game went to talk to our servers on one of these "zombie" connections it would take a certain amount of time to realize that it was talking to a corpse before it buried it and reconnected; while it was waiting you'd see the annoying NNR pop-up.

The interesting thing about this behavior is that no amount of VPN-testing or cloud-hosted tests would enable us to reproduce it because the data centers that operate these services tend to have good networking equipment. Nobody at the office could reproduce the problem with their home networks either: the only information we had to work with was supplied by players!

It's not our fault, but it's our problem, and once we figured out what was going on it was easy to work around: we just cut the persistent connection idle timeout back to match the old network (because they can't drop your connection if you close it first!). After making the change we anxiously waited for logs from the volunteers who had helped us and were pleased to see that almost all of the new cases they had were from times when our servers were actually having problems (like on Friday we had a 20x spike in bandwidth on the minute Baro arrived but that's a story for another time!).
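The idea behind that workaround can be sketched as a tiny connection pool that refuses to reuse a connection once it has been idle too long. This is only an illustration: the `IdlePool` and `PooledConnection` names and the 30-second value are assumptions, since the post does not say what the actual timeout is or which side enforces it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PooledConnection:
    name: str
    last_used: float  # timestamp (seconds) of the last request on this connection


class IdlePool:
    """Hypothetical pool: reuse a connection only while it is younger than
    idle_timeout, so we close it before a middlebox can silently drop it."""

    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self._conn: Optional[PooledConnection] = None
        self._opened = 0

    def acquire(self, now: float) -> PooledConnection:
        conn = self._conn
        if conn is not None and now - conn.last_used >= self.idle_timeout:
            conn = None  # idle too long: treat it as closed rather than risk a "zombie"
        if conn is None:
            self._opened += 1
            conn = PooledConnection(f"conn-{self._opened}", now)
            self._conn = conn
        conn.last_used = now
        return conn


pool = IdlePool(idle_timeout=30.0)  # 30 s is an assumed value for the demo
first = pool.acquire(now=0.0)       # opens conn-1
second = pool.acquire(now=10.0)     # idle for 10 s: reuses conn-1
third = pool.acquire(now=100.0)     # idle for 90 s: opens conn-2 instead
```

The trade-off is the one described above: a shorter timeout costs an occasional reconnect, but a fresh connection fails fast instead of waiting on a dead one.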

The trouble with asking for help with an issue like this is that even when you've fixed the main problem people keep sending you logs. In many cases we could correlate their NNR with their PC having trouble connecting to other services as well -- this is a sign of a local problem -- but it occurred to us that it might be something we could help with as well.

On Friday we added some new stats to track how stable your router's NAT is but before I show you those numbers let's start with an updated picture of the networks Tenno use to connect to Warframe. The following data was collected for PC connections on Saturday, January 28th (the day after Baro Ki'Teer arrived).

BbiIY4t.png

We've seen the number of players suffering with Strict NAT stay roughly steady over the last year and if anything, the number has increased slightly. We suspect that this is because our automatic proxy service allows people to play even when their network connection isn't as cooperative as we would like.

In terms of how Tenno networks automatically forward ports I was surprised to see that over half of the connections trust their routers to just do the right thing (or had to do it manually, we can't tell):

SQnsCf8.png

I would have expected to see more cases support the configuration protocols but perhaps this is a sign of IPv6 adoption on the rise (carrier-grade NAT doesn't support those protocols).

And finally, the graph that really blew our minds:

7axb1a7.png

Roughly one in twenty Tenno are on a network that will change their public address or their NAT port-mapping without warning! Imagine if your mobile carrier changed your phone number while you were having a call! "Sorry but you're going to have to call them back and to make it a challenge you have no idea what number to dial!"

As we dug into the numbers we found even more horrifying statistics: when your network wants to ruin your day it doesn't just do it once, it does it all the time. If you're one of these unlucky Tenno, your network port will change on average 140 times per day and your address on average 5 times!
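For anyone wondering how numbers like these could be tallied, here is a rough sketch that counts changes across a series of observed public (address, port) pairs. How the game actually samples and records this is not described in the post, so the function name, the sampling idea, and the choice to count an address change only once (even if the port changes with it) are all assumptions. The addresses are from the reserved documentation range.

```python
from typing import Iterable, Tuple

Mapping = Tuple[str, int]  # (public address, public port) as seen from outside the NAT


def count_changes(observations: Iterable[Mapping]) -> Tuple[int, int]:
    """Return (address_changes, port_changes) between consecutive observations.

    If the address changes, the sample is counted as an address change only,
    even when the port changed at the same time.
    """
    address_changes = 0
    port_changes = 0
    previous = None
    for current in observations:
        if previous is not None:
            if current[0] != previous[0]:
                address_changes += 1
            elif current[1] != previous[1]:
                port_changes += 1
        previous = current
    return address_changes, port_changes


samples = [
    ("203.0.113.7", 40000),
    ("203.0.113.7", 40000),  # stable
    ("203.0.113.7", 40001),  # port remapped
    ("203.0.113.9", 40001),  # address changed
    ("203.0.113.9", 51000),  # port remapped again
]
print(count_changes(samples))  # (1, 2)
```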

Luckily computers are pretty good at automating things and we should be able to handle some of these scenarios automatically. As time permits we'll be working on automatic workarounds for this totally crazy behavior because even though it isn't our fault, it's our problem.


Thanks for the details.  Interesting read!

16 minutes ago, [DE]Glen said:

Roughly one in twenty Tenno are on a network that will change their public address or their NAT port-mapping without warning!

Or a lot of Tenno play using their mobile phone as a hot spot. Either way that's ... appalling.


The CS student in me found this very informative. I can only imagine the hell you guys went through; part of the fun of working in this field hides behind a big wall of torture I guess, and that'd be a gross understatement.


Guys, I have avoided Warframe for almost 4 years; in fact my last comment here was at the release of the Grustrag Three. I guess a lot of folks won't even remember those days. So as you can probably imagine I was completely blown away when I was finally placed in my Liset: the way it orbits the planet or ship of my last mission, and I can see it circling on my star map. The immersion of it all is light years away from the game I came to enjoy and eventually forsook all those years ago for lack of meaningful content. I just wanted to go on record and thank the developers in particular for the cohesive storyline told by the quest chain of Stolen Dreams, Natah, The Second Dream and The War Within. It has been a much more fulfilling experience and I have naturally opened my wallet in appreciation since returning in November. I don't care for the dismal RNG on a lot of stuff, but for those few items that have several drop sources I am grateful, since it allows me to change things up and still have a shot at the object I'm after. Big plus. The fact that I've completely missed opportunities to get certain things that were either events or vaulted, and can't get them without buying them off someone else or waiting for them to cycle into Ki'Teer's stock or some future unknown event, is... more than a little disappointing... but I bide my time. All in all, very happy I came back. Game looks fanfreakin tastic. Thanks Devs.


Even though I'm not 100% sure I'm understanding this properly, that last part with the cellphone example scares me. So you're telling me that my network/modem/router can be screwing me over with something so simple?! Thanks for the info, though. I really wish I knew more about networking to fix it myself x.x

6 hours ago, [DE]Glen said:

Luckily computers are pretty good at automating things and we should be able to handle some of these scenarios automatically. As time permits we'll be working on automatic workarounds for this totally crazy behavior because even though it isn't our fault, it's our problem.

The problem is that you use almost purely UDP. If you step away from this dogma just a little bit and, let's say, open one TCP keep-alive connection, you can watch its status - there are a few mechanisms that let you know when a TCP endpoint changes its address.

You can also send more important data via a TCP tunnel, so there would be less desync with the server. Remember when a reward appears in your inventory a few minutes late? Yeah, those cases.
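As a rough illustration of the keep-alive idea in this reply, the sketch below turns on TCP keep-alive probes for a socket using Python's standard `socket` module. The function name and the timing values are made up for the example; the tuning options are platform-specific, so they are applied only where the platform exposes them. Nothing here reflects what Warframe actually does.

```python
import socket


def make_keepalive_socket(idle=60, interval=10, probes=3):
    """Create a TCP socket with keep-alive probes enabled.

    idle/interval/probes are illustrative values, not recommendations.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_KEEPALIVE asks the OS to probe an idle connection so a dead peer
    # (or a dropped NAT mapping) is eventually reported as an error.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # not defined on every platform
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # the OS rejected this knob; keep its default
    return sock


sock = make_keepalive_socket()
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keep-alive only tells you that a connection has died; the application still has to reconnect and work out its new public address by other means.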

Edited by SonicSonedit
11 hours ago, [DE]Glen said:

-snip quote

This is great news. I won't bump a support ticket I put in... but I suspected something like this: any time I left a hosted group, or hosted and then disbanded, the game would just "forget" my ports and UPnP. Glad you guys are working on this.

 

 

Edit: One question I have. Why does Warframe have UPnP and NAT-PMP settings? This is the only game I have seen... use these settings. Do console players also get these options? I know by default consoles connect via UPnP and NAT. (NOT NAT-PMP)

Edited by Krhymez
16 hours ago, [DE]Glen said:

Roughly one in twenty Tenno are on a network that will change their public address or their NAT port-mapping without warning!

Question to Legal: Could you disclose a list of ISPs that do that so we (as their customers) could kick them in the bollocks? Or would that only get you into a lot of trouble?

9 hours ago, Krhymez said:

Edit: One question I have. Why does Warframe have UPnP and NAT-PMP settings? This is the only game I have seen... use these settings. Do console players also get these options? I know by default consoles connect via UPnP and NAT. (NOT NAT-PMP)

You mean why is Warframe awesome? I'm sure I can find a gif for that.

And that's two questions.

 

30 minutes ago, [DE]Glen said:

You mean why is Warframe awesome? I'm sure I can find a gif for that.

And that's two questions.

 

Well, I did start with one question in mind. :clem:

But I would like to know: are those same options on console? Through troubleshooting this issue, I found a few threads about these settings. There was one post where it was suggested to disable both if you port forward.

Seems it was a post by you: 

From trying to "fix" issues I was having... I found that these settings were the issue. Even a few people in game that I have spoken to have or had the same issues. In your statistics NAT-PMP was used very little. If that is the case, why is it enabled by default? Does it cause any issues to have it enabled?

I find that it does.

 

This topic is now closed to further replies.