TCP/IP KeepAlive, Session Timeout, RPC Timeout, Exchange, Outlook and you

Update June 21st, 2016: following feedback and a (true golden) blog post by the Exchange Team, Checklist for troubleshooting Outlook connectivity in Exchange 2013 and 2016 (on-premises), I’ve updated the recommended values for the timeout settings and shortened the article overall for better reading. Do read that post in general, and for this topic, check the CAS & Load Balancer configuration paragraphs.


Hi Again,

This post spotlights networking considerations that are mostly overlooked. I’ve gathered a few of the issues that might have brought you here searching for an answer:

  • Outlook is retrieving data from the Microsoft Exchange Server
  • The connection to Microsoft Exchange is unavailable. Outlook must be online or connected to complete this action
  • Sent items are stuck in Outbox or delayed
  • Outlook freezes or gets stuck when sending a message
  • Event ID 3033 regarding Exchange Server ActiveSync complaining about the most recent heartbeat intervals used by clients
  • Other strange / weird issues “but PING works! / telnet to the port works great!” – my personal favorite

The issues or symptoms mentioned above can occur in any network environment, but they are more common in complex network setups where multiple devices protect or route network traffic. Some typical configurations:

  • Outlook Anywhere or RPC over HTTP is being used, servers are protected or published by ISA / TMG / UAG / F5 / Juniper or any other reverse proxy / publishing solutions
  • Exchange servers are located behind a firewall, router or other network device
  • Clients / Remote clients are located behind a firewall, router or other network device (just to be clear on that…)
  • Exchange servers are being load-balanced with an external physical / virtual appliance

If you’ve read this far and are disappointed because the above does not fit your issue, I’d suggest reviewing other RPC troubleshooting topics that might help: Troubleshooting Outlook RPC dialog boxes – revisited, or Outlook RPC Dialog Box Troubleshooting.

Exchange Server traditionally (2000 to 2010) used MAPI over RPC to communicate “natively”. RPC is known to be “sensitive”, and that’s why Exchange Server 2013 and beyond allow only Outlook Anywhere (RPC over HTTP) connections from clients, which in my opinion is a great change that will simplify future deployments.

Client<>Server connections in general remain active while data “flows” (mails are sent/received etc.), but when a connection is idle, a device along the path may terminate it. This is where the term KeepAlive comes in: a “dummy” packet that keeps the connection active while it is idle and no data is flowing.
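
To make the idea concrete, here is a minimal Python sketch of how an application can ask the operating system to send TCP keepalive probes on an idle connection. It is an illustration only, not how Outlook or Exchange are implemented. The function name `enable_keepalive` and the 30 minute / 60 second values are my own assumptions, picked to match the recommendation further down. The per-socket tuning options differ by platform, so the code checks for each one before using it.

```python
import socket


def enable_keepalive(sock, idle_seconds=1800, interval_seconds=60):
    """Turn on TCP keepalive for one socket and, where supported, tune it."""
    # Portable part: ask the OS to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(
            socket.SIO_KEEPALIVE_VALS,
            (1, idle_seconds * 1000, interval_seconds * 1000),
        )
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux: idle time and probe interval, both in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    # Other platforms keep their system-wide defaults.
```

Without per-socket tuning like this, the system-wide default applies, which is why the server-side registry setting discussed below matters for services you cannot change in code.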

Here’s my “how-to” suggestion:

  • Configure the RPC timeout on Exchange servers so that components which use RPC send a keep-alive signal within the time frame you expect (the value is in seconds):
    reg add "HKLM\Software\Policies\Microsoft\Windows NT\RPC" /v "MinimumConnectionTimeout" /t REG_DWORD /d 120
  • Consider lowering the server TCP/IP KeepAlive to reduce the chance of idle connections being terminated. The default is two hours; the recommended value is 30 minutes, and no less than 15 minutes. This controls how the OS TCP stack treats idle connections and could greatly improve responsiveness and scalability – http://support.microsoft.com/kb/314053/EN-US
  • Make sure that you are aware of any router, firewall or any other network device that is placed between your clients and your servers. Once you do – note their session timeout, session TTL or session ageing setting for the relevant protocol and port! (this could be tricky, so do not treat this lightly)
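
For the TCP/IP KeepAlive item above, the KB article describes a `KeepAliveTime` DWORD under `HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters`, expressed in milliseconds. Minutes-to-milliseconds slips are easy to make, so here is a small Python sketch that only builds the command text; it does not touch the registry. The helper name `keepalive_reg_command` and the 15 minute floor check are my own illustration of the advice above, not part of any Microsoft tooling.

```python
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def keepalive_reg_command(minutes):
    """Build (but do not run) a reg.exe command that sets KeepAliveTime."""
    if minutes < 15:
        raise ValueError("the post advises no less than 15 minutes")
    milliseconds = minutes * 60 * 1000  # KeepAliveTime is stored in milliseconds
    return (
        f'reg add "{TCPIP_PARAMETERS_KEY}" '
        f"/v KeepAliveTime /t REG_DWORD /d {milliseconds}"
    )


print(keepalive_reg_command(30))
```

For 30 minutes the value works out to 1800000. Review the generated line before running it on a server.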

The trick for success here is that timeout settings should not overlap one another along the client access “path”: each device should time out later than the next hop toward the server, so the server’s KeepAlive always fires first. For example, for Client > FW > Load Balancer > Server:

  • FW – TCP/IP timeout – 40 minutes
  • Load Balancer – TCP/IP timeout – 35 minutes
  • Server – TCP/IP timeout – 30 minutes
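
The ordering rule in the example above can be written down as a quick sanity check. This is a rough Python sketch under my own assumptions: the function name `check_timeout_path` is hypothetical, and the hop names and values are just the ones from the list. It walks the path from the client side to the server and flags any device whose timeout is not longer than the next hop’s.

```python
def check_timeout_path(path):
    """Return a list of problems for (name, timeout_minutes) hops ordered client -> server."""
    problems = []
    # Compare each hop with the next one closer to the server.
    for (outer_name, outer), (inner_name, inner) in zip(path, path[1:]):
        if outer <= inner:
            problems.append(
                f"{outer_name} ({outer} min) should be longer than {inner_name} ({inner} min)"
            )
    return problems


path = [("Firewall", 40), ("Load Balancer", 35), ("Server", 30)]
print(check_timeout_path(path) or "timeouts decrease toward the server")
```

If you collect the real values from your own devices into the same kind of list, an empty result means the chain is ordered the way this post suggests.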

If additional network devices are placed between the server and your clients, make sure that session timeout settings continue to be configured accordingly.
With today’s security measures, network security has become much more complex. A typical corporate network will implement many different network appliances or software based solutions to secure data, restrict access, prevent attacks and unwanted traffic.
Bottom line – don’t think you are done with network considerations just because “ping works” or an email comes with a statement like “your port is now open”.

I hope this post will benefit others as this issue was and will probably remain common with Exchange and other client / server services.

Don’t get timed out 🙂
Ilantz

Additional useful links and sources of data:

26 thoughts on “TCP/IP KeepAlive, Session Timeout, RPC Timeout, Exchange, Outlook and you”

  1. Ilantz, so far I could kiss you. I just got done publishing a range of services through UAG for a multi-tenanted network (which means I break a few support rules on a daily basis and have to work things out on my own) and I was having a great deal of trouble getting Outlook Anywhere through UAG (I do suspect something is wrong with my UAG and plan to rebuild soon) but I finally got it working and Outlook was going offline every 2-3 minutes. I had no idea what it might be. Suspected some kind of timeout but where? RPC? TCPIP? UAG? Client? .. anyway I found your article and after implementing the first two I now have stable Outlook Anywhere to add to the list of completed services. Cheers mate.

  2. Hello Ilan

    That’s a very useful post there. I am a seasoned networker and firewaller, and more often than I like, I run into similar trouble with RPC running across firewalls and sysadmins complaining that RPC sometimes works, and sometimes it throws errors.

    The “second connection” which is negotiated during the first dialogue on TCP/135 (and subsequently allowed by the firewall, thanks to RPC inspection) goes into idle mode after a while, and 3600s later, the firewall clears it from its session table (default session timeout on a lot of firewalls is 3600s), without client or server being aware of this.

    When either client or server wants to re-use the “second connection”, the firewall drops the packets because there is no match in the session table. The packets are considered “out of state”.

    When I found your post, I saw a silver line on the horizon

    Four questions:

    1. Does your suggestion from above (reg add “HKLM\Software\Policies\Microsoft\Windows NT\RPC” -v “MinimumConnectionTimeout” -t REG_DWORD -d 120) force a KeepAlive/Hello Packet for *all* RPC communication, especially for the *second* connection? There’s actually no point in having it on the initial connection, it is short-lived anyway.

    2. Would we have to set the above config option on the RPC responder (usually the server), the initiator (usually the client), or both?

    3. Does RPC communication use TCPKeepAlives by default (and we just never see them at work because of the default TCPKeepAliveInterval of 7200s?). In that case, we’d only have to reduce TCPKeepAliveInterval.

    4. If (3) does not apply – would you know a configuration option for the RPC service that forces the use of TCPKeepAlives on *all* RPC communication, especially for the “second connections”?

    If you can spare some time to give a “yes/no” style answer, I’d be quite happy.

    Thanks
    Marc

    1. Hi Marc,
      Great questions! Let’s see if I can help…

      1. As far as I know – this affects only the RPC over HTTP proxy component.
      2. On the server only – the HTTP proxy server in our case the Exchange CAS server.
      3. Yes, they should. In fact TCPKeepAlives is recommended to be around 5 minutes (300,000 ms)
      4. I believe that working the timeouts from large to small (FW>ROUTER>EXCHANGE), for example, will be successful.

      Do not give up 🙂 These issues are persistent and tend to wear people out.. stay strong.
      ilantz

      1. Hi Ilan
        Thanks for the reply – bear with me for taking so long to get back.

        That sounds really good! Especially your answer to question 3) makes me smile.

        If the secondary TCP connections actually do use TCP keepalives, then all we have to do is tweak the TCPKeepaliveTime.

        In fact, we’ve just re-observed one of our FWs starting to drop high-port connection packets again this week, and since this is a fresh green-field environment (where a small change won’t break business yet), we’ve been able to talk our Windows admins into reducing TCPkeepaliveTime to a reasonably low value for all systems in the environment.

        As soon as it’s implemented and we see some effects, we’ll let you know.

        best regards

        Marc

          1. Hi Marc, Ilan,

            Do you have any updates from after TCP KeepAlive was adjusted? Did it actually affect the RPC communication (second session) over the dynamic port range?

  3. Hello!
    I’ve upgraded my mail system from Exchange 2010 to Exchange 2013. I installed Exchange 2013 in parallel with Exchange 2010, and after that I migrated all mailboxes from Exchange 2010 to Exchange 2013. But I have a problem with Outlook. Some users say that their Outlook often hangs when they open or send mail.
    Could you help me! Thank you so much!

      1. Hi!
        I use RPC over HTTP for all users; they use Outlook 2010 and 2013 in Cached Exchange Mode. This problem only appeared after I migrated the mailboxes from EX10 to EX13.
        Help me plz!
        Thank you so much!

  4. This may have worked for me!! I have the exact same setup as Hieu. Same clients, same setup. It only started happening when we migrated from EX10 to EX13. I added the new reg keys and so far (it’s only been 10 – 15 minutes) it’s looking OK. Out of interest, what is the default RPC over HTTP timeout period? I know we have now set it to 300,000ms (5mins).

  5. Spoke too soon. Outlook just told me it was trying to connect. It comes back within 10 seconds. But it shouldn’t do it in the first place! arghhh

  6. Can this be set in the client side? Maybe change KeepAliveTime? We are running into the same issue, but we are hosted with Office365 so might be difficult to change settings on the Exchange Server.

  7. Great Article. I’ve got couple of questions

    1. I’ve changed my load balancer and Firewall time out to 4 hours. Is it worth changing the TCP/IP KeepAlive on CAS servers to 4 hours?

    1. Hi Mitesh,
      glad you found the info useful 🙂
      Well, I believe the 4 hours will be sufficient.. I’m pretty sure the CAS servers will be much lower than that.
      Please do update back with your outputs !
      ilantz

  8. Hi,

    Thanks for such informative article :).
    I have the setup of 2 CAS servers and 2 MB servers (Exchange 2013) and using the Windows load balancer to manage the traffic between the 2 CAS servers. Now whenever I am trying to connect to the load balancer via Outlook Client, while checking names, it immediately throws an error as “The action cannot be completed. The connection to Microsoft is unavailable. Outlook must be online or connected to complete this action”.

    Can you please guide me as what issue in the setup is causing this problem? Just to add, nslookup to all the servers is working fine from the Outlook machine. The problem arises only when I try to create the Outlook profile via Windows load balancer.

    Thanks once again for your support 🙂 ..Hoping to hear from you soon !!

    1. Exchange 2013 works a little differently.
      You should not configure profiles manually; always use Autodiscover to do the work.
      In a nutshell, what you should do is configure the Outlook Anywhere hostname to point to a DNS record which resolves to your NLB VIP.
      If you want to test the VIP functionality, work with the HOSTS file and redirect your client to a specific server etc..

      Read more about this in these links:

      Hope this helps
      ilantz

  9. Good article, but you should probably suggest people use higher timeouts. Probably a minimum of 30 minutes.

    This is because mobile devices leave the connection open and any timeout or keep alive makes them wake up and talk over the network thereby using battery. If you have a sufficiently large timeout it should just leave the connection active and only wake it up when there is actual data such as a new email.

  10. I’ve been dealing with a similar situation.

    Once we fixed the TCP Keep Alive Issues, we now started running into Connection Aborts at 15m and my HTTPERR Logs show lots of Connection Drops. I’m interested to know if you have run into anything around that.

    I’ve confirmed the timeouts on F5/Exchange/Network/Firewall
