Friday, July 9, 2010

Back to Basics – Troubleshooting Slow Applications (by Chris Greer)

Slow applications, who doesn’t have them from time to time? Even networks that are in tip-top shape can have issues with application performance. Often, when we start troubleshooting a slow application, there is a push to dive straight into the SQL traffic, the SMB calls, or some other application traffic to find the root cause. We find, though, that this approach takes us too deep, too fast. It rests on assumptions about the network and the server that may send us down the wrong troubleshooting path, wasting time blaming the wrong component of the application.

When we start to troubleshoot a slow application, our first priority is to narrow down what bucket is causing the problem: the network, the client, or the server. Once we have identified where to place the blame, we can then get into the details to determine exactly what variable in that specific bucket is the root cause of the performance issue. Often, isolating the problem to one of these areas takes us back to the basics of the OSI model.

After we have successfully captured an example of the application running poorly (this is a VERY key part of the process – more articles on how to do this in the future…) we can use the rules of the transport layer to divide the OSI model in half, which helps us to place the blame on either the network or the server(s).

[Figure: OSI model]
How can we do this at a packet level? We can look for two things: 1. TCP Retransmissions and 2. TCP Connection timers.


1. TCP Retransmissions

If these are observed in the trace file between client and server, then we can initially take the blame off the server for the cause of the performance problem. TCP retransmissions most often mean that the network is dropping packets somewhere along the path. This packet loss must first be tracked down and resolved; then we can move forward with analyzing the server performance. With this one variable at the transport layer, we can cut the OSI model in half, leaving layers 5-7 alone, while focusing on the physical, data link, and network layers to see where traffic is being dropped.

How can we tell if there are retransmissions in the trace file? In Wireshark, use the Expert Info feature under the Analyze menu option. This window can be used to quickly see if there are TCP Retransmissions present in the trace file.


[Figure: Wireshark screenshot]



Note: Some client applications use TCP Keep-Alive packets to make sure that connections are not dropped between client and server. These may be interpreted by Wireshark as retransmissions. These are typically small packets that are repeated several times per minute. They are rarely a symptom of packet loss on the network.
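To make the distinction between retransmissions and keep-alives concrete, here is a minimal Python sketch of the kind of heuristic an analyzer might apply to one direction of a single TCP connection. It is an illustration only, not how Wireshark itself is implemented. The function name `classify_segments` and the `(sequence number, payload length)` tuple layout are assumptions made for this example, and the sketch ignores sequence number wraparound, SACK, and out-of-order delivery.

```python
from typing import Iterable, List, Tuple

# Hypothetical input format for this sketch:
# (TCP sequence number, payload length in bytes), one direction of one connection.
Segment = Tuple[int, int]


def classify_segments(segments: Iterable[Segment]) -> List[str]:
    """Label each segment as 'new', 'keep-alive', or 'retransmission'.

    Simplified heuristic (an assumption of this sketch):
    - keep-alive: 0 or 1 bytes of payload, sequence number one less than
      the next expected sequence number
    - retransmission: carries payload that starts below the next expected
      sequence number, i.e. data we have already seen
    """
    labels: List[str] = []
    next_seq = None  # next sequence number we expect from this sender

    for seq, length in segments:
        if next_seq is None:
            # First segment seen: nothing to compare against yet.
            labels.append("new")
            next_seq = seq + length
        elif length <= 1 and seq == next_seq - 1:
            labels.append("keep-alive")
        elif length > 0 and seq < next_seq:
            labels.append("retransmission")
            next_seq = max(next_seq, seq + length)
        else:
            labels.append("new")
            next_seq = max(next_seq, seq + length)

    return labels


if __name__ == "__main__":
    trace = [(1000, 100), (1100, 100), (1100, 100), (1200, 0), (1199, 1), (1200, 50)]
    for segment, label in zip(trace, classify_segments(trace)):
        print(segment, label)
```

Note that a genuine one-byte retransmission that happens to start one below the next expected sequence number would be labelled a keep-alive by this sketch, which is one reason to read these flags in context rather than count them blindly.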


2. TCP Connection Timers

After looking for retransmissions, we then analyze the connections between client and server to see if there is any delay at the transport layer when the handshake first happens. To do this, use the TCP Stream filter to isolate a TCP connection between client and server.


[Figure: TCP handshake]



Using the delta time column during the TCP handshake at the beginning of the connection, we can look for any delays in the connection setup time. If there are delays, these may be caused by packets being held up somewhere on the network, or by the server being slow in responding to the connection request. We may choose at this time to monitor the network between client and server, looking for excessively utilized links, or we may move our analyzer to the server to get a server-side view of the TCP connection setup. If the delay between SYN and SYN-ACK is still seen in the server-side capture, then the server is holding up the show.
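The delta-time reading described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not an analyzer feature: the function name `handshake_deltas` and the `(timestamp, flags)` tuple layout are hypothetical, and the packets are assumed to belong to one TCP stream, in capture order, with flags already reduced to the strings "SYN", "SYN-ACK", and "ACK".

```python
from typing import Iterable, Optional, Tuple

# Hypothetical input format for this sketch:
# (capture timestamp in seconds, TCP flags as a string).
Packet = Tuple[float, str]


def handshake_deltas(packets: Iterable[Packet]) -> Optional[Tuple[float, float]]:
    """Return (SYN to SYN-ACK delay, SYN-ACK to ACK delay) in seconds.

    Returns None if a complete three-way handshake is not found.
    """
    syn = syn_ack = ack = None

    for timestamp, flags in packets:
        if flags == "SYN" and syn is None:
            syn = timestamp
        elif flags == "SYN-ACK" and syn is not None and syn_ack is None:
            syn_ack = timestamp
        elif flags == "ACK" and syn_ack is not None:
            ack = timestamp
            break  # handshake complete; later packets are not needed

    if syn is None or syn_ack is None or ack is None:
        return None
    return (syn_ack - syn, ack - syn_ack)


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    stream = [(0.000, "SYN"), (0.250, "SYN-ACK"), (0.251, "ACK")]
    print(handshake_deltas(stream))
```

How to read the first value depends on where the capture was taken. In a client-side trace, the SYN to SYN-ACK delay includes the network round trip plus the server's response time. In a server-side trace, it is almost entirely the server's response time, which is why moving the analyzer to the server separates the two.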

By starting with the transport layer when troubleshooting an issue, we can quickly cut the OSI model in half. If retransmissions are observed, we should focus on layers 1-3 to see where packets are being dropped; if TCP connections are slow to set up, we can focus on the transport layer on the server. Don’t get too deep in the application before ruling out the basics.
What do we do when we have ruled out the network? We will cover this in the workshop at Interop Las Vegas 2009 (shameless bait) or in future articles on LoveMyTool!


