Problem scenario – Customer is manually recycling one rogue app pool 3 times a day to avoid site outages. Overall performance is slow for all web apps.
Research tells me that Perfmon and DebugDiag are the tools to use to find out what is causing the two problems. Start troubleshooting by determining where the bottleneck is coming from (disk, memory, processor, or network, in that order) using a few Perfmon counters, then dig deeper with other counters and the process counters to pinpoint where the problem is coming from.
What is NP.NET or UDE
Priority questions to ask before you can even get started.
Are all web apps slow or is just one in a critical state? Answer: they are all slow.
When the system reaches failure, is it a crash or a hang? Answer: Hang
This is typically a hang: the app pool's memory allocation incrementally reaches the limit and then hangs there, with the processor at steady state or high, site performance slow and eventually not responding, while the other web apps are OK. (This could be caused by a memory leak, a logical deadlock or livelock, an infinite loop, etc.)
This is typically a crash: memory allocation and processor usage are normal, but a w3wp.exe process is killed, affecting server-side work; site transactions are affected, but the connection remains connected. IIS takes care of starting up a new process to handle further requests, so the browser will not see a disconnection for a crash. (IIS 6 isolation mode only.)
When in critical state, is the memory allocation at maximum, processor running very high, a rogue process is killed and restarted and then processor goes back down to steady state? Answer: No.
When in critical state, is the processor running at 100%, then you kill a rogue process manually, it starts up again by itself, and the processor goes back down to steady state? Answer: No.
Or is there something else happening when things become critical? Answer: No.
Bugs and Debugging
From the browser's point of view, a crash and a hang on the server may both prevent a complete HTTP response from being sent back, so the two can look similar. Both are unrecoverable events that result from some flaw or bug in the code that is executing. A bug is a logical flaw in the code. To find which bug is causing this, it is time to look at debugging tools, so that the NEXT crash can be avoided. You SHOULD NOT change any server settings to avoid crashes. Instead, set up debugging monitors and then WAIT for the next crash. Whenever the failure occurs, the debugging tools will record the event along with the cause.
Concepts – user-mode debugging and managed-code debugging. When a process runs, the OS memory manager maps its virtual addresses to the physical addresses where the data really exists. Using 32 bits gives Windows four gigabytes of virtual address space. Addresses in this range are identified with eight-digit hexadecimal numbers, referred to as DWORDs (short for double word). The full 32-bit address range consists of all addresses from 0x00000000 to 0xFFFFFFFF. Dump files (.dmp) capture these address spaces, and as part of the debugging process we need to determine what piece of code is violating the address space. We take process dumps, which are effectively a mirror of what was happening in memory.
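As a quick sanity check on the numbers above (nothing IIS-specific here, just the address arithmetic):

```python
# A 32-bit virtual address is a DWORD: 8 hexadecimal digits of 4 bits each.
ADDRESS_BITS = 8 * 4
address_space_bytes = 2 ** ADDRESS_BITS

print(hex(address_space_bytes - 1))        # highest address: 0xffffffff
print(address_space_bytes // (1024 ** 3))  # 4 (gigabytes of virtual address space)
```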
Debugging steps can be summarized as follows.
- Determine if Crash, Hang or just High Memory usage
- Check the application event log on Windows for Error Events
- Search the ULS Logs in SharePoint for errors, key in on the words unexpected, critical or ERROR.
- Isolate if WFE, App Server, Client, network/traffic, SQL
- Setup Perfmon and begin capturing activity per a pre-defined set of templates (from PAL)
- On the App Pool in IIS, clear all recycle settings
- Enable IIS recycle event logging. To set all the values to TRUE, run the following command: cscript adsutil.vbs Set w3svc/AppPools/DefaultAppPool/LogEventOnRecycle 255
- Use DebugDiag to get a crash dump (a crash dump is not easy to get)
- Use DebugDiag to capture hang dumps at a specific memory threshold, such as 1600 MB
- Analyze dump files with DebugDiag, and see if any customization is the cause of the issue
- Import the Perfmon files into PAL for further analysis
- If nothing shows, run SQL Profiler
- Use Fiddler on the client browser to look for issues
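The LogEventOnRecycle value of 255 in the checklist above works because the metabase property is a bit mask: each recycle reason corresponds to one bit, and 255 sets all eight bits. The sketch below illustrates the arithmetic; the flag names are taken from the IIS 6 metabase documentation as I recall it, so treat them as illustrative rather than authoritative:

```python
# LogEventOnRecycle is a bit mask; 255 enables logging for every recycle reason.
# Flag names below are illustrative (believed to match the IIS 6 metabase docs).
flags = ["Time", "Requests", "Schedule", "Memory",
         "IsapiUnhealthy", "OnDemand", "ConfigChange", "PrivateMemory"]

mask = 0
for bit, name in enumerate(flags):
    mask |= 1 << bit
    print(f"{name:>14}: bit {bit} -> {1 << bit}")

print("All reasons:", mask)  # 255
```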
- The following are some of the tools that will help find the issue. Many detailed steps need to be followed: isolating components of the computer system, running detailed analysis, and comparing against historical data from a known-healthy system.
- PerfMon is a performance recording product that ships on Microsoft Operating Systems. PerfMon is a powerful troubleshooting tool that is able to collect performance metrics over time, at specified intervals, and to generate a log or multiple logs that can be graphically analyzed to identify problem areas in your system performance. Understanding what metrics to collect and how to correctly configure this tool to collect them can be a challenge. If important counters are missing or incorrectly configured, the test will not help much.
- The best way to minimize errors and capture the right metrics is to use counter templates that contain pre-defined counter selections and settings. These setup files should be based on the performance measures of an existing system that is known to be operating well, and they can be customized to suit a particular situation or server configuration. For your convenience, the PAL tool, developed by the Microsoft Services PFE community, can make the templates for you. The templates are put together by experts in the field based on years of experience. See the PAL section below on how to leverage the PerfMon tool to troubleshoot your issues.
- Download and read about PAL
- PAL is a tool created by a fellow PFE to assist in configuring PerfMon and analyzing PerfMon results. Much skill and knowledge is required to effectively troubleshoot issues on complex products like Windows, SharePoint, and SQL Server, and on other pieces of the infrastructure such as hard drives, processors, and RAM. The PAL tool greatly simplifies the process and provides a complete analysis of all the counters and metrics, with graphs and descriptions of how to interpret the information.
DebugDiag 1.2 (not 1.1!) – IIS Debug Diagnostic Tool – a very rich, flexible GUI used for hangs, performance problems, leaks, high CPU, exceptions, and crashes. Install it, then right-click a process to create a full user dump. Run the CrashHangAnalysis.asp script on a new .dmp while you open the same dump in WinDbg. The SharePoint analysis script uses SharePointExt.dll, takes a long time to run, and is mainly useful for high-memory problems. See "How to use the Debug Diagnostics tool to troubleshoot a process that has stopped responding in IIS." Symbol files may be needed to interpret .dmp files (APIs, etc.). Get symbol packages here
To debug a process crash, start by creating a crash rule against the process(es) in question. DebugDiag will attach to the specific process(es) and monitor them for one or more types of exceptions, or for any custom breakpoints, that cause the process(es) to terminate unexpectedly. When the crash occurs, a full memory dump file is created in the directory specified when setting up the crash rule.
Process Hangs or Slow Performance
To debug a process hang or slow performance, use one of the following:
1. Create a performance rule. The performance rule can be based on Performance Counters or HTTP Response Times; the latter is specific to web servers or HTTP-based web services. The Performance Counters rule allows you to capture a series of consecutive user dumps when one or more performance counters exceed specified thresholds. The HTTP Response Times rule allows you to use either ETW (specific to the IIS web server) or WinHTTP (to 'ping' any type of web server or HTTP-based web service) to capture user dumps when the configured timeout is reached.
2. Create a manual memory dump series during the slow or hang state by right-clicking the process name in the processes view and choosing the “Create Dump Series” option.
Then, analyze the resulting .dmp files with CrashHangAnalysis.asp and/or PerfAnalysis.asp (see below).
Memory or Handle Usage
To debug memory and handle usage, use one of the following:
1. Create a leak rule against the process in question. The leak monitoring feature will track memory allocations inside the process. Tracking is implemented by injecting a DLL (leaktrack.dll) into the specified process and monitoring memory allocations over time. When configuring a memory and handle leak rule, you can specify memory dump generation based on time or memory usage.
2. Using the “processes” view, right-click the process in question and select the “monitor for leaks” option. When the process has grown to the suspected problem size, manually dump the process by right-clicking on the same process in the processes view and choosing the “Create Full User dump” option.
To check a high-memory issue, 3 dumps per minute is usually enough.
Analyzing Memory Dumps:
One of the most powerful features of DebugDiag is the ability to analyze memory dumps and generate a report showing the analysis, along with recommendations to resolve identified problems. DebugDiag uses "analysis scripts" to analyze memory dumps; five analysis scripts ship with DebugDiag 1.2. Scenarios worth noting include:
- x64 userdump analysis on x86 systems
- Installing x86 DebugDiag on x64 systems
- Installing DebugDiag 1.2 and 1.1 on the same system
- DebugDiag 1.2 memory leak analysis of 1.1 leaktrack data
- Analysis of x86 userdumps generated by an x64 debugger
- Microsoft Internet Information Services 6.0
- Microsoft Internet Information Services 7.0
- Microsoft Internet Information Services 7.5
(From the Microsoft public support web site.)
Ever wondered which program has a particular file or directory open? Now you can find out. Process Explorer shows you information about which handles and DLLs processes have opened or loaded.
The Process Explorer display consists of two sub-windows. The top window always shows a list of the currently active processes, including the names of their owning accounts, whereas the information displayed in the bottom window depends on the mode that Process Explorer is in: if it is in handle mode you’ll see the handles that the process selected in the top window has opened; if Process Explorer is in DLL mode you’ll see the DLLs and memory-mapped files that the process has loaded. Process Explorer also has a powerful search capability that will quickly show you which processes have particular handles opened or DLLs loaded.
The unique capabilities of Process Explorer make it useful for tracking down DLL-version problems or handle leaks, and provide insight into the way Windows and applications work. Download the tool here, from Windows Sysinternals.
Server Performance Advisor (SPA) (Win 2008 & Up)
How to identify a disk performance bottleneck using SPA – Microsoft Performance Monitor (Perfmon) can gather performance counter data and Event Tracing for Windows (ETW) data, but it requires manual intervention to do the analysis. This is where the Microsoft Server Performance Advisor (SPA) picks up: SPA collects performance data in the same manner as Performance Monitor, then analyzes the data and generates a detailed report on its findings.
Here is the disk related section of the SPA report:
- Hot Files: Files Causing Most Disk I/O: This section of the report identifies the specific files causing the most disk I/O, the process involved, and the read/write bytes and I/Os per second.
- Disk Breakdown: Disk Totals: This section of the report identifies the specific processes causing the most disk I/O on the physical disk.
Why Daily Application Pool Recycling
A performance interdependency exists between caching mechanisms, memory, and application pool recycling operations. One key to the interdependency is that although caching improves performance, Office SharePoint Server disables it if the front-end server detects low-memory conditions. These low-memory conditions occur when application pools use memory during typical tasks, including recycling. Using more than one application pool worsens the issue and interferes with page output caching.
The SharePoint Operations team addressed these interdependencies by using 64-bit hardware, and by scheduling application pool recycling for once a day during non-peak hours through a garbage collection/application recycling tool. On 32-bit hardware, the SharePoint Operations team addressed the interdependencies by setting the memory limit of the application pool to 1.4 GB and scheduling application pool recycling for once a day during non-peak hours.
Why do you need a Performance Baseline
Recommended metrics and counters for your Performance Baseline
You should regularly collect baseline performance data on your most essential servers. Doing so could save you a lot of pain and effort in proving or disproving that an issue exists. Customers often ask Microsoft support whether a certain performance statistic is "bad" or "good." Although Microsoft provides performance guidelines, it is difficult to know what is typical for your environment unless you have baseline performance data. Creating baselines also helps you focus on the problem at hand when there are other problems on the system: certain performance statistics can be falsely blamed as the culprit for a new issue, but having a baseline from before the problem occurred keeps the focus off irrelevant statistics. (From an IT Pro Magazine article.)
The most important counters and their settings to watch out for to ensure the health of your servers
CPU Counters:
Processor: % Processor Time: _Total. On the computer that is running SQL Server, this counter should be kept between 50 percent and 75 percent. In case of constant overloading, investigate whether there is abnormal process activity or if the server needs additional CPUs.
System: Processor Queue Length: (N/A). Monitor this counter to ensure that it remains below two times the number of core CPUs.
Processor: % Processor Time: sustained values in excess of 80 percent of the processor time per CPU are generally deemed to be a bottleneck.
Requests waiting with the SOS_SCHEDULER_YIELD wait type, or a high number of runnable tasks, can indicate that runnable threads are waiting to be scheduled and that there might be a CPU bottleneck on the processor. If you see high CPU utilization, you can drill through to find the queries that are consuming the most resources.
Memory Counters:
Memory: Available Mbytes: (N/A). Monitor this counter to ensure that you maintain at least 20 percent of the total physical RAM available.
Memory: Pages/sec: (N/A). Monitor this counter to ensure that it remains below 100.
For more information and memory troubleshooting methods, see SQL Server 2005 Monitoring Memory Usage (http://go.microsoft.com/fwlink/?LinkID=105585&clcid=0x409).
Disk Counters:
Monitor the following counters to ensure the health of your disks. The values below represent measurements taken over a period of time, not during a sudden spike, and not based on a single measurement.
· Logical Disk: Disk Transfers/sec. This counter provides the overall throughput on the specific disk. Use this counter to monitor growth trends and forecast appropriately.
· Logical Disk: Disk Read Bytes/sec & Disk Write Bytes/sec. This counter provides a measure of the total bandwidth for a particular disk.
· Logical Disk: Average Disk sec/Read (Read Latency). This counter indicates the time it takes the disk to retrieve data. On well-tuned I/O subsystems, ideal values are 1-5 ms for logs (ideally 1 ms on a cached array), and 4-20 ms for data (ideally below 10 ms). Higher latencies can occur in peak times, but if high values are occurring regularly, investigate the cause.
· Logical Disk: Average Disk sec/Write (Write Latency). This counter indicates the time it takes the disk to write the data. On well-tuned I/O subsystems, ideal values would be 1-5 ms for log (ideally 1 ms on a cached array), and 4-20 ms for data (ideally below 10 ms). Higher latencies can occur in peak times, but if high values are systematically occurring, investigate the cause.
· Logical Disk: Average Disk Bytes/Read. This counter indicates the size of the I/Os being read. This value may affect disk latency, and larger I/Os may result in slightly higher latency. When used to monitor SQL Server, it tells you the average size of the I/Os SQL Server is issuing.
· Logical Disk: Average Disk Bytes/Write. This counter indicates the size of the I/Os being written. This value may affect disk latency, and larger I/Os may result in slightly higher latency. When used to monitor SQL Server, it tells you the average size of the I/Os SQL Server is issuing.
· Physical Disk: % Disk Time: DataDrive. Monitor this counter to ensure that it remains below two times the number of disks.
· Logical Disk: Current Disk Queue Length. For this counter, lower values are better. Values above 20 may indicate a bottleneck in the request waiting to be served by the disk, and should be investigated. Bottlenecks can create a backlog that may spread beyond the current server accessing the disk and result in long wait times for end users. Possible solutions to a bottleneck may be to add more disks to the RAID array, replace with faster disks, or move some of the data to other disks.
· Logical Disk: Average Disk Queue Length. This counter indicates the average number of outstanding I/O requests. The general rule is that you should be at two or fewer outstanding I/O requests per spindle, but this may be difficult to measure due to storage virtualization, differences in RAID levels between configurations. Look for higher than average disk queue lengths in combination with higher than average disk latencies. This combination could indicate that the storage array cache is being over utilized or that spindle sharing with other applications is affecting performance.
· Logical Disk: Average Disk Reads/Sec and Logical Disk: Average Disk Write/Sec. These counters indicate the rate of read and write operations on the disk. Monitor these counters to ensure that they remain below 85 percent of the disk capacity. Disk access time increases exponentially if reads or writes are more than 85 percent of disk capacity. To determine the specific I/O capacity for your hardware, refer to the vendor documentation, or use the SQLIO disk subsystem benchmark tool to calculate it. For more information, see SQLIO Disk Subsystem Benchmark Tool for Windows 2003. (http://go.microsoft.com/fwlink/?LinkID=105586&clcid=0x409).
· When you are using RAID configurations with the Logical Disk: Average Disk Reads/Sec or Logical Disk: Average Disk Write/Sec counters, use the formulas listed in the following table to determine the rate of I/Os on the disk.
RAID 0:   I/Os per disk = (reads + writes) / number of disks
RAID 1:   I/Os per disk = [reads + (2 * writes)] / 2
RAID 5:   I/Os per disk = [reads + (4 * writes)] / number of disks
RAID 10:  I/Os per disk = [reads + (2 * writes)] / number of disks
For example, if you have a RAID 1 system with two physical disks, and your counters are at the values shown in the following table:
Average Disk Reads/Sec: 80
Average Disk Write/Sec: 70
Average Disk Queue Length: 5
The I/O value per disk can be calculated as follows:
(80 + (2 * 70))/2 = 110
The disk queue length can be calculated as follows:
5/2 = 2.5
In the situation above, you have a borderline I/O bottleneck.
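The worked example above can be reproduced with a few lines of code, using the per-disk I/O formulas from the table:

```python
# Per-disk I/O formulas for common RAID levels (from the table above).
def ios_per_disk(raid, reads, writes, disks):
    if raid == "RAID 0":
        return (reads + writes) / disks
    if raid == "RAID 1":
        return (reads + 2 * writes) / 2
    if raid == "RAID 5":
        return (reads + 4 * writes) / disks
    if raid == "RAID 10":
        return (reads + 2 * writes) / disks
    raise ValueError(f"unknown RAID level: {raid}")

# The RAID 1 example: 80 reads/sec, 70 writes/sec, 2 physical disks.
print(ios_per_disk("RAID 1", reads=80, writes=70, disks=2))  # 110.0 I/Os per disk
print(5 / 2)                                                 # per-disk queue length: 2.5
```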
Monitor disk latency and analyze trends. The amount of I/O and latency specific to SQL Server data files can be found by using the sys.dm_io_virtual_file_stats dynamic management view in SQL Server 2008. For more information, see sys.dm_io_virtual_file_stats. Example:
SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), 2);
Tips and tricks
If you are viewing dissimilar items, scale the counters. For similar items, change the vertical scale of the graph.
Add specific counters, not entire objects, to your capture templates, in order to save space and server load
Use binary (.blg) file types, because they are more portable and capture new instances.
Watch your sample intervals. Use the “divide by 500” rule, and remember that you can schedule the captures.
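One common reading of the "divide by 500" rule, assumed here, is: divide the expected capture duration (in seconds) by 500 to get a sample interval that yields roughly 500 samples, keeping the log file a manageable size. A sketch under that assumption:

```python
# Assumed interpretation of the "divide by 500" rule: pick a sample interval
# so that the capture yields roughly 500 samples in total.
def sample_interval_seconds(capture_duration_seconds):
    return max(1, round(capture_duration_seconds / 500))

# Examples: an 8-hour capture vs. a 30-minute capture.
print(sample_interval_seconds(8 * 3600))  # 58 seconds between samples
print(sample_interval_seconds(30 * 60))   # 4 seconds between samples
```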
If you are troubleshooting, capture locally, not over the network. Copy the .blg, and analyze it on your computer.
Start troubleshooting by determining where the bottleneck is coming from (disk, memory, processor, or network, in that order) using the main counters listed above, then dig in with the secondary counters and the process counters to pinpoint where the problem is coming from.
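The triage in the tip above can be sketched as a simple threshold check. The counter names and limits below are taken from the guideline sections earlier in this document, and the sample readings are invented for illustration; treat the limits as starting points, not hard rules:

```python
# Rough triage sketch: check counters against the guideline values given earlier
# in this document, in the suggested order (disk, then memory, then processor;
# network counters are omitted for brevity). Readings are made up for the demo.
GUIDELINES = [
    # (counter, sample reading, limit, comparison)
    ("LogicalDisk: Avg. Disk sec/Read (ms)", 18, 20, "<="),   # data disks: 4-20 ms
    ("Memory: Available MBytes (% of RAM)",  35, 20, ">="),   # keep >= 20% of RAM free
    ("Memory: Pages/sec",                    60, 100, "<="),  # keep below 100
    ("Processor: % Processor Time (_Total)", 72, 80, "<="),   # >80% sustained = bottleneck
]

for counter, reading, limit, cmp in GUIDELINES:
    ok = reading <= limit if cmp == "<=" else reading >= limit
    status = "OK" if ok else "INVESTIGATE"
    print(f"{status:11} {counter}: {reading} (limit {cmp} {limit})")
```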
Look at the time of the capture, especially if you didn’t start it. If you’re troubleshooting a problem that happens at 8am, when people log on, but you’re looking at disk counters from 2am, when a backup and virus scan skews the data, you might waste time going down the wrong path.
Reboot the server before starting a Performance Monitor log, and let the log run from reboot until the condition you're tracking occurs.