Set up a Performance Test – Office Web Apps performance issue

The server specifications are the published numbers at: http://technet.microsoft.com/en-us/library/ff431682.aspx

Here are the test settings:

We start with 10 users and step up by 10 additional users every 5 minutes, with 30-second wait times.
With this setup we begin at 1 request every 3 seconds, then go up to 3-4 requests/sec at about 25 minutes. The first couple of requests always fail, even after warm-up. (Warm-up, in our terms, is visiting some sites and successfully viewing PPTs in the browser.) Stepping up to 20 users, we start getting "Either network connectivity has been lost or the server is too busy to service your request. Check your network connection and try the request later." errors. Soon after that we start getting: "The service is temporarily unavailable. Please try again later."
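
To make the step pattern concrete, here is a minimal sketch of the user ramp described above. The function name users_at and the assumption that each step lands exactly on a 5-minute boundary are mine, not settings taken from the load test tool.

```python
def users_at(minute, initial=10, step=10, step_minutes=5):
    """Simulated users active at a given minute of the step load pattern.

    Assumes (hypothetically) that each step happens exactly on the
    5-minute boundary; the real load tool may ramp differently.
    """
    return initial + step * (minute // step_minutes)


# Print the user count at the start of each 5-minute step of a 30-minute run.
for minute in range(0, 30, 5):
    print(f"minute {minute:2d}: {users_at(minute)} users")
```

Under these assumptions, the step to 20 users (where the first errors show up) happens at minute 5, and the run reaches 60 users by minute 25.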

In 30 minutes, we are sending a total of 5000 PPT viewer requests.

The pptx docs are not really large, between 250 KB and 1 MB, but the resultant cache files are between 2 MB and 6 MB.

Using the maximum cached size for the calculation, 5K docs x 6 MB = 30 GB of cache files are needed in 30 minutes.

That comes to about 1 GB/minute, or 17 MB/sec. The content database sizes for the cache sites and the disk IO monitoring figures confirm these numbers.

I first isolated the PPT app and the Word app into their own app pool, for monitoring purposes. We smiled after observing a big improvement, but our test results could not be interpreted as good, so that smile vanished quickly.
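
The cache throughput estimate above can be checked with a few lines of arithmetic. This is only a worst-case sketch: it assumes every request produces a new 6 MB cache file, uses decimal units (1 GB = 1000 MB), and the variable names are mine. The last figure compares the write rate against the 25 GB cache size configured in this environment.

```python
DOCS = 5000           # PPT viewer requests sent during the test
MAX_CACHE_MB = 6      # largest cache file observed per document
WINDOW_MIN = 30       # test duration in minutes
CACHE_LIMIT_GB = 25   # configured cache size in this environment

total_mb = DOCS * MAX_CACHE_MB               # total cache data written
total_gb = total_mb / 1000                   # decimal GB
gb_per_min = total_gb / WINDOW_MIN           # sustained write rate per minute
mb_per_sec = total_mb / (WINDOW_MIN * 60)    # sustained write rate per second

# Worst case: time until the configured cache size is reached.
minutes_to_fill = CACHE_LIMIT_GB / gb_per_min

print(f"{total_gb:.0f} GB total, {gb_per_min:.1f} GB/min, {mb_per_sec:.1f} MB/sec")
print(f"cache limit reached after about {minutes_to_fill:.0f} minutes")
```

If the worst-case assumption held, the configured cache size would be reached before the 30-minute test ends, which may be worth keeping in mind when reading the results.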

Considering the temp files on the disk, I closely monitored the disk IO on the app server and on SQL. No disk queues are observed, and the disk write values are below critical thresholds, nowhere close to 5 milliseconds. Disk transfers/sec validates the findings: this is some serious IO, but I cannot collect any data that proves there is latency on the disks. They seem to keep up fine.

I changed the recovery model for the cache databases to simple and fixed the autogrowth, so they are now pre-sized. No improvement. Then I played with the two settings for viewing, with no luck: Workerprocesscount (currently 10) and MaxConversionsperworker (currently 5).
They were 3 and 5 respectively at the beginning. I changed Workerprocesscount to 24 and MaxConversionsperworker to 10, which did not change much at all. Cache size is 25 GB, not the default 100 GB.

Additionally, I observe that after I stop the load test, temp files are still being generated in the C:\Windows\Temp\powerpointcache\6f5bbecd-8d38-4e51-a78c-493cb88b2bde\viewing\ folder for a while.

I have available processor time and memory on both SQL and the app server. No errors in the event logs.
I bumped up Office Web Apps monitoring, but there is nothing striking in the ULS. The WFE is sitting idle.
Disk IO does not seem to be at the red line according to the PAL report. Since I saw high CPU on some of the SQL processors during the test, I changed MaxDOP to 1 as well, with no improvement or degradation observed.
The only interesting thing I see is VERY high context switches per second on SQL. I still think a disk IO problem may be hiding behind the counters, so if you have any tips and tricks there, please share. I am doing the standard monitoring for disk IO and using PAL to analyze the results.

One guy had a case with Unisys where a 2010 crawl just would not complete because SQL would just stop responding. They had MaxDOP set to 0, and what we would see in SQL Activity Monitor was about a million threads all trying to run at once, but nothing actually running. We set MaxDOP down to 4 (because I didn’t know about the recommendation of 1 then) and it fixed the problem. Completely. Amazingly. Astoundingly. The Search system had been completely unreliable before making the change, but from that point forward, Search never once reproduced any of the various problems it was having, which included components going offline randomly and not being able to change the topology reliably.
So, for those service app databases, I would highly recommend NOT having it set to 0, but as I said, I’m going with the recommendation of 1, regardless of my experience.

Not sure how the guy fixed it, but it is a nice test case.
