Commit f4562e12 authored by dwentzel's avatar dwentzel

initial commit

This utility captures a series of performance metrics and logs them to various tables. For more information see my blog series at:
How to Install Performance Collector
Initial deployment and upgrades are handled by the same Powershell script: Deploy-PerformanceCollector.ps1. There is one parameter: $servername. It assumes you are running as an NT acct with sysadmin permissions.
Installation Process
Start|Run|cmd.exe (Run As Administrator)
cd /d <folder that contains .ps1 file>
powershell .\Deploy-PerformanceCollector.ps1 "ServerName"
Signed Scripts Error
You may see this error when running Powershell scripts:
File Deploy-PerformanceCollector.ps1 cannot be loaded because the execution of scripts is disabled on this system.
In this case run the following and then run the .ps1 file again.
set-executionpolicy unrestricted
How Does Performance Collector Work?
Everything is installed in the PerformanceCollector schema within the PerformanceCollector database. There are 2 sqlagent jobs that do the performance collection:
1. PerformanceCollector.WaitAndBlockMonitor: runs in a "WHILE 1=1" loop and will rerun every 5 mins if it is stopped. Within the loop, every 15 seconds (configurable in PerformanceCollector.Config) it will collect and blocking and waiting data to PerformanceCollector.BlockAndWaitMonitor.
2. PerformanceCollector.RunAddIns: runs every 5 mins to determine what AddIns are scheduled to run. Discussed more in the next section.
How do the AddIns Work?
There are a series of procedures named PerformanceCollector.Addin*. PerformanceCollector.RunAddIns is tickled every 5 minutes from ServiceBroker. This proc builds a cursor over ANY entry in PerformanceCollector.Config that is enabled and runs it.
The AddIns procedures are responsible for determining when the given AddIn is actually supposed to run (every 8 hours, once a day, etc). It determines this from querying PerformanceCollector.Config. It then calls the actual worker proc if it needs to.
The Most Important Object: PerformanceCollector.BlockAndWaitMonitor
This is a view that shows you any process that is waiting or blocking/blocked and the cause. This is snapshotted from various DMVs every 15 seconds. In general people will complain about application performance by saying, "The app was slow from 1:45 to 2:00 today." In that case start by querying this view by filtering by CurrTime. You'll likely find the cause of the problem.
Manually Snapping a Spid
You can manually run EXEC PerformanceCollector.WaitAndBlockInspector at any time and it will log current waiting and blocking data. This may be useful if you notice your query is taking longer than it should. You can also pass a @spid parameter like this "PerformanceCollector.WaitAndBlockInspector @spid = 100;" and it will show you any waiting and blocking as well as any activity for the spid. This data is output to the screen and logged in the table.
Other Objects
Abandoned Spids
This view will also show "abandoned spids." An abandoned spid is any spid with an open transaction for at least 60 seconds that is blocking something else. Here is the most important aspect...the spid is sitting on AWAITING COMMAND. That is what differentiates an abandoned spid from an ordinary blocking spid. When the spid is sitting on AWAITING COMMAND that means the SQL Server is waiting for another command from the spid to do something...most likely issue a COMMIT TRAN.
I have never seen a case where the spid will come back and issue the COMMIT. In every case this is a data access framework that thinks it is NOT in implicit_transaction mode and SQL Server thinks it is. For instance, various combinations of jdbc and websphere produce this behavior because jdbc does not issue sp_reset_connection commands like .NET does.
An abandoned spid is first logged after 60 seconds. After 5 minutes of being in an abandoned state, and after another spid has been blocked by it for 60 seconds, the spid is killed. It is logged to PerformanceCollector.SpidsKilled.
Where do I find Other Interesting Performance Metrics?
These are the tables that you can query. Everything is under PerformanceCollector schema. Each table is structured so the first columns are CurrTime and ServerName. This aids in troubleshooting. You can also aggregate all of this data across multiple customers/instances to do trending.
every 7 hours logs single-use ad-hoc query plans that are a total waste of plan cache space. If this is really big then the solution is to run 'optimize for ad hoc workloads' which reclaims all of the memory and makes sure that ad hoc plans are not full cached. This is a "analyze once" script. Once you've enabled 'optimize for ad hoc' then you are done. But we run this in case a new framework (ruby maybe) is deployed which causes a radical increase in ad hoc plans. You can also use the text to determine where you are not doing parameterized queries correctly.
this is a view on top of BlockAndWaitStats that decodes a lot of data. This is the main table you want to use. Generally there sill be a report of "slowness" or blocking or whatever at a given time. Query this view first and it will give you a good idea as to which procs were running and what they were blocking/blocked by, waiting on, etc.
Gathers the buffer stats (counts and size per object per db).
You can configure certain AddIns to NOT run or change their schedule here.
shows you all deadlocks that have occurred and tries to decode the XML to show the Victim and Survivor and what they were doing. Please note that there is a lag time between when a deadlock occurs and when it is decoded here that could be AT MOST 8 hours. You can manually run the Deadlock AddIns to do the decoding at any time if you really need to see up-to-the-minute deadlock data.
collected every 4 hours via Addin_IndexMetrics. Gets the amount of time each index is spent in a blocking/waiting state.
Not yet implemented. This will read the missing indexes from the query plans instead of relying on the DMVs. This seems to be far more accurate. This will list the TotalCost (somewhat nebulous) as well as the table and index columns that the optimize believes would be beneficial.
Not yet implemented. An aggregate of the the Dtls table that quickly shows the best indexes to be created.
lists the IO stalls by database and filetype.
snaps the sys.dm_db_missing_index* DMVs. However, there is a better way to do this that looks just at cached query plans. Even so, this will show every egregious examples of missing indexes.
Not yet implemented. See notes below.
Not yet implemented. See notes below.
logs the PLEs for the instance. This runs every 5 mins.
Gathers instance-level perf counters in EAV format for aggregation later.
every 7 hours (via AddIn_PlanCache) logs stored proc stats (reads/writes/cpu/duration) by Metric. By looking at procs that overlap the different metrics you can quickly determine what you should be focus on when you are looking for performance problems.
-- Top Cached SPs By Execution Count
-- Top Cached SPs By Avg Elapsed Time
-- Top Cached SPs By CPU
-- Top Cached SPs By Total Logical Reads
-- Top Cached SPs By Total Physical Reads
-- Top Cached SPs By Total Logical Writes
similar to AdHocPlanCacheBloat...this will show you the breakdown of adhoc vs prepared vs proc, etc. the USE_COUNT will show you if parameterization is working for you.
logs cases where writes>reads. This may mean the index can be dropped. But you must understand your workload. However, if Total Reads is ZERO then you can generally drop that index. Unless you have a year-end process that needs it or some other edge case where the index is only needed occassionally.
Lists transactions and query plans that are using more than 50MB of tempdb log space.
Shows spids and statements that are using more than 50MB of tempdb space.
Persists the current version of PerformanceCollector that is installed.
Data Lifecycle
The PerformanceCollector database should never get much larger than 15GB. Once per day we run the Purge AddIn which is configured to maintain 30 days of data in every PerformanceCollector table (set in the Config table). However, under extreme stress it's possible that may not be aggressive enough. In that case AddInFailSafePurge runs. If the database is about 14GB it will purge from the largest tables (the tempdb loggers). These tables hold query plans in XML format so that can get bloated quickly if left unchecked.
Next Steps/TO DO
MissingIndexesFromQueryPlansSummary and Dtls are not yet implemented. I have been running a version of this outside PerformanceCollector and it displays a lot of good missing index information out of the cached query plans instead of the DMVs. I would think this would be very similar information, but it isn't. I find this to be far more valuable because I can directly see the patterns in the query plans as well as their SQL statements and I can perhaps tweak those statements instead of just going on blind faith that the missing index DMV recommendations will help me.
We need to create an AddIn that will gather this data every x hours (maybe 13 hours).
Summary will contain the data aggregated by dbid, statement (which is really the object), and "columns". Those will be assigned an ID. We will aggregate the count of statements and the max impact.
The ID will be referenced in Dtls which we can use to see the actual sql_text and query_plan. This will be used to determine what statements the index is intending to help.
I have this stubbed out in the table scripts as well as PerformanceCollector.GatherMissingIndexesFromPlans.sql.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment