
PerformanceCollector/PerformanceEngineer-in-a-Box

This utility captures a series of performance metrics and logs them to various tables. For more information see my blog series at: http://www.davewentzel.com/PerformanceCollectorSeries.

Git Branches

  • master: this code works through SQL 2016
  • SQLAzure: the same code as master, except it uses neither SQL Agent nor any DMVs that are unavailable in Azure.
    More on this below.

How to Install Performance Collector

  1. Ensure SQL Agent is running. If it isn't, PerformanceCollector won't work correctly. (The SQLAzure branch works differently.)
  2. Create the PerformanceCollector database on your server using your preferred file layout (a sketch follows this list). If you don't, the installer script will create it for you using the default data and log file placement.
  3. Run PerformanceCollectorInstaller.sql as sysadmin.
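
If you want control over file placement, you can pre-create the database before running the installer. A minimal sketch; the paths are placeholders, so adjust them for your environment:

```sql
-- Pre-create the database with an explicit file layout
-- (paths below are illustrative placeholders).
CREATE DATABASE PerformanceCollector
ON PRIMARY
    (NAME = N'PerformanceCollector',
     FILENAME = N'D:\SQLData\PerformanceCollector.mdf')
LOG ON
    (NAME = N'PerformanceCollector_log',
     FILENAME = N'L:\SQLLog\PerformanceCollector_log.ldf');
```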

Running Performance Collector on SQLAzure

If you want to run PC on SQL Azure you need to switch to the SQLAzure branch.

SQLAzure is a PaaS offering for SQL Server (if you are running SQL Server in a VM in Azure, known as IaaS, then continue to use the master branch code). For the most part PC works great on SQL Azure but there are some differences regarding how SQL Azure databases work:

  • There is no concept of SQL Agent (an asynchronous job scheduler). We have to use something else. This is not coded; you'll need to find your own solution. There are multiple Azure tools that accomplish scheduled tasks, but nothing works really well given the unique requirements of PC. For instance, Azure Functions is the "cron for Azure" solution, but it isn't meant to launch a task that never completes; using AF for this is VERY expensive. Azure Scheduler and Azure Automation had similar limitations as of this writing.
  • SQL "Servers" are used for logical groupings of SQL "Databases" only. You can theoretically put EVERY SQL Database on the same Server and Azure's "fabric" will manage resources for you. You declare how many DTUs (Database Throughput Units) you need for a given database, not for a server. For PC, this means that having a separate PerformanceCollector db on your SQLAzure Server is meaningless. Instead, PC installs as a schema on your existing SQLAzure db and monitors that. The side effect is that if you have 50 databases you are running 50 PerformanceCollector schemas and jobs. There is no good way around this.
  • In the future it may be worth snapping sys.dm_db_resource_stats to track approximate DTU consumption for elastic scaling of SQLAzure (see the sketch after this list).
  • I have not tested this with SQL Azure Data Warehouse (SQLDW), but it should work mostly as expected. See the Microsoft documentation topic "Monitor Your Workload Using DMVs" for SQLDW-specific monitoring.
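
A sketch of what that dm_db_resource_stats snapshot could look like. The ResourceStats table is hypothetical (it does not exist in PC today), but the DMV and its columns are real:

```sql
-- Hypothetical AddIn: snapshot Azure SQL Database resource consumption
-- into a PC-style table (ResourceStats is an illustrative name).
INSERT INTO PerformanceCollector.ResourceStats
    (CurrTime, ServerName, EndTime, AvgCpuPercent,
     AvgDataIOPercent, AvgLogWritePercent, AvgMemoryUsagePercent)
SELECT GETDATE(), @@SERVERNAME, end_time, avg_cpu_percent,
       avg_data_io_percent, avg_log_write_percent, avg_memory_usage_percent
FROM sys.dm_db_resource_stats;  -- one row per 15-second window, ~1 hour retained
```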

Caveats

PerformanceCollector has not been tested with Azure SQL Data Warehouse (SQLDW).

Making Changes to Performance Collector

PC is installed using PerformanceCollectorInstaller.sql. However, that's a big file to update, so it's easier to add new files and change existing code in the individual .sql files in the various folders. Change whatever you need to change and then run powershell .\BuildSQL.ps1, which will rebuild PerformanceCollectorInstaller.sql with your latest edits.

How Does Performance Collector Work?

Everything is installed in the PerformanceCollector schema within the PerformanceCollector database. There are 2 SQL Agent jobs that do the performance collection:

  1. PerformanceCollector.WaitAndBlockMonitor: runs in a "WHILE 1=1" loop and will be restarted every 5 mins if it is stopped. Within the loop, every 15 seconds (configurable in PerformanceCollector.Config) it collects blocking and waiting data and logs it to PerformanceCollector.BlockAndWaitMonitor.
  2. PerformanceCollector.RunAddIns: runs every 5 mins to determine which AddIns are scheduled to run. Discussed more in the next section.
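
The loop pattern in the first job looks roughly like this (a simplified sketch; the real proc reads its interval from PerformanceCollector.Config):

```sql
-- Simplified sketch of the WaitAndBlockMonitor job body.
WHILE 1 = 1
BEGIN
    -- snapshot current waits and blocking (see "Manually Snapping a Spid")
    EXEC PerformanceCollector.WaitAndBlockInspector;
    WAITFOR DELAY '00:00:15';  -- default 15-second interval
END
```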

How do the AddIns Work?

There are a series of procedures named PerformanceCollector.Addin*. PerformanceCollector.RunAddIns is called every 5 minutes from the RunAddIns job. This proc builds a cursor over every enabled entry in PerformanceCollector.Config and runs each one.

Each AddIn procedure is responsible for determining when its AddIn is actually supposed to run (every 8 hours, once a day, etc.). It determines this by querying PerformanceCollector.Config, and then calls the actual worker proc if it needs to.
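
A hypothetical skeleton of that pattern. The Config column names (AddInName, Enabled, FrequencyHours, LastRunTime) and the worker proc are illustrative, not the actual schema:

```sql
-- Illustrative AddIn shape: check the schedule, then run the worker.
CREATE PROCEDURE PerformanceCollector.AddInExample
AS
BEGIN
    DECLARE @LastRun datetime, @FrequencyHours int;

    SELECT @LastRun = LastRunTime, @FrequencyHours = FrequencyHours
    FROM PerformanceCollector.Config
    WHERE AddInName = 'Example' AND Enabled = 1;

    -- Only do the work if the configured interval has elapsed.
    IF DATEADD(HOUR, @FrequencyHours, @LastRun) <= GETDATE()
    BEGIN
        EXEC PerformanceCollector.ExampleWorker;  -- hypothetical worker proc

        UPDATE PerformanceCollector.Config
        SET LastRunTime = GETDATE()
        WHERE AddInName = 'Example';
    END
END
```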

The Most Important Object: PerformanceCollector.BlockAndWaitMonitor

This is a view that shows you any process that is waiting or blocking/blocked, and the cause. It is snapshotted from various DMVs every 15 seconds. In general, people complain about application performance by saying, "The app was slow from 1:45 to 2:00 today." In that case, start by querying this view filtered on CurrTime. You'll likely find the cause of the problem.
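
For example (the timestamps are placeholders for the reported window):

```sql
-- Investigate a reported slowdown between 13:45 and 14:00.
SELECT *
FROM PerformanceCollector.BlockAndWaitMonitor
WHERE CurrTime BETWEEN '2016-05-10 13:45' AND '2016-05-10 14:00'
ORDER BY CurrTime;
```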

Manually Snapping a Spid

You can manually run EXEC PerformanceCollector.WaitAndBlockInspector at any time and it will log current waiting and blocking data. This can be useful if you notice a query taking longer than it should. You can also pass a @spid parameter (see below) and it will show any waiting and blocking, as well as any activity, for that spid. The data is output to the screen and logged to the table.
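
Both forms, straight from the description above:

```sql
-- Snapshot current waiting/blocking activity on demand.
EXEC PerformanceCollector.WaitAndBlockInspector;

-- Focus on a single spid (100 is just an example).
EXEC PerformanceCollector.WaitAndBlockInspector @spid = 100;
```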

Other Objects

Abandoned Spids

This is disabled BY DEFAULT. The view will also show "abandoned spids." An abandoned spid is any spid with an open transaction for at least 60 seconds that is blocking something else. Here is the most important aspect: the spid is sitting on AWAITING COMMAND. That is what differentiates an abandoned spid from an ordinary blocking spid. When a spid is sitting on AWAITING COMMAND, SQL Server is waiting for the spid's next command...most likely a COMMIT TRAN.

I have never seen a case where the spid comes back and issues the COMMIT. In every case the culprit is a data access framework that thinks it is NOT in implicit_transactions mode while SQL Server thinks it is. For instance, various combinations of JDBC and WebSphere produce this behavior because JDBC does not issue sp_reset_connection commands the way .NET does.
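
The failure mode is easy to reproduce by hand (dbo.SomeTable is a stand-in for any table):

```sql
-- Behave like the misconfigured framework.
SET IMPLICIT_TRANSACTIONS ON;
UPDATE dbo.SomeTable SET SomeCol = 1;  -- implicitly opens a transaction
-- ...the session now idles on AWAITING COMMAND with the transaction
-- still open. Anything touching dbo.SomeTable blocks until a
-- COMMIT/ROLLBACK arrives, or PerformanceCollector kills the spid.
```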

An abandoned spid is first logged after 60 seconds. After it has been in the abandoned state for 5 minutes, and another spid has been blocked by it for 60 seconds, the spid is killed and logged to PerformanceCollector.SpidsKilled.

Where do I find Other Interesting Performance Metrics?

These are the tables you can query. Everything is under the PerformanceCollector schema. Each table is structured so that the first columns are CurrTime and ServerName, which aids troubleshooting. You can also aggregate this data across multiple customers/instances to do trending.

AdHocPlanCacheBloat

Every 7 hours, logs single-use ad-hoc query plans, which are a total waste of plan cache space. If the total is really big, the solution is to enable 'optimize for ad hoc workloads', which reclaims that memory and ensures ad hoc plans are not fully cached. This is an "analyze once" check: once you've enabled 'optimize for ad hoc workloads' you are done. But we keep running it in case a new framework (Ruby, maybe) is deployed that causes a radical increase in ad hoc plans. You can also use the captured text to determine where you are not doing parameterized queries correctly.
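
Enabling that option is a one-time, server-level change:

```sql
-- Enable 'optimize for ad hoc workloads' (it is an advanced option).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;
```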

BlockAndWaitMonitor

A view on top of BlockAndWaitStats that decodes a lot of the raw data. This is the main object you want to use. Generally there will be a report of "slowness" or blocking at a given time. Query this view first and it will give you a good idea as to which procs were running and what they were blocking/blocked by, waiting on, etc.

BufferStats

Gathers the buffer stats (counts and size per object per db).

Config

You can configure certain AddIns to NOT run or change their schedule here.

DeadlockSnapshot

Shows you all deadlocks that have occurred and tries to decode the XML to show the Victim and Survivor and what each was doing. Please note that there is a lag of AT MOST 8 hours between when a deadlock occurs and when it is decoded here. You can manually run the Deadlock AddIns at any time if you really need up-to-the-minute deadlock data.

IndexBlocking

Collected every 4 hours via Addin_IndexMetrics. Records the amount of time each index spends in a blocking/waiting state.

IndexesFromQueryPlansDtls

Not yet implemented. This will read the missing indexes from the query plans instead of relying on the DMVs, which seems to be far more accurate. It will list the TotalCost (somewhat nebulous) as well as the table and index columns that the optimizer believes would be beneficial.

IndexesFromQueryPlansSummary

Not yet implemented. An aggregate of the Dtls table that quickly shows the best indexes to be created.

IOStalls

Lists the IO stalls by database and file type.

MissingIndexes

Snaps the sys.dm_db_missing_index* DMVs. There is a better way to do this that looks only at cached query plans, but even so, this will surface the most egregious examples of missing indexes.
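
The underlying DMV data it snapshots looks like this (a live query, not PC's exact one):

```sql
-- Rank missing-index suggestions from the DMVs by rough impact.
SELECT d.statement AS table_name,
       d.equality_columns, d.inequality_columns, d.included_columns,
       s.user_seeks, s.avg_user_impact
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```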

MissingIndexesFromQueryPlansSummary

Not yet implemented. See notes below.

MissingIndexesFromQueryPlansDtls

Not yet implemented. See notes below.

PageLifeExpectancies

Logs the PLEs (page life expectancies) for the instance. This runs every 5 mins.
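
Because every table leads with CurrTime and ServerName, trending is a simple aggregate. A sketch, assuming the table stores the value in a column named PLE (the column name is an assumption):

```sql
-- Daily average page life expectancy per server.
SELECT ServerName,
       CAST(CurrTime AS date) AS CollectionDate,
       AVG(PLE) AS AvgPLE  -- PLE column name is assumed
FROM PerformanceCollector.PageLifeExpectancies
GROUP BY ServerName, CAST(CurrTime AS date)
ORDER BY ServerName, CollectionDate;
```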

PerfCounters

Gathers instance-level perf counters in EAV format for aggregation later.

PlanCacheMetrics

Every 7 hours (via AddIn_PlanCache), logs stored proc stats (reads/writes/cpu/duration) by Metric. By looking at procs that overlap the different metrics you can quickly determine what to focus on when you are looking for performance problems. The metrics are:

  • Top Cached SPs By Execution Count
  • Top Cached SPs By Avg Elapsed Time
  • Top Cached SPs By CPU
  • Top Cached SPs By Total Logical Reads
  • Top Cached SPs By Total Physical Reads
  • Top Cached SPs By Total Logical Writes
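
The raw numbers come from the plan cache; a live query like this shows the same kind of data (a sketch, not the AddIn's exact query):

```sql
-- Top cached procs by execution count, with per-execution averages.
SELECT TOP (25)
       OBJECT_NAME(object_id, database_id) AS proc_name,
       execution_count,
       total_worker_time   / execution_count AS avg_cpu_microsec,
       total_elapsed_time  / execution_count AS avg_elapsed_microsec,
       total_logical_reads / execution_count AS avg_logical_reads
FROM sys.dm_exec_procedure_stats
ORDER BY execution_count DESC;
```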

PlanTypeStats

Similar to AdHocPlanCacheBloat...this shows you the breakdown of adhoc vs. prepared vs. proc, etc. The USE_COUNT will show you whether parameterization is working for you.

PossibleBadIndexes

Logs cases where writes>reads, which may mean the index can be dropped. But you must understand your workload. However, if Total Reads is ZERO you can generally drop that index, unless you have a year-end process or some other edge case that needs the index only occasionally.
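
The same check can be run live against the usage-stats DMV (a sketch of the underlying pattern):

```sql
-- Indexes in the current database with more writes than reads.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_updates AS total_writes,
       s.user_seeks + s.user_scans + s.user_lookups AS total_reads
FROM sys.dm_db_index_usage_stats s
JOIN sys.indexes i
    ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_updates > s.user_seeks + s.user_scans + s.user_lookups
ORDER BY s.user_updates DESC;
```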

TempDBLogWatch

Lists transactions and query plans that are using more than 50MB of tempdb log space.

TempDBUsage

Shows spids and statements that are using more than 50MB of tempdb space.

Version

Persists the current version of PerformanceCollector that is installed.

Data Lifecycle

The PerformanceCollector database should never get much larger than 15GB. Once per day we run the Purge AddIn, which is configured to maintain 30 days of data in every PerformanceCollector table (set in the Config table). However, under extreme stress that may not be aggressive enough. In that case AddInFailSafePurge runs: if the database exceeds about 14GB, it purges from the largest tables (the tempdb loggers). Those tables hold query plans in XML format, so they can get bloated quickly if left unchecked.

Next Steps/TO DO

MissingIndexesFromQueryPlansSummary and Dtls are not yet fully implemented. I have been running a version of this outside PerformanceCollector and it displays a lot of good missing index information out of the cached query plans instead of the DMVs. I would think this would be very similar information, but it isn't. I find this to be far more valuable because I can directly see the patterns in the query plans as well as their SQL statements and I can perhaps tweak those statements instead of just going on blind faith that the missing index DMV recommendations will help me.

I have this stubbed out in the table scripts as well as PerformanceCollector.GatherMissingIndexesFromPlans.sql.