Looking for strange

You might have things going on in your SQL server. Things you want to know that they happen. Things you don't want to happen. How can you become aware of such things?

You can catch them with a trace. I like to think of this trace as my KMDFM trace, from the song title "Looking for strange". At this stage, I don't want to capture the actual event, I just to know if any such events are going on.

I'm of course using Extended Events (XE) for this. There's a target named event_counter which does just that, it counts how many of each event has occurred. The idea is to start the trace on a morning and let it run over the day. Then by the end of the day you see how many of these unwanted things are going on in your SQL Server. The rest is up to you. You can for instance modify the trace (using the GUI in SSMS if you want) to add a trace file. Start it again and capture the actual events and take it from there. One way to see the trace data is to expand the session name and double-click on the target. Note that it doesn't auto-refreshes by default while the window is open.

So what events to pick? You only want things that are outside the normal, whatever that might be. You don't want to be bombarded with useless stuff. At the end of this article, you find my trace.

I am grateful for feedback, to know if I've missed anything. Hopefully this can evolve over time. For instance:
  • Some event shouldn't be there, it is irrelevant or it doesn't work.
  • Some event need further filtering to get rid of unwanted noise
  • I should add this event, possibly with some certain filter (please elaborate)
  • Or whatever...
Don't be afraid to customize this trace! Are there events that you don't want to capture? Just delete those rows before you create the trace! Etc. You're the boss!

The events

I'll start by listing the events, and talk a bit what they mean and why I want them in the trace. I also include the XE documentation text for each events, quoted. I omitted the package name for readability, go check the CREATE EVENT SESSION command at the end for details.

General events

Attention. This fires when a client times out. The client might do a re-try, so then end-user isn't even aware that this is happening, except that the performance sucks. The time-out period is specified at the client side, in the database API. "Indicates that a cancel operation, client-interrupt request, or broken client connection has occurred. Be aware that cancel operations can also occur as the result of implementing data access driver time-outs."

Auto_stats. If you have lots of auto update statistics going on, then you might want to do something about that. The user submitting a query when auto-update occurs will notice a halt while the update happens (unless set to async, but then the query is ran with inferior statistics). Perhaps you want to run manual update more frequently? Or remove some table/index/stats from the auto-update handling? I had to work quite heavily with filtering for this to ger rid of noise. "Occurs when index and column statistics are automatically updated. This event can be generated multiple times per statistics collection when the update is asynchronous."

Database_file_size_change. The change can be for several reasons. A grow or shrink can be automatic or manual. This should capture all. You should be aware of manual size changes, and they shouldn't be frequent. But frequent autogrow you might want to handle by manually growing the file or changing the increment. And frequent autosrink is so bad and I doubt I have to explain any further on this. Hopefully you don't have autosrink turned on for any of your databases. If this happens a lot,. consider adding a filter to only catch shrink operations and/or add a file target. "Occurs when any of the data or log files for a database changes size. Use this event to monitor changes to the size of database files."

Database_started. You probably want to know if databases are created, attached frequently. Not to mention f you have autoclose turned on for any databases and that causes a database to be started. "Database started."'

Lock_timeout_greater_than_0. If a programmer specified to get a time-out while waiting for a log, then this can fire. Why not the evant simply named "lock_timeout"? For two reasons: One is that I noticed it fires more then once for each occurrence. And the other reason is that a developer might have deliberately set a timeout of 0, to immediately get an exception, and those you probably don't want to catch. "Occurs when a request for a lock times out because of a blocking lock that is being held by a different resource. This event is identical to lock_timeout, but excludes cases in which the time-out value is 0, such as when lock probes are used. Time-outs with a duration of 0 do not necessarily indicate a blocking problem."

Long_io_detected. You might have seem messages in the log that an I/O operation took longer than 15 seconds. This event fires when that happens. I didn't include this in the performance section since 15 seconds is such a ridiculous long time for an I/O operation and something is clearly not normal if this event fires. "Long IO detected"

Lock_deadlock. Hopefully this is obvious. If you have lots of deadlocks, then you want to do something about it. Perhaps run a trace and catch the xml_deadlock_report event, and then pass this on to the developers. "Occurs when an attempt to acquire a lock is canceled for the victim of a deadlock."

Lock_escalation. When SQL server escalates a log to a higher level (table), then it first acquired lots is fine-grained locks, then the use the sledge by acquiring a table level lock. You don't want this to happen too frequently, adding indexes or other measures can reduce lock escalations. "Occurs when a fine-grain lock is converted into a coarser-grain lock, such as when a row lock is converted into an object lock. Use this event to monitor how locks are being used on the server."

Error_reported. This was suggested by Ekrem Önsoy. It is just what it says, fires when an error occurs. You might be curious if you have bunch of errors. and if you do, modify the trace and add a file target to capture those errors. I kept the filters that Ekrem syggested, but do check the message_id in sys.messages if you want to customize. Or add further filters, if needed.

I also wanted to have the plan_affecting_convert event, to capture when some covers happens in a WHERE clause so that it makes it impossible to use an index seek. But SQL server seems to include anything with a conversion here, including conversion in the column list. This makes the event useless, IMO.

Performance related events

Check the trace definitions for these. Some might happen a lot. If you aren't interested in these, then just skip them. Perhaps you can't do anything about it in the first place?

Thanks for providing feedback and suggestions for this article:
Ekrem Önsoy, suggested the addition of error_reported.
Much appreciated!


The code

Below you find the XE session. Note that you can create it using T-SQL (as below) and then use the XE GUI in SSMS to modify the trace. And then script it out again. etc...

--Stop trace if started
IF EXISTS (SELECT * FROM sys.dm_xe_sessions WHERE name = 'LookingForStrange')

--Delete trace if exists
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'LookingForStrange')

--Create trace
ADD EVENT sqlserver.attention(
  WHERE (package0.greater_than_uint64(database_id,(4))
  AND package0.equal_boolean(sqlserver.is_system,(0)))) ,
ADD EVENT sqlserver.auto_stats(
  WHERE ([database_id]>(4) AND [sqlserver].[is_system]=(0) AND [object_id]>(10000000) AND [duration]>(10))),
ADD EVENT sqlserver.database_file_size_change,
ADD EVENT sqlserver.database_started,
ADD EVENT sqlserver.lock_deadlock,
ADD EVENT sqlserver.lock_escalation,
ADD EVENT sqlserver.lock_timeout_greater_than_0,
ADD EVENT sqlserver.long_io_detected,
ADD EVENT sqlserver.error_reported(
  ACTION(sqlserver.client_app_name, sqlserver.client_hostname,
        sqlserver.database_name, sqlserver.sql_text, sqlserver.username)
  WHERE (package0.greater_than_int64(severity,(10))
  AND NOT sqlserver.like_i_sql_unicode_string(sqlserver.client_app_name,N'Microsoft SQL Server Management Studio%')
  AND package0.not_equal_int64(error_number,(596))
  AND package0.not_equal_int64(error_number,(9104))
  AND error_number<>(17830) AND error_number<>(10060))),

--Begin performance section
ADD EVENT qds.query_store_plan_forcing_failed,
ADD EVENT sqlserver.exchange_spill,
ADD EVENT sqlserver.execution_warning,
ADD EVENT sqlserver.hash_spill_details,
ADD EVENT sqlserver.hash_warning,
ADD EVENT sqlserver.optimizer_timeout,
ADD EVENT sqlserver.query_memory_grant_blocking,
ADD EVENT sqlserver.query_memory_grants,
ADD EVENT sqlserver.sort_warning,      
ADD EVENT sqlserver.window_spool_ondisk_warning
--End performance section

ADD TARGET package0.event_counter

--Start trace