At my job we have a [b]very[/b] read-heavy workload on our synchronous secondary SQL Server: not just reporting, but high-throughput, fast selects. Every once in a while a couple of procedures that are pretty high volume (50/s or so) will suddenly start blocking each other and timing out for a short period, 2-5 minutes or so, until (it appears) a new plan is generated and locked in. I have tried OPTION(USE PLAN) on the queries (as a last resort) and T-SQL optimizations to mitigate the problem as much as I could while I investigate. But I think I have finally caught what is causing this.

We had this incident occur again yesterday, and I was able to log into the server in time and use sp_WhoIsActive to see the blocking. I tried recompiles, but in the end it cleared on its own after a couple of minutes. (I don't believe the recompiles worked, because there was no plan, because there was no memory! More on that in a sec.) This time, however, I was also able to catch the current memory grants and found a ton of StatMan queries running, building temporary statistics across many tables, including the tables the procedures were querying.

There was a [b]substantial[/b] increase in IOPS across our two data disks that began at the start of the incident and continued throughout. The main data drives went from averaging about 1K IOPS to ~7K IOPS EACH! Free memory on the SQL instance collapsed at the same time, going from roughly 14 GB available to 0.

Since there was no memory available, I believe SQL Server resorted to plan cache eviction, which left these procedures unable to generate and cache a new execution plan, which is why they were blocking each other waiting on the plan.

We currently do have Auto Update Statistics on.

My question is: what are your thoughts on a solution to this problem? I've thought about updating stats on the primary more often.
Currently we update outdated statistics once per day, but updating more often seems like it could create another ticking time bomb and more recompiles (maybe appropriately so, and for the better). As far as I know there is no way to disable temporary statistics on the readable secondary, as this is an internal SQL Server optimization, presumably there because workloads can differ between the primary and secondary replicas. It would be nice if there were a way to throttle temporary statistics builds, as it looks like they ran in parallel across MANY tables at the same time, which caused the IO storm.

For now I will resort to logging occurrences to correlate any further issues, and look into updating statistics more proactively.

Any thoughts/insight is greatly appreciated! Thanks
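
For the logging, something along these lines is what I have in mind. It is only a rough sketch, not something I have tested under load. It assumes the internal statistics builds can be recognized by "StatMan" in their query text (that is how they showed up in the memory grants I caught), and that the temporary statistics on the secondary are flagged by is_temporary in sys.stats. Both queries are meant to be run on the readable secondary, the second one in the user database.

```sql
-- 1) Snapshot of memory grants that belong to statistics builds.
--    Assumption: these requests carry "StatMan" in their statement text.
SELECT
    SYSDATETIME()          AS captured_at,
    mg.session_id,
    mg.request_time,
    mg.grant_time,         -- NULL while the request is still waiting for its grant
    mg.requested_memory_kb,
    mg.granted_memory_kb,
    mg.dop,
    t.text                 AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
WHERE t.text LIKE N'%StatMan%'
  AND mg.session_id <> @@SPID;   -- leave this monitoring query out of the result

-- 2) Temporary statistics that currently exist in this database,
--    to see which tables were involved and when each statistic was last built.
SELECT
    OBJECT_SCHEMA_NAME(s.object_id)      AS schema_name,
    OBJECT_NAME(s.object_id)             AS table_name,
    s.name                               AS stats_name,
    STATS_DATE(s.object_id, s.stats_id)  AS last_updated
FROM sys.stats AS s
WHERE s.is_temporary = 1
ORDER BY last_updated DESC;
```

The idea would be to write the first result set to a table from an Agent job every few seconds, and then line the timestamps up with the IOPS and free-memory graphs the next time this happens.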