Cegal tech blog

Understanding Worker Thread Challenges in Always On Availability Groups on a Multi-Instance Cluster with Hundreds of Databases

Written by Leo Ismatov | Sep 17, 2024 6:58:09 AM

In this article, I want to discuss a challenge one of my clients recently faced: insufficient worker threads in an SQL Server environment and how it affected performance. The client was running two clustered physical machines with 8 SQL Server instances, all configured with Availability Groups (AG). The issue arose from the number of worker threads being too low to handle the workload effectively, particularly in the instance with the most databases.

Overview of the SQL Server Environment

The client's environment consisted of the following SQL Server instances and their associated databases:

 Instances  AG1 AG2 AG3 AG4 AG5 AG6 AG7 AG8 Total
Number of databases 27 16 18 1 28 97 2 230 419

 

The cluster servers had the following specifications:

  • Sockets: 2
  • Cores: 24
  • Logical Processors: 24
  • Memory: 768 GB

Please note that enabling hyperthreading inflates the logical processor count without providing more physical hardware resources. This can mislead SQL Server into believing that it has more CPU resources than it actually does. As a result, too many threads can be scheduled based on the inflated logical processor count, while the actual physical cores struggle to keep up.

The Error Message

The monitoring system repeatedly flagged an alert: Free thread count is too low on SQL Server instance AG8. It clearly indicated that the instance with 230 databases in AG was approaching its maximum capacity for worker threads that were essential for performing various tasks, including processing SQL queries.

How Worker Threads Are Used in Availability Groups

Worker threads in AGs are critical for synchronizing data between the primary and secondary nodes. Here is how they work:

  • Primary Node:
    • Log Capture Thread: Captures transaction log changes to replicate to secondary replicas.
    • Log Send Threads: One thread per secondary database to send these changes.
  • Secondary Node:
    • Redo Threads: One thread per database to apply the changes received from the primary node.

Additionally, an active AG reserves around 40 worker threads for its internal operations, which reduces the available pool for other tasks. These operations include managing transaction metadata, maintaining replica connections, and monitoring AG health status. In contrast, an idle AG uses zero worker threads. SQL Server will release a worker thread if it remains inactive for 15 seconds.

When analyzing worker threads in an AG, it is important to determine their maximum default capacity, the minimum required number, and their current usage

Monitoring Current Usage of Worker Threads

The following T-SQL script helps to monitor the current state of worker threads in SQL Server, including:

  • Active Threads: The number of worker threads currently in use.
  • Available Threads: The number of worker threads available for new tasks.
  • Workers Waiting for CPU: Tasks that are ready to run but waiting for CPU resources to become available.
  • Requests Waiting for Threads: Requests that are waiting to be assigned a worker thread.
  • Associated Workers: Worker threads currently handling SQL Server operations and background tasks.
  • Percent of Active Threads: The percentage of total worker threads currently in use.

Calculating Maximum Worker Threads

The default maximum worker threads on a server can be calculated using the following formula:

Max Worker Threads = 512 + ((Logical CPUs − 4) × 16)

In my client's case, the maximum default equals to 832.

512 + ((24 − 4) × 16) = 832

One way to verify this in SQL Server Management Studio (SSMS) is by running the following query:

After accounting for the 40 worker threads reserved by the AG the result would be 792:

832 −40 =792

Calculating Minimum Worker Threads Required

The minimum number of worker threads required for an AG can be determined by the following formula:

Min Worker Threads AG = D × (LCWT + (LSWT × SRC) + MHWT)

Where:

  • D: Number of databases
  • LCWT: Log Capture Worker Thread
  • LSWT: Log Send Worker Thread
  • SRC: Number of secondary replicas
  • MHWT: Message Handler Worker Thread

For example, AG8 requires a minimum of 461 worker threads:

230 × (1 + (1 × 1) + 1) = 461

After subtracting the minimum required threads for AG8, only 331 threads remain available for all other operations:

792 – 461 = 331

The Problem

While increasing the maximum worker threads might seem like a quick fix at first, it doesn't address the root cause. For instance, raising the maximum worker threads from 832 to 1832 in AG8 would simply add 1,000 more tasks to handle into the CPU's processing queue. Regardless of how much we increase the maximum worker threads, the capacity will always be constrained by the underlying hardware resources. First, we need to understand the SQL Server environment, the workload and assess the available hardware resources.

Implications for Multi-Instance Environments

In a single-instance environment with adequate workloads, 331 threads might be sufficient. However, with 8 instances running on the same machine, resource contention becomes a critical concern. If other instances also consume significant numbers of threads, the overall performance can deteriorate rapidly.

Other Operations

It is important to consider the threads required for other operations such as query executions, backups, integrity checks, index maintenance etc. These activities can quickly exhaust the available worker threads, leading to connection timeouts in SSMS and performance issues. The error message my client encountered is a symptom of an overloaded system, due to the high number of databases in AG8, multiple instances and heavy workloads.

Temporary and Long-term solutions

Although Microsoft does not impose a strict limit on the number of AGs or databases, it has only tested configurations with up to 10 AGs and 100 databases. Beyond these numbers, the system's hardware capacity and workload become the determining factors. In our case, with 8 AGs and 419 databases, the system was clearly under strain. We decided to take immediate action and implement the following temporary workaround as well as plan for a long-term solution.

Temporary workaround

  • Move some of the databases to other instances on different servers. Priority was given to the databases with heavy workloads consuming significant CPU resources, thereby degrading overall performance and slowing down synchronization between nodes.
  • Review the SQL environment to identify and remove unused databases that had been overlooked.
  • Check in with database owners and move databases that no longer require high availability to standalone instances.

Long-term solution

  • Order new VMs with the latest SQL Server versions to reorganize the SQL environment into separate clusters with fewer databases per AG. 

Conclusion

To address the issue of worker threads deficiency, it's essential to look beyond increasing the max worker thread setting. A thorough analysis of the workload, careful management of resources, and possibly even restructuring the whole SQL environment and the number of databases in each AG are necessary steps. Ultimately, the goal is to ensure that the system remains within its hardware limits while maintaining optimal performance.

I hope this was insightful 💚 Don't hesitate to reach out if you have any questions!