Matt Jacobson
Summary: Apple has a bug in its disk I/O throttling code that affects raw disk copies. I walk through my diagnosis.
This week, I was restoring a slow-as-molasses 2.5-inch HDD to an SSD, as part of an upgrade for my mom's work machine.
For whatever reason, the USB SATA controller I was using seems to limit me to 15 MB/s and 4 kB transfers. Not ideal when I'm transferring 500 gigs, but oh well. I can throw it in the background and come back to it later.
So, after triple-checking for any particularly day-ruining typos, I fired off:
# dd if=/dev/disk2 of=/Volumes/ExternalHDD/mom.img bs=1m
and went off to do other things on my machine.
Over the next few minutes, I started noticing my machine behaving strangely. Apps became sluggish; eventually, some practically ground to a halt. Some apps showed the spinning pinwheel wait cursor, while others stopped updating but remained responsive to clicks. Some apps displayed nonsensical error alerts. Other apps were ostensibly unaffected and continued working fine. At one point (and very much to my frustration), even Terminal stopped responding.
Initial triage
I started by checking the obvious stuff. Memory usage was normal—dd
was correctly only using a one meg buffer. CPU utilization was high, but not outrageously so. dd
was reading from one external disk and writing to another, so there wasn't any contention for the main disk.
It was time to break out spindump
to see where things were wedged. spindump
is a super-versatile whole-system callstack profiling tool that comes with macOS; it's what macOS uses to generate those "You forced Safari to quit." diagnostic reports. Unlike sample
, it can profile more than one process, and it captures the kernel backtrace, too.
I profiled various sluggish-feeling operations. It was easy to spot a pattern: lots of threads blocked in throttle_lowpri_io
—specifically, threads whose progress was required (directly or indirectly) for the app to make progress. Here's one particularly reproducible example: FaceTime hangs on launch as its main thread tries to call an Objective-C +initialize
method; at the same time, a secondary thread, holding the +initialize
lock, is stuck in throttle_lowpri_io
:
Process: FaceTime [38810]
Thread 0xabb6c DispatchQueue "com.apple.main-thread"(1)
[...]
1000 _objc_msgSend_uncached (libobjc.A.dylib)
1000 lookUpImpOrForward (libobjc.A.dylib)
1000 initializeAndMaybeRelock (libobjc.A.dylib)
1000 initializeNonMetaClass (libobjc.A.dylib)
1000 WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize (libobjc.A.dylib)
1000 __psynch_cvwait (libsystem_kernel.dylib)
*1000 psynch_cvcontinue (pthread)
Thread 0xabb8f DispatchQueue "com.apple.FaceTime.FTRecentsController.serialQueue"(177)
[...]
839 lookUpImpOrForward (libobjc.A.dylib)
839 initializeAndMaybeRelock (libobjc.A.dylib)
839 initializeNonMetaClass (libobjc.A.dylib)
839 CALLING_SOME_+initialize_METHOD (libobjc.A.dylib)
839 +[SGSuggestionsService initialize] (CoreSuggestions)
839 +[NSDictionary dictionaryWithObjects:forKeys:count:] (CoreFoundation)
839 __NSDictionaryI_new (CoreFoundation)
643 __CFStringHash (CoreFoundation)
*643 hndl_alltraps (kernel)
*643 user_trap (kernel)
*643 <unsymbolicated> (kernel)
*643 throttle_lowpri_io (kernel)
*643 thread_block_reason (kernel)
In some cases, the trail of thread dependencies crosses process boundaries. spindump
usually does a good job of telling you what thread to look at next. Here's Messages hanging on launch:
Process: Messages [39192]
Thread 0xae38d DispatchQueue "com.apple.main-thread"(1)
[...]
1001 +[ABAddressBook sharedAddressBook] (AddressBookCore)
1001 _AB_Lock (AddressBookCore)
1001 _pthread_mutex_firstfit_lock_slow (libsystem_pthread.dylib)
1001 __psynch_mutexwait (libsystem_kernel.dylib)
*1001 psynch_mtxcontinue (pthread) (blocked by turnstile with priority 47 waiting for Messages [39192] thread 0xae39e)
Thread 0xae39e DispatchQueue "com.apple.root.background-qos"(6)
[...]
848 -[ABXPCACAccountStore contactsAccountsWithFetchOptions:] (AddressBookCore)
848 _CF_forwarding_prep_0 (CoreFoundation)
848 ___forwarding___ (CoreFoundation)
848 -[NSXPCConnection _sendInvocation:withProxy:] (Foundation)
848 -[NSXPCConnection _sendInvocation:orArguments:count:methodSignature:selector:withProxy:] (Foundation)
848 __NSXPCCONNECTION_IS_WAITING_FOR_A_SYNCHRONOUS_REPLY__ (Foundation)
848 xpc_connection_send_message_with_reply_sync (libxpc.dylib)
848 dispatch_mach_send_with_result_and_wait_for_reply (libdispatch.dylib)
848 _dispatch_mach_send_and_wait_for_reply (libdispatch.dylib)
848 mach_msg_trap (libsystem_kernel.dylib)
*848 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for ContactsAccountsService [38314] thread 0xae24d after 2 hops)
Process: ContactsAccountsService [38314]
Thread 0xae24d DispatchQueue "com.apple.tcc.preflight.kTCCServiceAddressBook"(66)
[...]
997 __TCCAccessRequest_block_invoke.96 (TCC)
997 tccd_send_message (TCC)
997 xpc_connection_send_message_with_reply_sync (libxpc.dylib)
997 dispatch_mach_send_with_result_and_wait_for_reply (libdispatch.dylib)
997 _dispatch_mach_send_and_wait_for_reply (libdispatch.dylib)
997 mach_msg_trap (libsystem_kernel.dylib)
*848 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for tccd [38170] thread 0xae3f6 after 2 hops)
*149 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for tccd [38170] thread 0xae0a7 after 2 hops)
Process: tccd [38170]
Thread 0xae0a7,0xae3f6 DispatchQueue "com.apple.root.default-qos.overcommit"(11)
[...]
1001 Security::MachO::dataAt (Security)
1001 pread (libsystem_kernel.dylib)
*1001 hndl_unix_scall64 (kernel)
*1001 unix_syscall64 (kernel)
*1001 throttle_lowpri_io (kernel)
*1001 thread_block_reason (kernel)
To simplify, we have this mess:
Messages main thread
↳ blocked on Messages secondary thread
↳ blocked on ContactsAccountsService
↳ blocked on tccd
On Darwin, threads block inside throttle_lowpri_io
when they're being artificially delayed to slow down their I/O operations, with the ultimate goal of optimizing the performance of higher-priority I/O. And indeed, in both of these cases (and in the other similar problems I saw), the chain of blockage ultimately leads to a thread with less-than-highest I/O priority.
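For context on how a thread ends up at a less-than-highest I/O priority in the first place: one common way is simply running at background QoS (note the com.apple.root.background-qos queue in the Messages sample above), which, among other things, demotes the thread's disk I/O to a throttled tier. Here's a contrived sketch, with a made-up file path, of a thread doing exactly that:

#include <fcntl.h>
#include <pthread.h>
#include <pthread/qos.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* Move this thread to background QoS; its disk I/O is demoted along with it. */
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);

    /* Hypothetical path; any read issued here is now throttle-able and can
     * park the thread inside throttle_lowpri_io when higher-tier I/O is active. */
    int fd = open("/tmp/some-file", O_RDONLY);
    if (fd >= 0) {
        char buf[4096];
        (void)read(fd, buf, sizeof(buf));
        close(fd);
    }
    return NULL;
}

int main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    pthread_join(thread, NULL);
    return 0;
}

If another thread then waits on a lock this one holds (directly or, as in the FaceTime and Messages cases, through a chain of dependencies), the delay propagates all the way up to the UI.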
Once you start looking, it's not hard to find dozens of examples of this just through normal usage of different apps.[1] Why so many UI commands on macOS are blocking on low-priority threads is less a technical question than one of Apple's institutional priorities. (A question that I have lots to say about, but not here.)
But I was interested in a different question. Ordinarily, I/O operations on one disk device don't throttle those on another device. This makes sense, because multiple devices can operate independently without contention. Yet in this case, all of the throttled I/O operations were to the main disk (containing the root and user data volumes), while my dd
was copying from one external disk to another. There was no possibility of contention. So why was throttling still happening?
Throttling domains
To keep track of which I/Os should be throttled, the Darwin kernel maintains what I'll call throttling domains (the source calls them struct _throttle_io_info_t
). In rough terms, each throttling domain is meant to correspond one-to-one to a disk device.
When an I/O is issued through the spec_strategy
routine, the kernel has to determine which throttling domain the operation lives in, so that the operation may either be throttled or cause throttling of lower-priority operations. The throttling domain is determined first by taking the vnode (i.e., file) the I/O is being done to and walking up to its enclosing mount_t
.
From there, the code looks at the mount's mnt_devbsdunit
property. The mnt_devbsdunit
describes the "disk number" of the device the filesystem lives on. If a filesystem is mounted from /dev/disk3
, then the mount's mnt_devbsdunit
is 3
. If the backing disk is actually a partition of a disk, then the number comes from the whole disk, not the partition; e.g., /dev/disk3s2
results in 3
.[2]
The mnt_devbsdunit
—which can range from 0 to 63[3]—determines which throttling domain is in play.
I find it useful to back up a theory with an example. One good way here is to instrument the kernel with (the recently neglected) dtrace. The following dtrace script triggers on spec_strategy
calls and logs out the vnode, mount point, and computed mnt_devbsdunit
:
fbt::spec_strategy:entry
{
    this->strategy_ap = (struct vnop_strategy_args *)arg0;
    this->bp = this->strategy_ap->a_bp;
    this->vp = this->bp->b_vp;
    this->mp = this->vp->v_mount;
    this->unit = this->mp ? this->mp->mnt_devbsdunit : 0;
    printf("strategy: vp=%p (%s), mp=%p (%s), unit=%u",
        this->vp, stringof(this->vp->v_name),
        this->mp, this->mp ? stringof(this->mp->mnt_vfsstat.f_mntonname) : "???",
        this->unit);
}
Here's an example of what that script outputs:
strategy: vp=ffffff802d78e400 (mom.img), mp=ffffff801fc65a20 (/Volumes/ExternalHDD), unit=3
This says that a process is doing I/O to read or write a file named "mom.img", which lives in the mountpoint /Volumes/ExternalHDD
. Here's what mount(8)
tells us about that mountpoint:
$ mount | fgrep /Volumes/ExternalHDD
/dev/disk3s2 on /Volumes/ExternalHDD (hfs, local, nodev, nosuid, journaled, noowners)
The 3
in disk3s2
matches the unit=3
in the dtrace output, as we expect.
But here's a second example:
strategy: vp=ffffff8029027e00 (History.db), mp=ffffff801fc67880 (/System/Volumes/Data), unit=0
$ mount | fgrep /System/Volumes/Data
/dev/disk1s1 on /System/Volumes/Data (apfs, local, journaled, nobrowse)
Here, the 1
in disk1s1
does not match the unit=0
from dtrace. Why?
Logical volume groups and mnt_throttle_mask
Apple added a logical volume manager, called CoreStorage, to Mac OS X Lion. In contrast to traditional disk partitions, in which a contiguous range of a disk device is used as a volume, CoreStorage allows a looser relationship between volumes and backing storage. For instance, a volume might use storage from multiple disk devices—witness Fusion Drive.
This complicates the mnt_devbsdunit
situation. Suppose a filesystem is mounted from volume disk2
. According to the previous rules, mnt_devbsdunit
is 2
. However, disk2
might be a CoreStorage logical volume, backed by the real disk devices disk0
and disk1
.
Moreover, CoreStorage might not be the only user of disk0
and disk1
. Suppose further that there's a second, non-CoreStorage volume on disk0
, called disk0s3
. I/Os to disk2
and disk0s3
may contend with each other. But the mnt_devbsdunit
of disk0s3
is 0
, so the two mounts will be in different throttling domains.
To solve this, enter a second mount_t
field, mnt_throttle_mask
. mnt_throttle_mask
is a 64-bit bit array. A bit is set only when I/Os to the mount may involve the correspondingly numbered disk device. For our CoreStorage logical volume disk2
, since disk0
and disk1
are included, bits 0 and 1 are set. Bit 2 is also set for the logical volume itself, so the overall mask is 0x7
.
In theory, you might imagine a system wherein a mount could reside in multiple throttling domains. Or perhaps the throttling domain decision could be pushed down so that CoreStorage could help make smart decisions about which to use for a particular I/O operation.
The implemented reality is much more mundane. mnt_devbsdunit
is set to the index of the lowest bit set in mnt_throttle_mask
. For disk2
, since bit 0 is set, mnt_devbsdunit
is 0. So disk2
and disk0s3
live in the same throttling domain (though, notably, a theoretical disk1s3
would not).
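As a quick illustration of that rule, here's a tiny user-space sketch (purely illustrative; it just mirrors the arithmetic, not the kernel's actual code) for the hypothetical disk2 volume above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical CoreStorage logical volume disk2, backed by disk0 and disk1:
     * one bit per disk device that I/O to this mount might touch. */
    uint64_t mnt_throttle_mask = (1ULL << 0)   /* backing store on disk0 */
                               | (1ULL << 1)   /* backing store on disk1 */
                               | (1ULL << 2);  /* the logical volume itself, disk2 */

    /* mnt_devbsdunit is the index of the lowest set bit (the mask is never
     * zero here): 0, so this mount lands in disk0's throttling domain. */
    unsigned int mnt_devbsdunit = (unsigned int)__builtin_ctzll(mnt_throttle_mask);

    printf("mask=0x%llx unit=%u\n",
        (unsigned long long)mnt_throttle_mask, mnt_devbsdunit);
    return 0;
}

Running it prints mask=0x7 unit=0, matching the discussion above.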
This explains what's happening with /System/Volumes/Data
above. disk1s1
is a logical volume presented by a volume manager[4], and its backing storage is on disk0
. Tweaking the dtrace script shows that mnt_throttle_mask
is 0x3
:
strategy: vp=ffffff8029027e00 (History.db), mp=ffffff801fc67880 (/System/Volumes/Data), mask=3, unit=0
devfs weirdness
Popping the stack back to the original problem, it's now interesting to look at the I/Os done by dd
through the lens of the dtrace script. This is what they look like:
strategy: vp=ffffff802c603600 (disk2), mp=ffffff801fc696e0 (/dev), mask=3f, unit=0
Notice first that spec_strategy
is being asked to do an I/O on the file /dev/disk2
, not a regular file as before. Though unusual, this makes sense: no filesystem is actually mounted from the disk2
device, and dd
is explicitly attempting to read the disk2
special file itself.
As a side effect, the mount point is deduced to be /dev/
. Again, this is unusual, but it seems to be the least unreasonable option: there is nothing mounted from disk2
itself, and /dev/
is the mount point that holds the disk2
special file.
This is where things go weird, though. dtrace reports the mnt_throttle_mask
of /dev/
to be 0x3f
. In other words, /dev/
claims to be similar to a logical volume made up of exactly these disks: disk0
, disk1
, disk2
, disk3
, disk4
, and disk5
.
Never mind that there are no disk4
or disk5
attached to my system. What on earth would this even mean? /dev/
is the mount point of a synthetic filesystem (devfs
) of special files. Sure—some of those special files do indeed correspond to disk devices. But others correspond to non-disk devices. Still others, like /dev/null
, are completely fabricated by software.
And, perhaps most curiously, why does the list stop at disk5
?
This question is perhaps best answered by imagining what a reasonable value of mnt_devbsdunit
would be for devfs. In an ideal world, perhaps each vnode in devfs might be assigned a throttling domain independently, such that /dev/disk0
lived in the 0
domain, /dev/disk1
in the 1
domain, etc.
Unfortunately, the reality of the design allows us to assign only a single mnt_devbsdunit
for all of devfs. So a reasonable, if far from ideal, solution is to assign 63
—a value that will put devfs in its own throttling domain, as long as fewer than 64 disk devices are attached.
The bug
Assigning 63
is in fact what the code did, prior to Lion. When a mount is created, a mount backed by a device vnode is assigned the BSD unit number of that vnode:
if (device_vnode != NULL) {
    VNOP_IOCTL(device_vnode, DKIOCGETBSDUNIT, (caddr_t)&mp->mnt_devbsdunit, 0, NULL);
    mp->mnt_devbsdunit %= LOWPRI_MAX_NUM_DEV /* 64 */;
}
All others, like devfs, are assigned a backstop value:
mp->mnt_devbsdunit = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;
Unfortunately, when mnt_throttle_mask
was introduced in Lion, the backstop value was changed to:
mp->mnt_throttle_mask = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;
mp->mnt_devbsdunit = 0;
This seems wrong! First, remember that mnt_throttle_mask
is a 64-bit bit array, whereas mnt_devbsdunit
is an ordinal. There are various backstop values that might make sense for mnt_throttle_mask
: 0
(all bits cleared) or ~0
(all bits set) are two obvious candidates.
But LOWPRI_MAX_NUM_DEV - 1
—in other words, 63
or 0x3f
—is not one of them, and it's pretty clear that the value was incorrectly copied from the old fallback initialization of mnt_devbsdunit
.
Second, and more importantly, the backstop value of mnt_devbsdunit
switched from 63
to 0
. 0
is indeed the "correct" value corresponding to a mask of 0x3f
, but such a change puts all of devfs in the same throttling domain as disk0
.
To preserve the old behavior, I'd suggest these backstop values instead be: 1ULL << 63
for the mask, and—correspondingly—the old value of 63
for mnt_devbsdunit
.
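Concretely, the backstop initialization would then look something like this (my sketch of the suggested change, not code from xnu):

mp->mnt_throttle_mask = 1ULL << (LOWPRI_MAX_NUM_DEV /* 64 */ - 1);  /* only bit 63 set */
mp->mnt_devbsdunit = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;               /* i.e., 63, as before Lion */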
A few workarounds
Use the character device
Each disk device is represented with two special files: a block device (e.g., /dev/disk0
, like the ones I've been using above) and a character device (e.g., /dev/rdisk0
). I/O operations to the block device go through the buffer cache and therefore through spec_strategy
, as outlined above. I/O operations on the character device, however, bypass the buffer cache and spec_strategy
entirely.
Notably, though, I/Os on the character device don't bypass throttling. For I/O on a character device, there's special code to determine the correct throttling domain. But since no mount points are involved here, the throttling domain is determined based on the vnode alone. This is pretty much exactly what we want (and what we couldn't do when we had to assign a single throttling domain to all of devfs).
For some use cases, the fact that the character device bypasses the buffer cache could be a problem. Otherwise, this seems like the optimal solution.[5]
Use IOPOL_PASSIVE
In addition to assigning a priority tier to its I/O operations, a process may mark its I/O as passive; passive I/O may be throttled but doesn't cause throttling of other I/Os.
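In code, this is a single call to setiopolicy_np(3); a minimal sketch, assuming you control the source of the program doing the bulk I/O:

#include <sys/resource.h>
#include <err.h>

int main(void)
{
    /* Mark all of this process's disk I/O as passive: it can still be
     * throttled, but it won't cause other I/O to be throttled. */
    if (setiopolicy_np(IOPOL_TYPE_DISK, IOPOL_SCOPE_PROCESS, IOPOL_PASSIVE) != 0)
        err(1, "setiopolicy_np");

    /* ... do the bulk copy here ... */
    return 0;
}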
Recompiling dd
to call setiopolicy_np(3)
would be a hassle. An easier way is to use the taskpolicy(8)
modifier utility that comes with recent versions of macOS. Though not documented in the manpage, the -d
option can take the argument passive
, like:
# taskpolicy -d passive dd if=...
Turn off throttling temporarily
There are a bunch of sysctls available to tune the behavior of the I/O throttling system, including one to shut it off entirely:
# sysctl debug | fgrep lowpri_throttle
debug.lowpri_throttle_max_iosize: 131072
debug.lowpri_throttle_tier1_window_msecs: 25
debug.lowpri_throttle_tier2_window_msecs: 100
debug.lowpri_throttle_tier3_window_msecs: 500
debug.lowpri_throttle_tier1_io_period_msecs: 40
debug.lowpri_throttle_tier2_io_period_msecs: 85
debug.lowpri_throttle_tier3_io_period_msecs: 200
debug.lowpri_throttle_tier1_io_period_ssd_msecs: 5
debug.lowpri_throttle_tier2_io_period_ssd_msecs: 15
debug.lowpri_throttle_tier3_io_period_ssd_msecs: 25
debug.lowpri_throttle_enabled: 1
# sysctl -w debug.lowpri_throttle_enabled=0
debug.lowpri_throttle_enabled: 1 -> 0
1. Amusingly, even spindump's symbolication step—which relies on an external daemon—suffers from this kind of problem—one which I diagnosed, of course, with another well-timed spindump. ↩︎
2. To be more concrete, the number comes from the DKIOCGETBSDUNIT ioctl, implemented for disk devices by IOMediaBSDClient; the value it returns comes from IOMediaBSDClient::createNodes(). ↩︎
3. Technically, from 0 to LOWPRI_MAX_NUM_DEV - 1. Devices with BSD unit numbers of LOWPRI_MAX_NUM_DEV or greater get mapped down using the mod operator, so /dev/disk0 and /dev/disk64 share a throttling domain. This probably doesn't come up in practice, even if it is a little funky. ↩︎
4. The APFS volume manager—which has supplanted CoreStorage—in this case. ↩︎
5. Incidentally, this appears to be how the Disk Utility app avoids the problem, too. ↩︎