Matt Jacobson
Summary: Apple has a bug in its disk I/O throttling code that affects raw disk copies. I walk through my diagnosis.
This week, I was restoring a slow-as-molasses 2.5-inch HDD to an SSD, as part of an upgrade for my mom's work machine.
For whatever reason, the USB SATA controller I was using seems to limit me to 15 MB/s and 4 kB transfers. Not ideal when I'm transferring 500 gigs, but oh well. I can throw it in the background and come back to it later.
So, after triple-checking for any particularly day-ruining typos, I fired off:
# dd if=/dev/disk2 of=/Volumes/ExternalHDD/mom.img bs=1m
and went off to do other things on my machine.
Over the next few minutes, I started noticing my machine behaving strangely. Apps became sluggish; eventually, some practically ground to a halt. Some apps showed the spinning pinwheel wait cursor, while others stopped updating but remained responsive to clicks. Some apps displayed nonsensical error alerts. Other apps were ostensibly unaffected and continued working fine. At one point (and very much to my frustration), even Terminal stopped responding.
Initial triage
I started by checking the obvious stuff. Memory usage was normal—dd
was correctly only using a one meg buffer. CPU utilization was high, but not outrageously so. dd
was reading from one external disk and writing to another, so there wasn't any contention for the main disk.
It was time to break out spindump
to see where things were wedged. spindump
is a super-versatile whole-system callstack profiling tool that comes with macOS; it's what macOS uses to generate those "You forced Safari to quit." diagnostic reports. Unlike sample
, it can profile more than one process, and it captures the kernel backtrace, too.
I profiled various sluggish-feeling operations. It was easy to spot a pattern: lots of threads blocked in throttle_lowpri_io
—specifically, threads whose progress was required (directly or indirectly) for the app to make progress. Here's one particularly reproducible example: FaceTime hangs on launch as its main thread tries to call an Objective-C +initialize
method; at the same time, a secondary thread, holding the +initialize
lock, is stuck in throttle_lowpri_io
:
Process: FaceTime [38810]
Thread 0xabb6c DispatchQueue "com.apple.main-thread"(1)
[...]
1000 _objc_msgSend_uncached (libobjc.A.dylib)
1000 lookUpImpOrForward (libobjc.A.dylib)
1000 initializeAndMaybeRelock (libobjc.A.dylib)
1000 initializeNonMetaClass (libobjc.A.dylib)
1000 WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize (libobjc.A.dylib)
1000 __psynch_cvwait (libsystem_kernel.dylib)
*1000 psynch_cvcontinue (pthread)
Thread 0xabb8f DispatchQueue "com.apple.FaceTime.FTRecentsController.serialQueue"(177)
[...]
839 lookUpImpOrForward (libobjc.A.dylib)
839 initializeAndMaybeRelock (libobjc.A.dylib)
839 initializeNonMetaClass (libobjc.A.dylib)
839 CALLING_SOME_+initialize_METHOD (libobjc.A.dylib)
839 +[SGSuggestionsService initialize] (CoreSuggestions)
839 +[NSDictionary dictionaryWithObjects:forKeys:count:] (CoreFoundation)
839 __NSDictionaryI_new (CoreFoundation)
643 __CFStringHash (CoreFoundation)
*643 hndl_alltraps (kernel)
*643 user_trap (kernel)
*643 <unsymbolicated> (kernel)
*643 throttle_lowpri_io (kernel)
*643 thread_block_reason (kernel)
In some cases, the trail of thread dependencies crosses process boundaries. spindump
usually does a good job of telling you what thread to look at next. Here's Messages hanging on launch:
Process: Messages [39192]
Thread 0xae38d DispatchQueue "com.apple.main-thread"(1)
[...]
1001 +[ABAddressBook sharedAddressBook] (AddressBookCore)
1001 _AB_Lock (AddressBookCore)
1001 _pthread_mutex_firstfit_lock_slow (libsystem_pthread.dylib)
1001 __psynch_mutexwait (libsystem_kernel.dylib)
*1001 psynch_mtxcontinue (pthread) (blocked by turnstile with priority 47 waiting for Messages [39192] thread 0xae39e)
Thread 0xae39e DispatchQueue "com.apple.root.background-qos"(6)
[...]
848 -[ABXPCACAccountStore contactsAccountsWithFetchOptions:] (AddressBookCore)
848 _CF_forwarding_prep_0 (CoreFoundation)
848 ___forwarding___ (CoreFoundation)
848 -[NSXPCConnection _sendInvocation:withProxy:] (Foundation)
848 -[NSXPCConnection _sendInvocation:orArguments:count:methodSignature:selector:withProxy:] (Foundation)
848 __NSXPCCONNECTION_IS_WAITING_FOR_A_SYNCHRONOUS_REPLY__ (Foundation)
848 xpc_connection_send_message_with_reply_sync (libxpc.dylib)
848 dispatch_mach_send_with_result_and_wait_for_reply (libdispatch.dylib)
848 _dispatch_mach_send_and_wait_for_reply (libdispatch.dylib)
848 mach_msg_trap (libsystem_kernel.dylib)
*848 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for ContactsAccountsService [38314] thread 0xae24d after 2 hops)
Process: ContactsAccountsService [38314]
Thread 0xae24d DispatchQueue "com.apple.tcc.preflight.kTCCServiceAddressBook"(66)
[...]
997 __TCCAccessRequest_block_invoke.96 (TCC)
997 tccd_send_message (TCC)
997 xpc_connection_send_message_with_reply_sync (libxpc.dylib)
997 dispatch_mach_send_with_result_and_wait_for_reply (libdispatch.dylib)
997 _dispatch_mach_send_and_wait_for_reply (libdispatch.dylib)
997 mach_msg_trap (libsystem_kernel.dylib)
*848 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for tccd [38170] thread 0xae3f6 after 2 hops)
*149 ipc_mqueue_receive_continue (kernel) (blocked by turnstile waiting for tccd [38170] thread 0xae0a7 after 2 hops)
Process: tccd [38170]
Thread 0xae0a7,0xae3f6 DispatchQueue "com.apple.root.default-qos.overcommit"(11)
[...]
1001 Security::MachO::dataAt (Security)
1001 pread (libsystem_kernel.dylib)
*1001 hndl_unix_scall64 (kernel)
*1001 unix_syscall64 (kernel)
*1001 throttle_lowpri_io (kernel)
*1001 thread_block_reason (kernel)
To simplify, we have this mess:
Messages main thread
↳ blocked on Messages secondary thread
↳ blocked on ContactsAccountsService
↳ blocked on tccd
On Darwin, threads block inside throttle_lowpri_io
when they're being artificially delayed to slow down their I/O operations, with the ultimate goal of optimizing the performance of higher-priority I/O. And indeed, in both of these cases (and in the other similar problems I saw), the chain of blockage ultimately leads to a thread with less-than-highest I/O priority.
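For context on how a thread ends up at a less-than-highest I/O priority in the first place: one common way is simply running at background QoS (note the com.apple.root.background-qos queue in the Messages sample above), which, among other things, demotes the thread's disk I/O to a throttled tier. Here's a contrived sketch, with a made-up file path, of a thread doing exactly that:

#include <fcntl.h>
#include <pthread.h>
#include <pthread/qos.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* Move this thread to background QoS; its disk I/O is demoted along with it. */
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);

    /* Hypothetical path; any read issued here is now throttle-able and can
     * park the thread inside throttle_lowpri_io when higher-tier I/O is active. */
    int fd = open("/tmp/some-file", O_RDONLY);
    if (fd >= 0) {
        char buf[4096];
        (void)read(fd, buf, sizeof(buf));
        close(fd);
    }
    return NULL;
}

int main(void)
{
    pthread_t thread;
    pthread_create(&thread, NULL, worker, NULL);
    pthread_join(thread, NULL);
    return 0;
}

If another thread then waits on a lock this one holds (directly or, as in the FaceTime and Messages cases, through a chain of dependencies), the delay propagates all the way up to the UI.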
Once you start looking, it's not hard to find dozens of examples of this just through normal usage of different apps.[1] Why so many UI commands on macOS are blocking on low-priority threads is less a technical question than one of Apple's institutional priorities. (A question that I have lots to say about, but not here.)
But I was interested in a different question. Ordinarily, I/O operations on one disk device don't throttle those on another device. This makes sense, because multiple devices can operate independently without contention. Yet in this case, all of the throttled I/O operations were to the main disk (containing the root and user data volumes), while my dd
was copying from one external disk to another. There was no possibility of contention. So why was throttling still happening?
Throttling domains
To keep track of which I/Os should be throttled, the Darwin kernel maintains what I'll call throttling domains (the source calls them struct _throttle_io_info_t
). In rough terms, each throttling domain is meant to correspond one-to-one to a disk device.
When an I/O is issued through the spec_strategy
routine, the kernel has to determine which throttling domain the operation lives in, so that the operation may either be throttled or cause throttling of lower-priority operations. The throttling domain is determined first by taking the vnode (i.e., file) the I/O is being done to and walking up to its enclosing mount_t
.
From there, the code looks at the mount's mnt_devbsdunit
property. The mnt_devbsdunit
describes the "disk number" of the device the filesystem lives on. If a filesystem is mounted from /dev/disk3
, then the mount's mnt_devbsdunit
is 3
. If the backing disk is actually a partition of a disk, then the number comes from the whole disk, not the partition; e.g., /dev/disk3s2
results in 3
.[2]
The mnt_devbsdunit
—which can range from 0 to 63[3]—determines which throttling domain is in play.
I find it useful to back up a theory with an example. One good way here is to instrument the kernel with (the recently neglected) dtrace. The following dtrace script triggers on spec_strategy
calls and logs out the vnode, mount point, and computed mnt_devbsdunit
:
fbt::spec_strategy:entry
{
    this->strategy_ap = (struct vnop_strategy_args *)arg0;
    this->bp = this->strategy_ap->a_bp;
    this->vp = this->bp->b_vp;
    this->mp = this->vp->v_mount;
    this->unit = this->mp ? this->mp->mnt_devbsdunit : 0;
    printf("strategy: vp=%p (%s), mp=%p (%s), unit=%u",
        this->vp, stringof(this->vp->v_name),
        this->mp, this->mp ? stringof(this->mp->mnt_vfsstat.f_mntonname) : "???",
        this->unit);
}
Here's an example of what that script outputs:
strategy: vp=ffffff802d78e400 (mom.img), mp=ffffff801fc65a20 (/Volumes/ExternalHDD), unit=3
This says that a process is doing I/O to read or write a file named "mom.img", which lives in the mountpoint /Volumes/ExternalHDD
. Here's what mount(8)
tells us about that mountpoint:
$ mount | fgrep /Volumes/ExternalHDD
/dev/disk3s2 on /Volumes/ExternalHDD (hfs, local, nodev, nosuid, journaled, noowners)
The 3
in disk3s2
matches the unit=3
in the dtrace output, as we expect.
But here's a second example:
strategy: vp=ffffff8029027e00 (History.db), mp=ffffff801fc67880 (/System/Volumes/Data), unit=0
$ mount | fgrep /System/Volumes/Data
/dev/disk1s1 on /System/Volumes/Data (apfs, local, journaled, nobrowse)
Here, the 1
in disk1s1
does not match the unit=0
from dtrace. Why?
Logical volume groups and mnt_throttle_mask
Apple added a logical volume manager, called CoreStorage, to Mac OS X Lion. In contrast to traditional disk partitions, in which a contiguous range of a disk device is used as a volume, CoreStorage allows a looser relationship between volumes and backing storage. For instance, a volume might use storage from multiple disk devices—witness Fusion Drive.
This complicates the mnt_devbsdunit
situation. Suppose a filesystem is mounted from volume disk2
. According to the previous rules, mnt_devbsdunit
is 2
. However, disk2
might be a CoreStorage logical volume, backed by the real disk devices disk0
and disk1
.
Moreover, CoreStorage might not be the only user of disk0
and disk1
. Suppose further that there's a second, non-CoreStorage volume on disk0
, called disk0s3
. I/Os to disk2
and disk0s3
may contend with each other. But the mnt_devbsdunit
of disk0s3
is 0
, so the two mounts will be in different throttling domains.
To solve this, enter a second mount_t
field, mnt_throttle_mask
. mnt_throttle_mask
is a 64-bit bit array. A bit is set only when I/Os to the mount may involve the correspondingly numbered disk device. For our CoreStorage logical volume disk2
, since disk0
and disk1
are included, bits 0 and 1 are set. Bit 2 is also set for the logical volume itself, so the overall mask is 0x7
.
In theory, you might imagine a system wherein a mount could reside in multiple throttling domains. Or perhaps the throttling domain decision could be pushed down so that CoreStorage could help make smart decisions about which to use for a particular I/O operation.
The implemented reality is much more mundane. mnt_devbsdunit
is set to the index of the lowest bit set in mnt_throttle_mask
. For disk2
, since bit 0 is set, mnt_devbsdunit
is 0. So disk2
and disk0s3
live in the same throttling domain (though, notably, a theoretical disk1s3
would not).
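As a quick illustration of that rule, here's a tiny user-space sketch (purely illustrative; it just mirrors the arithmetic, not the kernel's actual code) for the hypothetical disk2 volume above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical CoreStorage logical volume disk2, backed by disk0 and disk1:
     * one bit per disk device that I/O to this mount might touch. */
    uint64_t mnt_throttle_mask = (1ULL << 0)   /* backing store on disk0 */
                               | (1ULL << 1)   /* backing store on disk1 */
                               | (1ULL << 2);  /* the logical volume itself, disk2 */

    /* mnt_devbsdunit is the index of the lowest set bit (the mask is never
     * zero here): 0, so this mount lands in disk0's throttling domain. */
    unsigned int mnt_devbsdunit = (unsigned int)__builtin_ctzll(mnt_throttle_mask);

    printf("mask=0x%llx unit=%u\n",
        (unsigned long long)mnt_throttle_mask, mnt_devbsdunit);
    return 0;
}

Running it prints mask=0x7 unit=0, matching the discussion above.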
This explains what's happening with /System/Volumes/Data
above. disk1s1
is a logical volume presented by a volume manager[4], and its backing storage is on disk0
. Tweaking the dtrace script shows that mnt_throttle_mask
is 0x3
:
strategy: vp=ffffff8029027e00 (History.db), mp=ffffff801fc67880 (/System/Volumes/Data), mask=3, unit=0
devfs weirdness
Popping the stack back to the original problem, it's now interesting to look at the I/Os done by dd
through the lens of the dtrace script. This is what they look like:
strategy: vp=ffffff802c603600 (disk2), mp=ffffff801fc696e0 (/dev), mask=3f, unit=0
Notice first that spec_strategy
is being asked to do an I/O on the file /dev/disk2
, not a regular file as before. Though unusual, this makes sense: no filesystem is actually mounted from the disk2
device, and dd
is explicitly attempting to read the disk2
special file itself.
As a side effect, the mount point is deduced to be /dev/
. Again, this is unusual, but it seems to be the least unreasonable option: there is nothing mounted from disk2
itself, and /dev/
is the mount point that holds the disk2
special file.
This is where things go weird, though. dtrace reports the mnt_throttle_mask
of /dev/
to be 0x3f
. In other words, /dev/
claims to be similar to a logical volume made up of exactly these disks: disk0
, disk1
, disk2
, disk3
, disk4
, and disk5
.
Never mind that there are no disk4
or disk5
attached to my system. What on earth would this even mean? /dev/
is the mount point of a synthetic filesystem (devfs
) of special files. Sure—some of those special files do indeed correspond to disk devices. But others correspond to non-disk devices. Still others, like /dev/null
, are completely fabricated by software.
And, perhaps most curiously, why does the list stop at disk5
?
This question is perhaps best answered by imagining what a reasonable value of mnt_devbsdunit
would be for devfs. In an ideal world, perhaps each vnode in devfs might be assigned a throttling domain independently, such that /dev/disk0
lived in the 0
domain, /dev/disk1
in the 1
domain, etc.
Unfortunately, the reality of the design allows us to assign only a single mnt_devbsdunit
for all of devfs. So a reasonable, if far from ideal, solution is to assign 63
—a value that will put devfs in its own throttling domain, as long as fewer than 64 disk devices are attached.
The bug
Assigning 63
is in fact what the code did, prior to Lion. When a mount is created, a mount backed by a device vnode is assigned the BSD unit number of that vnode:
if (device_vnode != NULL) {
    VNOP_IOCTL(device_vnode, DKIOCGETBSDUNIT, (caddr_t)&mp->mnt_devbsdunit, 0, NULL);
    mp->mnt_devbsdunit %= LOWPRI_MAX_NUM_DEV /* 64 */;
}
All others, like devfs, are assigned a backstop value:
mp->mnt_devbsdunit = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;
Unfortunately, when mnt_throttle_mask
was introduced in Lion, the backstop value was changed to:
mp->mnt_throttle_mask = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;
mp->mnt_devbsdunit = 0;
This seems wrong! First, remember that mnt_throttle_mask
is a 64-bit bit array, whereas mnt_devbsdunit
is an ordinal. There are various backstop values that might make sense for mnt_throttle_mask
: 0
(all bits cleared) or ~0
(all bits set) are two obvious candidates.
But LOWPRI_MAX_NUM_DEV - 1
—in other words, 63
or 0x3f
—is not one of them, and it's pretty clear that the value was incorrectly copied from the old fallback initialization of mnt_devbsdunit
.
Second, and more importantly, the backstop value of mnt_devbsdunit
switched from 63
to 0
. 0
is indeed the "correct" value corresponding to a mask of 0x3f
, but such a change puts all of devfs in the same throttling domain as disk0
.
To preserve the old behavior, I'd suggest these backstop values instead be: 1ULL << 63
for the mask, and—correspondingly—the old value of 63
for mnt_devbsdunit
.
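Concretely, the backstop initialization would then look something like this (my sketch of the suggested change, not code from xnu):

mp->mnt_throttle_mask = 1ULL << (LOWPRI_MAX_NUM_DEV /* 64 */ - 1);  /* only bit 63 set */
mp->mnt_devbsdunit = LOWPRI_MAX_NUM_DEV /* 64 */ - 1;               /* i.e., 63, as before Lion */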
A few workarounds
Use the character device
Each disk device is represented with two special files: a block device (e.g., /dev/disk0
, like the ones I've been using above) and a character device (e.g., /dev/rdisk0
). I/O operations to the block device go through the buffer cache and therefore through spec_strategy
, as outlined above. I/O operations on the character device, however, bypass the buffer cache and spec_strategy
entirely.
Notably, though, I/Os on the character device don't bypass throttling. For I/O on a character device, there's special code to determine the correct throttling domain. But since no mount points are involved here, the throttling domain is determined based on the vnode alone. This is pretty much exactly what we want (and what we couldn't do when we had to assign a single throttling domain to all of devfs).
For some use cases, the fact that the character device bypasses the buffer cache could be a problem. Otherwise, this seems like the optimal solution.[5]
Use IOPOL_PASSIVE
In addition to assigning a priority tier to its I/O operations, a process may mark its I/O as passive; passive I/O may be throttled but doesn't cause throttling of other I/Os.
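In code, this is a single call to setiopolicy_np(3); a minimal sketch, assuming you control the source of the program doing the bulk I/O:

#include <sys/resource.h>
#include <err.h>

int main(void)
{
    /* Mark all of this process's disk I/O as passive: it can still be
     * throttled, but it won't cause other I/O to be throttled. */
    if (setiopolicy_np(IOPOL_TYPE_DISK, IOPOL_SCOPE_PROCESS, IOPOL_PASSIVE) != 0)
        err(1, "setiopolicy_np");

    /* ... do the bulk copy here ... */
    return 0;
}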
Recompiling dd
to call setiopolicy_np(3)
would be a hassle. An easier way is to use the taskpolicy(8)
modifier utility that comes with recent versions of macOS. Though not documented in the manpage, the -d
option can take the argument passive
, like:
# taskpolicy -d passive dd if=...
Turn off throttling temporarily
There are a bunch of sysctls available to tune the behavior of the I/O throttling system, including one to shut it off entirely:
# sysctl debug | fgrep lowpri_throttle
debug.lowpri_throttle_max_iosize: 131072
debug.lowpri_throttle_tier1_window_msecs: 25
debug.lowpri_throttle_tier2_window_msecs: 100
debug.lowpri_throttle_tier3_window_msecs: 500
debug.lowpri_throttle_tier1_io_period_msecs: 40
debug.lowpri_throttle_tier2_io_period_msecs: 85
debug.lowpri_throttle_tier3_io_period_msecs: 200
debug.lowpri_throttle_tier1_io_period_ssd_msecs: 5
debug.lowpri_throttle_tier2_io_period_ssd_msecs: 15
debug.lowpri_throttle_tier3_io_period_ssd_msecs: 25
debug.lowpri_throttle_enabled: 1
# sysctl -w debug.lowpri_throttle_enabled=0
debug.lowpri_throttle_enabled: 1 -> 0
1. Amusingly, even spindump's symbolication step—which relies on an external daemon—suffers from this kind of problem—one which I diagnosed, of course, with another well-timed spindump. ↩︎
2. To be more concrete, the number comes from the DKIOCGETBSDUNIT ioctl, implemented for disk devices by IOMediaBSDClient; the value it returns comes from IOMediaBSDClient::createNodes(). ↩︎
3. Technically, from 0 to LOWPRI_MAX_NUM_DEV - 1. Devices with BSD unit numbers of LOWPRI_MAX_NUM_DEV or greater get mapped down using the mod operator, so /dev/disk0 and /dev/disk64 share a throttling domain. This probably doesn't come up in practice, even if it is a little funky. ↩︎
4. The APFS volume manager—which has supplanted CoreStorage—in this case. ↩︎
5. Incidentally, this appears to be how the Disk Utility app avoids the problem, too. ↩︎