Homebrew home surveillance camera



I wouldn't consider myself a home surveillance nut, but there are times I've wished I could look in on my house remotely. Unfortunately, as far as I can tell, all of the network surveillance cameras on the market either:

  1. are known to be completely insecure,
  2. should reasonably be assumed to be completely insecure, or
  3. are insanely expensive and presumably targeted at industrial customers (and probably some combination of (1) and (2) anyway, for good measure)

Also, most of the non-insanely-expensive options seem poorly composable—either they don't use a standard streaming protocol, or they require use of their own iPhone app to operate. Most of them hint at uploading data to manufacturer-controlled remote servers. Some of them brag about their (Macromedia) Flash-based control UI.

So, over the summer, I set out to build my own home camera system. It took a bit of work to get the various parts to work just the way I wanted, and I'm quite happy with the result, so I'm documenting it here.

Hardware

  1. A USB webcam. I chose the Logitech C310, because it was reasonably cheap, came from a well known manufacturer, and was known to work on FreeBSD.

  2. A FreeBSD machine, though I suspect almost all of this would work on a typical Linux distribution, too. Helps if you already have way too many of these lying around the house. I used an old freebie hand-me-down.

(Total cost of hardware: around $30.)

Basic capture

The first step is to install and enable webcamd. webcamd is an encapsulation of some ported-from-Linux webcam drivers (including the USB Video Class, UVC), running in userland and publishing device nodes via CUSE. I just took the version from pkg. It requires the cuse.ko kernel module (/boot/kernel/cuse.ko), which is already installed as part of the FreeBSD base system. To enable it, add this to /boot/loader.conf:

cuse_load="YES"

and/or immediately load it with

# kldload cuse

Similarly, enable webcamd in your rc.conf:

webcamd_enable=YES

and start it with

# service webcamd start
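
If all went well, plugging in the camera should cause webcamd to create a V4L device node (the number may differ on your machine):

$ ls /dev/video*
/dev/video0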

Once that's running, you should have enough to plug in the webcam and stream from it. Since /dev/video0 isn't a simple device file that you can just read from (trying to cat it gives EINVAL), the easiest thing is to go ahead and install ffmpeg, which knows how to issue the right ioctls (as defined by the Video4Linux (V4L) interface) to actually read frames.

You'll eventually want ffmpeg-5.1 or newer (to pick up a bugfix described below), and you'll also want to build ffmpeg from source. But for initial testing, it's good enough to install the prebuilt ports package (as of this writing, version 4.4.3). Then you can do something like:

# ffmpeg -f v4l2 -i /dev/video0 -f mp4 -c:v libx264 /tmp/capture.mp4

Let that run for a few seconds, then stop it (press q or Ctrl-C so ffmpeg can finish writing the file). Then ship the file over to any machine with a modern video player and take a look.
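
For instance, pulling it down to a Mac for a quick look (the hostname here is just a placeholder):

$ scp camerahost:/tmp/capture.mp4 . && open capture.mp4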

To add sound, you'll have to figure out the sound device corresponding to the webcam's mic. Sound devices in FreeBSD are governed by the Open Sound System, which seems to be another Linux-originated thing. Each sound I/O device is given a /dev/dspX.Y node, where X is the number of the device and Y is the channel within the device. More details in snd(4). Devices and channels can be interrogated by catting /dev/sndstat; setting sysctl hw.snd.verbose to 4 first will give you the most info.[1]
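
Concretely, the interrogation looks something like this (run as root; the output is long, so I've elided it):

# sysctl hw.snd.verbose=4
# cat /dev/sndstat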

In my case, the mic is at /dev/dsp2.0, so I can add audio to my ffmpeg capture like this:

$ ffmpeg -f v4l2 -i /dev/video0 \
         -f oss  -i /dev/dsp2.0 \
         -f mp4 -c:v libx264 -c:a aac /tmp/capture.mp4

Note that ffmpeg has this (initially) weird command line syntax scheme that looks something like this:

$ ffmpeg [input1 options] -i input1
         [input2 options] -i input2
         ...
         [output1 options] output1
         [output2 options] output2
         ...

I find it's useful to organize the command into lines like I have above, or I quickly lose track of what options belong to what.

Anyway, at least on my machine, simply reading those two inputs like that doesn't work very well. The resulting capture is super stuttery, and ffmpeg emits lots of errors about Non-monotonous DTS in output stream.[2] Marginally more helpfully, it also suggests a fix:

[video4linux2,v4l2 @ 0x80ee68000] Thread message queue blocking; consider raising the thread_queue_size option (current value: 8)
[oss @ 0x80ee68600] Thread message queue blocking; consider raising the thread_queue_size option (current value: 8)

According to the documentation for thread_queue_size:

This option sets the maximum number of queued packets when reading from the file or device. With low latency / high rate live streams, packets may be discarded if they are not read in a timely manner; setting this value can force ffmpeg to use a separate input thread and read packets as soon as they arrive. By default ffmpeg only does this if multiple inputs are specified.

This makes intuitive sense to me. Having a bunch of slack between the threads pulling frames out of the drivers and the thread encoding the output allows slight hiccups (due to, for example, the non-uniform nature of video compression) to be handled gracefully.

And indeed, adding -thread_queue_size 4096 to both inputs made the stutteriness disappear.
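
For reference, that makes the basic capture command:

$ ffmpeg -f v4l2 -thread_queue_size 4096 -i /dev/video0 \
         -f oss  -thread_queue_size 4096 -i /dev/dsp2.0 \
         -f mp4 -c:v libx264 -c:a aac /tmp/capture.mp4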

Streaming with HLS

The modern way to stream a live video feed is with HTTP Live Streaming. Essentially every modern video player and framework supports it (browsers handle it either natively or via a small JavaScript library), and, by the low bar of multimedia standards, it's pretty simple and well documented.

An HLS stream is made up of a series of short audio/video segment files, usually in the MPEG-TS container format, and an M3U8 playlist file that describes them. All the files are served up over a plain old web server; the client simply fetches the playlist file periodically to find out about new segments and downloads them.

This has all sorts of benefits here, most of all that you can take free advantage of the various things that web servers and browsers already do. Want your video stream to be password-protected? Use HTTP basic authentication, or any other normal means of putting auth around HTTP-served content. Need encryption? Enable HTTPS, using e.g. Let's Encrypt.

ffmpeg has built-in support (a muxer, in its parlance) for generating the files for an HLS stream. Here's the basic configuration I'm using:

$ ffmpeg [input options ...] \
         -y -hls_time 2 -hls_list_size 6 -hls_flags delete_segments \
         path/to/stream/stream.m3u8

In a bare-bones setup like what I'm aiming for, path/to/stream/stream.m3u8 can simply be a path somewhere inside my HTTP server's web root. The MPEG-TS segments will correspondingly be placed alongside it, at path/to/stream/stream0.ts, stream1.ts, and so on.

-hls_time 2 says that each segment is two seconds long. -hls_list_size 6 says that the last six segments should be kept around; -hls_flags delete_segments tells ffmpeg to delete older segments once they've dropped off the playlist. You can tweak these values to adjust the balance between latency and server load: longer and more numerous segments mean higher latency, but fewer requests for the server to field. Fuller command-line documentation here.
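
Putting it together with the inputs from earlier, a complete live-streaming invocation looks roughly like this (the output path is a placeholder for wherever your web root lives):

$ ffmpeg -f v4l2 -thread_queue_size 4096 -i /dev/video0 \
         -f oss  -thread_queue_size 4096 -i /dev/dsp2.0 \
         -c:v libx264 -c:a aac \
         -y -hls_time 2 -hls_list_size 6 -hls_flags delete_segments \
         /path/to/webroot/stream/stream.m3u8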

Once the HLS stream is set up, you can create a dead-simple HTML page with a <video> tag pointing at it.

Other options for consuming the HLS stream include entering the m3u8 URL into a media player app (like QuickTime Player or VLC)—or using the underlying media playback APIs provided by your OS, like AVPlayerView on macOS.
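
For a quick sanity check from a shell, ffplay (built alongside ffmpeg, assuming your build has SDL available) will also play the playlist URL directly; substitute your own host and path:

$ ffplay https://camera.example.com/stream/stream.m3u8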

Timestamping

Hopefully I'll have little need to use this system for forensic purposes, but in case I do, it would be nice to have timestamps embedded into the video.

ffmpeg's libavfilter is a filter graph system that you can insert into an ffmpeg media pipeline, and it includes a filter for drawing text on video frames: drawtext.

To use it, your build of ffmpeg will need to have been configured with --enable-libfreetype, which in turn requires --enable-gpl. (See below for my full configure invocation for ffmpeg.) That might mean that you need to rebuild ffmpeg from source.

Once you have all the right stuff installed, you can specify the filter in the arguments to ffmpeg:

$ ffmpeg -f v4l2 -i /dev/video0 \
         -f oss  -i /dev/dsp2.0 \
         -filter:v "drawtext=fontcolor=yellow:fontfile=${FONT_FILE}:text=%{localtime}:x=10:y=10:shadowx=1:shadowy=1" \
         -f mp4 -c:v libx264 -c:a aac /tmp/capture.mp4

${FONT_FILE} should be a path to a TrueType or OpenType (or anything else supported by FreeType) font file. Make sure the path doesn't contain any colons, or equal signs, or percent signs, I guess, since those are all special characters in the filter syntax. (I'm using a font installed by the dejavu port.) The syntax for the filter parameter is documented here.
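
For example, on my FreeBSD box the dejavu port installs its fonts under /usr/local/share/fonts/dejavu, so something like this works (adjust to wherever your fonts actually live):

$ FONT_FILE=/usr/local/share/fonts/dejavu/DejaVuSans.ttf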

Nothing fancy, but it does the job. I'll note that the offset shadow (no blur radius, AFAICT) helps the text remain legible on both light and dark backgrounds.

Motion detection, first attempt

If having an Internet-accessible webcam is all you need, you can stop here. However, I was also interested in employing some sort of motion detection to capture "interesting" clips for review later.

I initially wasted a bunch of time trying to get the (poorly named) Motion Project to work. As far as I can tell, it's essentially the only option out there for open-source, programmable video motion detection. I initially wrote a long diatribe about how awful this thing is, but for your sake I'm editing it out in favor of a summary. It's poorly documented. It doesn't support modern standards. It's inflexible.

Oh—and it doesn't support audio. Lest you think that this is some correctable oversight rather than chronic dysfunction, the maintainers are happy to disabuse you of that notion. They ramble on incomprehensibly about the "legal implications (covert recording of audio is not permitted in many countries)" and how they "[believe] in making sure the application keeps the focus on its core functionality of video". 🙄

Building my own

So after spending days trying to get Motion to do what I wanted, I basically burned out on the entire idea and put it aside for a while. I knew that what I was trying to do was not intrinsically difficult, so I was mostly just waiting to work up the courage and momentum to throw away some hard-earned progress and start on a new approach.

Eventually I encountered a fount of motivation and spent a bunch of late nights creating my own motion detector. I'm calling the result: sophie.

sophie is a simple motion detection daemon built on FFmpeg. It supports any video stream FFmpeg does, including of course HLS. It supports saving clips with audio. Since it's my own code, things like "what areas to detect motion in" and "hysteresis" and "clip history" are configurable to my heart's content.

The motion detection scheme is dead simple. The frames come in as YUV; I just diff the Y channel of temporally adjacent frames. Three "filters", if you can call them that, are applied to the result to reduce false positives:

  1. Samples are counted as "different" if their delta exceeds some threshold. This discards sensor noise.
  2. Frames are counted as "interesting" if enough samples in the frame are "different", to discard small bits of motion (e.g., leaves fluttering in the breeze).
  3. Motion is detected when X out of the last Y frames are "interesting", to discard transient dazzle (e.g., specular reflection off of a moving car).

I tuned all of the various values experimentally until I was happy with the results.
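
sophie's detector is its own code, but if you just want to eyeball the general idea without writing any, ffmpeg's select filter computes a crude per-frame difference "scene score" that you can print for an existing stream. This is only a rough approximation of the scheme above, not what sophie actually does, and the 0.02 threshold here is arbitrary:

$ ffmpeg -i https://camera.example.com/stream/stream.m3u8 \
         -vf "select='gt(scene,0.02)',metadata=print" \
         -f null -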

The daemon keeps a set number of video and audio samples in a rolling buffer. When motion is detected, the rolling buffer is encoded to a new output file, and all incoming video and audio samples are also encoded until motion ends (and a hysteresis timer expires). It also saves out a still frame to disk.

When a motion event occurs, sophie can also be configured to invoke a command, passing information about the event as arguments. I have mine set up to run a shell script that sends me e-mail. The still frame is encoded into the message as an attachment, and a link to the video (served by Apache running on the same machine) is provided for optional viewing. I set up my phone to notify me about mail from the special address I used, and just like that I had a cheap push notification.
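
My script is nothing special. Here's a minimal sketch of the idea, assuming mutt is installed for the attachment handling and assuming (hypothetically) that the still-frame path and clip filename arrive as the first two arguments; adjust to match however you configure sophie and wherever your clips are served from:

#!/bin/sh
# $1: path to the still frame, $2: clip filename.
# (This argument convention is hypothetical; match it to your sophie config.)
still="$1"
clip="$2"
echo "Motion detected: https://camera.example.com/clips/${clip}" | \
    mutt -s "Motion detected" -a "${still}" -- alerts@example.com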

Sophie is open-source. It's not intentionally inflexible, and (unlike some of my other projects) I could see it being useful to others. So far, though, I've made almost no specific attempt to make that easy. In particular, I've only tested it on macOS and FreeBSD, and only with my own HLS streams. I'm happy to answer questions about it in e-mail.

Odds and ends

FFmpeg OSS timestamp bug

One of my programming rules of thumb is this: if you're not debugging both your callers and callees (in addition to your own code), then you're not doing your job. Using FFmpeg is a great lesson in why this is needed. I spent a lot of time stepping through FFmpeg's code and ultimately learned quite a bit about its design. (This highlights another benefit of building your own FFmpeg libraries: having usable debug symbols.)

Surprisingly, I found a really longstanding, yet super simple bug in the Open Sound System audio input code. Incoming audio buffers are timestamped with the time of their first sample. To compute this, FFmpeg subtracts the buffer's "duration" from the time it was received from the kernel. The duration is just its sample count divided by the sample rate.

The bug was that the code didn't properly account for the fact that each sample is two bytes wide. This meant that each buffer timestamp was a bit earlier than it should have been. Normally this was pretty innocuous, but since the buffer sizes were also variable, you'd eventually run into a situation where the timestamps of consecutive buffers would go in reverse.

The fix was easy, and the FFmpeg maintainers were nice enough to take my patch, which is included in FFmpeg-5.1 and newer.

Profiling

sophie is reasonably performance-sensitive. The motion detection code needs to run in real time, or it'll get behind, and eventually it'll miss a chunk of the video. (Recall that the HLS feed only keeps around 6 × 2 = 12 seconds of history.) That could also in turn cause it to detect spuriously.

Also, when it does detect motion, the code it runs in response needs to be fast enough that it doesn't start dropping frames at that point. That would of course be the worst time to start dropping frames.

I wrote in a previous entry about drspin, a sampling profiler for FreeBSD. I was motivated to write drspin to help tune the performance of this project, and it turned out to be quite useful.

Building ffmpeg

Here are my configure invocations for building FFmpeg:

FreeBSD:

$ ./configure --cc=cc --enable-libv4l2 --enable-libx264 --enable-gpl --enable-nonfree --enable-libfreetype --enable-filter=drawtext --enable-openssl --extra-cflags=-g --disable-stripping --enable-optimizations --enable-shared

macOS:

$ ./configure --cc=cc --enable-securetransport --extra-cflags=-g --disable-stripping --disable-optimizations --enable-shared
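
After configure, the build and install are the usual routine (same on both platforms; pick a job count to taste):

$ make -j4
# make install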

Future directions


  1. If, like me, you like to poke around in /dev/ to find things that might be what you're looking for, that won't work here. The sound device code has a hook into namei that lets it create the device node lazily, so you really need to know up front what you're looking for. This is pretty annoying, IMO. ↩︎

  2. Yes, I obviously filed a bug about the unintentionally ironic wording, and it predictably went right over their heads. ↩︎