Neural Greenscreen: Recreating Virtual Zoom Backgrounds on macOS Using CoreML and CoreMediaIO

Philipp
5 min read · Oct 22, 2021

Real-time background replacement for a macOS-driven webcam, using the DeepLabV3 neural network for image segmentation and macOS's native CoreMediaIO DAL framework.

The final result of the plugin that we will create.

The core idea is to create a custom webcam that is readable by the system. For this, Apple provides the CoreMediaIO framework with the Device Abstraction Layer (DAL) interface. Since the CoreMediaIO framework is very low-level, there is a lot of work to do before a simple stream can be dispatched to the plugin interface. Thankfully, other people have already created solutions that we can build on, here and here. Check these links out if you want to know more about setting up a basic CoreMediaIO DAL plugin. This tutorial will focus on the background replacement itself. However, the final project can be found on GitHub if you need a plug-and-play version. What you should already have is a CVPixelBuffer from the webcam, which we can modify and submit to the hosting application. Here is an overview of what we are going to implement:

The base routine.

Let’s roll. First, we receive a webcam input pixel buffer and preload our background into a CIImage. Adjust this base code to your needs or to your own interfaces.
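If you are following along without the full project, a minimal sketch of this step could look as follows. The MaskCompositor class name, the backgroundURL parameter and the process(webcamBuffer:) entry point are illustrative placeholders, not the exact interfaces of the original plugin:

```swift
import CoreImage
import CoreVideo
import Foundation

// Illustrative sketch: a small compositor that the DAL plugin feeds with webcam frames.
final class MaskCompositor {
    // Preload the replacement background once; loading it per frame would be wasteful.
    private let background: CIImage
    private let ciContext = CIContext()

    init?(backgroundURL: URL) {
        guard let image = CIImage(contentsOf: backgroundURL) else { return nil }
        self.background = image
    }

    // Entry point called for every webcam frame; the body is filled in over the next steps.
    func process(webcamBuffer: CVPixelBuffer) -> CVPixelBuffer {
        let cameraImage = CIImage(cvPixelBuffer: webcamBuffer)
        _ = cameraImage // mask generation and compositing follow below
        return webcamBuffer
    }
}
```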

Next, we need to call our mask processing routine in a non-blocking way so that the webcam stream doesn't stutter. This is very important, since the frames are submitted to the DAL interface with a timestamp. If we perform too many operations on the main thread, our webcam could even crash. Keep that in mind. We want to outsource all time-consuming operations to a background queue. Depending on the quality of service (QoS) we choose, the mask will update more or less frequently and use more or less computational resources. I chose a userInteractive global queue.
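In the sketch from above, that could look roughly like this. The generateMask(for:) and composite(_:with:into:) functions are defined in the following steps, and a real implementation should also synchronize access to the shared mask property:

```swift
// Latest mask produced on the background queue; frames reuse it until a newer one arrives.
private var currentMask: CIImage?

func process(webcamBuffer: CVPixelBuffer) -> CVPixelBuffer {
    let cameraImage = CIImage(cvPixelBuffer: webcamBuffer)

    // Generate the next mask off the main thread so the frame callback never blocks.
    DispatchQueue.global(qos: .userInteractive).async { [weak self] in
        guard let self = self,
              let mask = self.generateMask(for: webcamBuffer) else { return }
        // Assumption: in production, access to currentMask should be protected
        // (e.g. with a lock); omitted here for brevity.
        self.currentMask = mask
    }

    // Composite the current frame with whatever mask is available right now.
    return composite(cameraImage, with: currentMask, into: webcamBuffer)
}
```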

We run a CIFilter to combine the webcam stream with the generated mask image. Because of the way the Core Image framework performs its operations, we will need to render the generated CIImage into a newly created pixel buffer. Before rendering, the CIImage merely describes our chain of operations, so don't attempt to convert it straight into a pixel buffer, CGImage, or UIImage (which isn't available on macOS anyway).
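A sketch of that compositing step, using the built-in CIBlendWithMask filter (the original project may combine the images differently). The createPixelBuffer helper is introduced further below, and the background image is assumed to already match the webcam resolution:

```swift
func composite(_ camera: CIImage, with mask: CIImage?, into buffer: CVPixelBuffer) -> CVPixelBuffer {
    // Without a mask (e.g. before the first inference has finished), pass the frame through.
    guard let mask = mask,
          let filter = CIFilter(name: "CIBlendWithMask") else { return buffer }

    // White mask pixels keep the camera image (the person), black pixels show the background.
    filter.setValue(camera, forKey: kCIInputImageKey)
    filter.setValue(background, forKey: kCIInputBackgroundImageKey)
    filter.setValue(mask, forKey: kCIInputMaskImageKey)

    guard let output = filter.outputImage,
          let outputBuffer = createPixelBuffer(width: CVPixelBufferGetWidth(buffer),
                                               height: CVPixelBufferGetHeight(buffer))
    else { return buffer }

    // The CIImage is only a recipe until we render it into an actual pixel buffer.
    ciContext.render(output, to: outputBuffer)
    return outputBuffer
}
```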

Now, we only need to implement the logic to create a segmentation mask from the input webcam pixel buffer. Thankfully, the Apple CoreML model zoo already provides a model for image segmentation. It is called DeepLabV3 and comes in three different versions. We will use the optimized DeepLabV3Int8LUT model, which is quantized and therefore four times smaller than the original model (8-bit integer parameters instead of 32-bit floats). Don't worry, quantization won't impact the mask quality in a perceivable way. Depending on your device, the model could even run much faster.
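Loading the quantized model is only a few lines once the .mlmodel file is added to the Xcode target, since Xcode generates a DeepLabV3Int8LUT class for it:

```swift
import CoreML

// .all lets CoreML choose between CPU, GPU and the Neural Engine.
let configuration = MLModelConfiguration()
configuration.computeUnits = .all
guard let model = try? DeepLabV3Int8LUT(configuration: configuration) else {
    fatalError("Could not load the DeepLabV3Int8LUT segmentation model")
}
```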

We also need some attributes to keep track of our Metal pipeline which we will use to render the segmentation mask from the ML model.
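In the sketch, these attributes could look like this (they live inside the mask generator class; the names are illustrative):

```swift
// Additional imports needed at the top of the file.
import CoreML
import Metal

// State the mask generator keeps around between frames.
private let device: MTLDevice
private let commandQueue: MTLCommandQueue
private let pipelineState: MTLComputePipelineState
private let ciContext: CIContext
private let model: DeepLabV3Int8LUT

// Set while an inference is running so we never process two frames at once.
private var isBusy = false
```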

Next, we will need a constructor that initializes all of our attributes.
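A possible constructor, assuming the Metal kernel (shown further below) is compiled at runtime from a source string; the original project presumably ships a precompiled .metal file instead, in which case you would load it with makeDefaultLibrary(bundle:):

```swift
init?() {
    guard let device = MTLCreateSystemDefaultDevice(),
          let commandQueue = device.makeCommandQueue(),
          // maskShaderSource is the Metal kernel shown further below,
          // compiled at runtime instead of being shipped as a .metal file.
          let library = try? device.makeLibrary(source: Self.maskShaderSource, options: nil),
          let function = library.makeFunction(name: "segmentationMask"),
          let pipelineState = try? device.makeComputePipelineState(function: function),
          let model = try? DeepLabV3Int8LUT(configuration: MLModelConfiguration())
    else { return nil }

    self.device = device
    self.commandQueue = commandQueue
    self.pipelineState = pipelineState
    self.ciContext = CIContext(mtlDevice: device)
    self.model = model
}
```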

We cannot simply use the raw camera pixel buffer as input for the ML model. My MacBook's webcam delivers frames at 1280x720, but the model input is constrained to 513x513 pixels. To approach this problem, we could wrap the CoreML model in a Vision request, which would give us automatic resizing of our input image for free. However, the Vision framework has reported instabilities (as of now, fall 2021) that would prevent us from using it in our use case. So we will do the preprocessing ourselves: we can use a CGAffineTransform to resize the webcam pixel buffer and our CIContext to render the transformed frame into a new pixel buffer.
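A sketch of that preprocessing step; createPixelBuffer is one of the helpers defined further below:

```swift
// DeepLabV3 expects a 513x513 input, while the webcam delivers e.g. 1280x720 frames.
private func resize(_ buffer: CVPixelBuffer, to side: Int = 513) -> CVPixelBuffer? {
    let image = CIImage(cvPixelBuffer: buffer)
    let scaleX = CGFloat(side) / image.extent.width
    let scaleY = CGFloat(side) / image.extent.height

    // Non-uniform scale squeezes the frame into the square model input.
    let scaled = image.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

    guard let resizedBuffer = createPixelBuffer(width: side, height: side) else { return nil }
    ciContext.render(scaled, to: resizedBuffer)
    return resizedBuffer
}
```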

Another problem emerges: the DeepLab model outputs an MLMultiArray, which cannot be rendered directly into an image. However, we need an image with the person masked out so that the CIFilter in our initial routine can crop out the background. To circumvent this issue, we will create a custom Metal shader that renders the required image, and pass the MLMultiArray to the shader through a buffer. Here is how to create that buffer.
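Something along these lines: the Metal buffer simply copies the raw class indices out of the MLMultiArray so the GPU can read them. The Int32 data type and the 513x513 shape are assumptions based on the DeepLabV3 model description:

```swift
// Wraps the CoreML segmentation output in a Metal buffer for the compute shader.
// Assumption: the DeepLabV3 output is a 513x513 MLMultiArray of Int32 class indices.
private func makeBuffer(from multiArray: MLMultiArray) -> MTLBuffer? {
    guard multiArray.dataType == .int32 else { return nil }
    let length = multiArray.count * MemoryLayout<Int32>.stride
    // makeBuffer(bytes:length:) copies the data, about 1 MB per frame at 513x513.
    return device.makeBuffer(bytes: multiArray.dataPointer, length: length, options: [])
}
```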

Now for the Metal shader, which we are going to use to create the image for the background removal. It has an output texture: this is the mask that we are going to use for the background replacement later on. Additionally, the shader takes the segmentation mask buffer as an input and the current compute group. Compute groups are used to dispatch small portions of the final texture as parallel operations. The shader maps the segmentation mask onto the current pixel and outputs a white pixel if the mask predicts a person; otherwise, the pixel will be black. Since the model segments the image into more classes than just background and foreground, we will need to extract the class index of the class person. In the meta information about our model, Xcode tells us the following:

{"labels": ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningTable", "dog", "horse", "motorbike", "person", "pottedPlant", "sheep", "sofa", "train", "tvOrMonitor"], "colors": [...]}

Since we want to crop out persons, we will look for the class index 15 in our segmentation mask shader. In this way, we will get an alpha mask texture for our person.
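A sketch of such a kernel. It is embedded here as a Swift string so that it can be compiled at runtime with makeLibrary(source:options:), matching the constructor above; the original project may well ship it as a regular .metal file instead:

```swift
static let maskShaderSource = """
#include <metal_stdlib>
using namespace metal;

// Class index of "person" in the DeepLabV3 label list shown above.
constant int personClass = 15;

kernel void segmentationMask(texture2d<float, access::write> outTexture [[texture(0)]],
                             constant int *segmentation [[buffer(0)]],
                             uint2 gid [[thread_position_in_grid]]) {
    // Guard against compute grids that extend beyond the texture.
    if (gid.x >= outTexture.get_width() || gid.y >= outTexture.get_height()) {
        return;
    }

    // The segmentation buffer is a row-major 513x513 array of class indices.
    uint index = gid.y * outTexture.get_width() + gid.x;
    float value = (segmentation[index] == personClass) ? 1.0 : 0.0;

    // White where a person was detected, black everywhere else.
    outTexture.write(float4(value, value, value, 1.0), gid);
}
"""
```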

We declare three more helper functions: one to create a new pixel buffer for intermediate rendering, one to render the Metal texture into a pixel buffer, and one to dispatch the earlier-mentioned compute groups.
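Sketches of these three helpers. The vertical flip when reading the Metal texture back may or may not be needed depending on how the rest of your pipeline is oriented:

```swift
// Creates an empty BGRA pixel buffer for intermediate rendering.
private func createPixelBuffer(width: Int, height: Int) -> CVPixelBuffer? {
    var buffer: CVPixelBuffer?
    let attributes: [CFString: Any] = [
        kCVPixelBufferMetalCompatibilityKey: true,
        kCVPixelBufferCGImageCompatibilityKey: true
    ]
    CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                        kCVPixelFormatType_32BGRA, attributes as CFDictionary, &buffer)
    return buffer
}

// Renders a Metal texture into a pixel buffer via Core Image.
private func render(texture: MTLTexture, to buffer: CVPixelBuffer) {
    guard let image = CIImage(mtlTexture: texture, options: nil) else { return }
    // Metal's texture origin is top-left, Core Image's is bottom-left, so flip vertically.
    let flipped = image.transformed(by: CGAffineTransform(scaleX: 1, y: -1)
        .translatedBy(x: 0, y: -image.extent.height))
    ciContext.render(flipped, to: buffer)
}

// Splits the output texture into threadgroups and dispatches the compute kernel.
private func dispatch(encoder: MTLComputeCommandEncoder, for texture: MTLTexture) {
    let groupSize = MTLSize(width: 16, height: 16, depth: 1)
    let groupCount = MTLSize(width: (texture.width + groupSize.width - 1) / groupSize.width,
                             height: (texture.height + groupSize.height - 1) / groupSize.height,
                             depth: 1)
    encoder.dispatchThreadgroups(groupCount, threadsPerThreadgroup: groupSize)
}
```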

Finally, we can create our model input from the camera image, run the CoreML inference, convert the output segmentation mask into a texture buffer, turn that into a black-and-white mask image with our Metal shader, and return the mask. We use an isBusy flag to run only one inference at a time. In this way, the hosting stream can always call the inference and gets an updated mask once the routine has finished the previous frame.
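Putting it together, the routine could look like the sketch below. It assumes the generated model interface exposes prediction(image:) and a semanticPredictions multi-array, which is what Apple's published DeepLabV3 models generate (verify this against the interface Xcode shows for your model version), and it reuses the helpers defined above:

```swift
// Runs one CoreML inference and converts the result into a black-and-white mask image.
func generateMask(for webcamBuffer: CVPixelBuffer) -> CIImage? {
    // Only one inference at a time; in production this check should be made thread-safe.
    guard !isBusy else { return nil }
    isBusy = true
    defer { isBusy = false }

    // 513x513 texture the shader writes the mask into.
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .bgra8Unorm,
                                                              width: 513,
                                                              height: 513,
                                                              mipmapped: false)
    descriptor.usage = [.shaderWrite, .shaderRead]

    guard let resized = resize(webcamBuffer),
          let output = try? model.prediction(image: resized),
          let segmentationBuffer = makeBuffer(from: output.semanticPredictions),
          let maskTexture = device.makeTexture(descriptor: descriptor),
          let commandBuffer = commandQueue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder()
    else { return nil }

    encoder.setComputePipelineState(pipelineState)
    encoder.setTexture(maskTexture, index: 0)
    encoder.setBuffer(segmentationBuffer, offset: 0, index: 0)
    dispatch(encoder: encoder, for: maskTexture)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Read the texture back into a pixel buffer and scale the mask to webcam resolution.
    guard let maskBuffer = createPixelBuffer(width: 513, height: 513) else { return nil }
    render(texture: maskTexture, to: maskBuffer)
    let scaleX = CGFloat(CVPixelBufferGetWidth(webcamBuffer)) / 513
    let scaleY = CGFloat(CVPixelBufferGetHeight(webcamBuffer)) / 513
    return CIImage(cvPixelBuffer: maskBuffer)
        .transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))
}
```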

We have successfully implemented a custom plugin that can be built and saved under /Library/CoreMediaIO/Plug-Ins/DAL/. From there, applications such as your browser can access the plugin as a camera device. Once the stream is requested by a hosting application, the plugin will continuously run model inference and replace the background with an image of your choice.

The complete project is available on GitHub.
