WWDC Highlights Part 3 – Not Banana App Using Core ML
This WWDC series highlights new technologies and insights I took away from WWDC 2017. My second post explored the new drag and drop APIs. In this post, I demonstrate how easy it is to use the new Core ML framework.
The Silicon Valley “not hotdog” reference was used many times at WWDC while showing off Core ML. You are probably tired of the reference, but I wanted to show just how easy it is to create an app using Core ML along with a device’s camera to predict what objects are being captured. Since we all have a fondness for bananas around here, I decided to create a “Not Banana” App using Core ML – Apple’s new machine learning framework.
Let’s get started. Create a new Single View App and limit device orientation to portrait (no need for landscape in this app).
Camera Setup
Since we will be using the user’s camera, we need to start by adding a camera usage description to the app’s Info.plist.
NSCameraUsageDescription
Requires camera to detect banana
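If you prefer editing Info.plist as source code, the entry looks like this:

<key>NSCameraUsageDescription</key>
<string>Requires camera to detect banana</string>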
Next, we need to authorize the app to use the camera. We can create the following method that can be called from ViewController.viewDidLoad().
func authorizeCameraIfNeeded(completion: @escaping () -> ()) {
    let authorizationStatus = AVCaptureDevice.authorizationStatus(for: .video)
    switch authorizationStatus {
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .video, completionHandler: { (_) in
            DispatchQueue.main.async {
                completion()
            }
        })
        return
    case .denied:
        print("Error: Camera access denied")
    case .restricted:
        print("Error: Camera access restricted")
    default:
        break
    }
    completion()
}
To start displaying camera frames, we need to set up an AVCaptureSession and add the back camera as input.
let captureSession = AVCaptureSession()

func setupCameraInput() {
    let deviceDiscoverySession = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: .video, position: .back)
    if let backCamera = deviceDiscoverySession.devices.first,
        let cameraInput = try? AVCaptureDeviceInput(device: backCamera) {
        if self.captureSession.canAddInput(cameraInput) {
            self.captureSession.addInput(cameraInput)
        } else {
            print("Error: Failed to add camera input")
        }
    } else {
        print("Error: Failed to get camera")
    }
}
For displaying camera output in the UI, we will use an AVCaptureVideoPreviewLayer. We can add this in ViewController.viewDidLoad() as well.
var previewLayer: AVCaptureVideoPreviewLayer!

func addPreviewLayer() {
    previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    previewLayer.videoGravity = .resizeAspectFill
    previewLayer.frame = view.bounds
    view.layer.insertSublayer(previewLayer, at: 0)
}
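Since view.bounds may not be final by the time viewDidLoad() runs, it can also help to keep the preview layer in sync with the view's layout. A small optional addition:

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    // Keep the preview layer sized to the view's current bounds
    previewLayer?.frame = view.bounds
}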
Note: If you do not want the device to go idle while using the app, you can add the following.
UIApplication.shared.isIdleTimerDisabled = true
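If you go that route, one option (a sketch, not part of the setup above) is to scope it to this screen:

override func viewWillAppear(_ animated: Bool) {
    super.viewWillAppear(animated)
    // Keep the screen awake while the camera preview is visible
    UIApplication.shared.isIdleTimerDisabled = true
}

override func viewWillDisappear(_ animated: Bool) {
    super.viewWillDisappear(animated)
    UIApplication.shared.isIdleTimerDisabled = false
}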
Then, we need to start the capture session. ViewController.viewDidLoad() should look something like this.
override func viewDidLoad() {
    super.viewDidLoad()
    addPreviewLayer()
    authorizeCameraIfNeeded {
        self.setupCameraInput()
        self.captureSession.startRunning()
    }
}
If you launch the app on a device with a back camera, a permissions dialog should be displayed. After authorizing the app to use the camera, you should see the camera feed displayed full screen. Now that we are displaying the camera to the user, we need to get a CVPixelBuffer from the capture session using AVCaptureVideoDataOutput.
func setupVideoOutput() {
    let videoOutput = AVCaptureVideoDataOutput()
    videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "com.gorillalogic.samplebuffer"))
    if captureSession.canAddOutput(videoOutput) {
        captureSession.addOutput(videoOutput)
    }
}
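Remember to call setupVideoOutput() before the session starts running; the authorization completion in viewDidLoad() might now look something like this:

authorizeCameraIfNeeded {
    self.setupCameraInput()
    self.setupVideoOutput()
    self.captureSession.startRunning()
}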
For this app, we will only be using one AVCaptureVideoDataOutputSampleBufferDelegate method (to grab the sample buffer output).
extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }
        // TODO: Do something with the pixelBuffer!
    }
}
Core ML Integration
Now that we have a pixel buffer, we can start the Core ML integration. We first need a Core ML model to predict what objects are in the images we pass it. Apple has listed a few on its developer site. For this app, I chose the ResNet50 model. Simply download the model and drag it into the project, making sure it is added under Target Membership. Once added, Xcode generates a Resnet50 class, and we can create an objectModel to use for predictions.
let objectModel = Resnet50()
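As a quick sanity check, the class Xcode generates exposes a throwing prediction(image:) method; for classifier models like this one, the output should also include a classLabelProbs dictionary alongside classLabel (worth verifying against the generated interface in your project). Something like:

// Sketch: somePixelBuffer is a placeholder for a 224x224 CVPixelBuffer
if let result = try? objectModel.prediction(image: somePixelBuffer) {
    print("Top label: \(result.classLabel)")
    print("Confidence: \(result.classLabelProbs[result.classLabel] ?? 0)")
}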
If we try to get a prediction directly from the pixelBuffer in our AVCaptureVideoDataOutputSampleBufferDelegate method, we will see an error in the logs, something like “Image is not valid width 224, instead is 1920”. The reason is that the Resnet50 model requires a 224×224 image as its input. There are a few different ways to resize the pixel buffer. I chose to convert it to a CIImage, resize it, and then render it back to a CVPixelBuffer. I added this extension.
extension CIImage {
    func resizedImageBuffer(to size: CGSize, context: CIContext) -> CVPixelBuffer? {
        let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
                     kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue] as CFDictionary
        var pixelBuffer: CVPixelBuffer?
        let status = CVPixelBufferCreate(kCFAllocatorDefault,
                                         Int(size.width),
                                         Int(size.height),
                                         kCVPixelFormatType_32ARGB,
                                         attrs,
                                         &pixelBuffer)
        if status != kCVReturnSuccess {
            return nil
        }
        // Note: We could optimize further, but this works for our needs
        let scale = size.height / self.extent.size.height
        let resizedImage = self.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
            .cropped(to: CGRect(x: 0, y: 0, width: size.width, height: size.height))
            .oriented(.right)
        CVPixelBufferLockBaseAddress(pixelBuffer!, CVPixelBufferLockFlags(rawValue: 0))
        context.render(resizedImage, to: pixelBuffer!)
        CVPixelBufferUnlockBaseAddress(pixelBuffer!, CVPixelBufferLockFlags(rawValue: 0))
        return pixelBuffer
    }
}
Then we can use it in our delegate method.
let context = CIContext() // Added to ViewController

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }
    let inputSize = CGSize(width: 224.0, height: 224.0)
    let image = CIImage(cvImageBuffer: pixelBuffer)
    let resizedPixelBuffer = image.resizedImageBuffer(to: inputSize, context: context)
    // TODO: Pass the resized pixel buffer to the Core ML model
}
For the final todo, we could simply print the results from the Resnet50 prediction, but since this app is all about detecting bananas, let’s add an image view that swaps between a banana image and a not banana image.
I set the banana image as the image view's highlighted image and the not banana image as its normal image. I also added a visual effect view to cover the area of the screen that falls outside the 1:1 region processed by the model (a rough sketch of this setup follows below). Finally, I created a method that updates the banana image view's highlighted state based on the Resnet50 prediction for a given CVPixelBuffer.
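Wiring up the image view might look something like this. Only the bananaImageView outlet is required by the method that follows; the asset names here are placeholders I made up:

@IBOutlet weak var bananaImageView: UIImageView!

func setupBananaImageView() {
    // Placeholder asset names - swap in your own banana / not-banana images
    bananaImageView.image = UIImage(named: "not-banana")
    bananaImageView.highlightedImage = UIImage(named: "banana")
}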
func updateBananaState(from pixelBuffer: CVPixelBuffer?) {
    guard let pixelBuffer = pixelBuffer else {
        return
    }
    let prediction = try? objectModel.prediction(image: pixelBuffer)
    let isBanana = prediction?.classLabel == "banana"
    DispatchQueue.main.async {
        if isBanana == self.bananaImageView.isHighlighted {
            return
        }
        self.bananaImageView.isHighlighted = isBanana
    }
}
Replace the final TODO with a call to this method.
// TODO: Pass the resized pixel buffer to the Core ML model
updateBananaState(from: resizedPixelBuffer)
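One thing to keep in mind: the delegate fires for every frame, so this runs a full Resnet50 prediction roughly 30 times per second. If that turns out to be heavy on a given device, a simple frame counter is one way to throttle it (a sketch, not part of the app as written):

var frameCounter = 0 // Added to ViewController

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    frameCounter += 1
    guard frameCounter % 10 == 0, // Only predict on every 10th frame
          let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }
    // ... resize and call updateBananaState(from:) as before
}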
Launch the app.
Pretty cool what you can do with minimal effort using Core ML! Check out the Core ML session videos from WWDC 2017 to learn more.