Alright, Mr. DeMille, I'm ready for my closeup!

Detecting close-ups in an effort to analyze film language.

The main goal here is to build a piece of software that, when fed a video, detects closeups and generates a new video containing only those closeups. I knew Python was my best bet for library availability and for quickly snapping together a prototype. I ended up using Hy as well, as I've become fond of it. A few months ago I tried doing this with a Viola-Jones detector, and the results were not satisfactory: on top of being slow, it needs faces to be front-facing, which meant too many close-ups went undetected.

Enter MTCNN (Multi-task Cascaded Convolutional Networks). This was a game changer, as it detects faces at more angles and under more lighting conditions, in exchange for slightly more false positives. My first test with it was much more accurate, but also far slower than the already too-slow Viola-Jones detector. I finally came across a PyTorch implementation of MTCNN (facenet_pytorch) with CUDA support, and at last a viable option for both accuracy and speed presented itself.

The first pass at the code was straightforward: it takes the frame size and the bounding box of the detected face, and compares the ratio of their areas against a THRESHOLD, initially set at 0.087 after some experimentation. We take advantage of MTCNN's default behavior of returning the largest face first, so if any faces are found, the first entry is our closeup contender. That way only one box has to be tested and drawn, saving us a few draw calls.

(setv THRESHOLD 0.087)
(defn detect-closeup [frame]
  (let [(, height width channels) frame.shape
        detections (.detect mtcnn frame)
        (, bounding-boxes conf) detections
        has-closeup? (if (not (is bounding-boxes None))
                         (let [(, x1 y1 x2 y2) (get bounding-boxes 0)
                               area (* (- x2 x1) (- y2 y1))
                               exceeds-threshold? (> (/ area (* width height)) THRESHOLD)]
                              ;; Green box for a closeup, red otherwise (OpenCV uses BGR).
                              (.rectangle cv2 frame
                                          (, (int x1) (int y1))
                                          (, (int x2) (int y2))
                                          (if exceeds-threshold? (, 0 255 0) (, 0 0 255))
                                          ;; The snippet was cut off here; the closing lines
                                          ;; are reconstructed from the full listing further down.
                                          2)
                              exceeds-threshold?)
                         False)]
       has-closeup?))

Mannequin, cat, photograph, spy

0.087 acts as the threshold: the minimum fraction of the frame that a face must occupy to count as a closeup. The resizing is also worth pointing out. Detection runs on a downscaled copy of each frame, since not much resolution is needed: even at postage-stamp size, a close-up makes for a highly detectable face.
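
To make the threshold concrete, here is a minimal sketch of the coverage test in plain Python (rather than Hy, so it doesn't depend on a particular Hy version). The names `face_coverage` and `is_closeup` and the sample boxes are made up for illustration; only the 0.087 threshold and the area ratio come from the code in this post.

```python
THRESHOLD = 0.087  # minimum fraction of the frame a face must cover


def face_coverage(box, frame_width, frame_height):
    """Return the fraction of the frame area covered by a bounding box.

    `box` is (x1, y1, x2, y2) in pixels, the layout assumed for the
    detector output in this post.
    """
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (frame_width * frame_height)


def is_closeup(box, frame_width, frame_height, threshold=THRESHOLD):
    """True when the face box covers more than `threshold` of the frame."""
    return face_coverage(box, frame_width, frame_height) > threshold


# Hypothetical boxes on a 480x270 frame (1920x1080 scaled by 0.25).
print(is_closeup((100, 40, 260, 230), 480, 270))   # 160x190 box -> True
print(is_closeup((200, 100, 240, 150), 480, 270))  # 40x50 box -> False
```

Because the test is a ratio of areas, it gives the same answer on the downscaled frame as it would on the full-size one, which is why shrinking the frame before detection doesn't require changing the threshold.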

Here is the result of running the film "Diamonds Are Forever" through it:

Some of the false positives were interesting. A cat, for example, might at first seem like an obvious misfire, but it raises the question: does a closeup of a cat count? And for that matter, what about a mannequin, or an insert of a photograph of a closeup?

When running a feature-length film through the program, some of the limitations of MTCNN became more apparent. The software struggled with faces clipped by the edge of the frame, and with characters touching their faces. Drastic angles and lighting also proved to be major pitfalls. As I monitored the program, some obvious closeups went by undetected or, more commonly, popped in and out of detection. This was more problematic in "Live and Let Die".

I approached this by adding 'striding' to the application, an approach taken from a FastMTCNN example online. The idea is that the frames right after a close-up (up to a stride value) are very likely to be close-ups too, with the face having barely moved, which saves you from detecting faces in the in-between frames. In practice, the speed gains weren't worth the tail frames appended to close-ups whose length didn't align with the stride value. I've left the code in, but the stride defaults to zero. A better approach would be to backtrack and fill in fleeting non-closeups. I also took this opportunity to separate closeup detection from drawing previews and rendering, and to push the writing of frames out of the main loop into asyncio tasks.
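
The backtracking idea is not implemented in the program. As a rough sketch of how it might look, the Python below assumes the detection results are first collected as one boolean per frame, then fills short runs of misses that sit between two hits. The function name `fill_short_gaps` and the `max_gap` parameter are hypothetical.

```python
def fill_short_gaps(flags, max_gap):
    """Return a copy of `flags` with short False runs between True frames filled.

    `flags` holds one boolean per frame (True = closeup detected). Only
    gaps of at most `max_gap` frames with a closeup on both sides are
    filled; leading and trailing gaps are left alone.
    """
    filled = list(flags)
    last_true = None  # index of the most recent closeup frame
    for i, flag in enumerate(flags):
        if not flag:
            continue
        if last_true is not None:
            gap = i - last_true - 1
            if 0 < gap <= max_gap:
                for j in range(last_true + 1, i):
                    filled[j] = True
        last_true = i
    return filled


detections = [True, True, False, True, False, False, False, True, False]
print(fill_short_gaps(detections, max_gap=2))
# [True, True, True, True, False, False, False, True, False]
```

Unlike striding, this never appends tail frames after a closeup ends, because a gap is only filled once a later detection confirms the closeup is still on screen. The cost is that frames can't be written out until the gap has been resolved.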

This is a space I want to keep exploring, and I have some thoughts on improvements, but I think it will remain flawed. The percentage of a frame occupied by a detected face just doesn't do justice to what is or isn't a closeup. Often, opposing shots are matched in their specifications: the camera height, the camera distance to the subject and the lens. If one subject is physically smaller, is their shot any less of a closeup? Or what if a face fills that area but serves more as the foreground to a medium or wide shot of the protagonist? All this without yet mentioning the technical limitations of capturing a face under the many conditions a film presents.

The "data"

For all the reasons already mentioned, take the following with a grain of salt, but what's the point without some analysis 😅.

Both films, "Diamonds Are Forever" and "Live and Let Die", were directed by Guy Hamilton two years apart, in 1971 and 1973 respectively. They represent the transition between two of the most famous Bonds, Sean Connery and Roger Moore, and have vastly different tones.

"Diamonds Are Forever" is 02:00:08 long (HH:MM:SS, not timecode), and its closeup extraction had a duration of 00:02:55. Approximately 2.43% of the film is closeup.

"Live and Let Die" runs only slightly longer at 02:01:38, and its closeup extraction is nearly double the other film's, at 00:05:46. Here 4.74% of the film is closeup. That said, a difference of about 2.3 percentage points isn't massive, and in many ways should be expected: the same director, the same genre, only slight deviations in visual language.

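The percentages above are just the extraction runtime divided by the film runtime. As a small sketch of that arithmetic in Python (the helper names `to_seconds` and `closeup_share` are made up here; the runtimes are the ones quoted above):

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def closeup_share(closeup_runtime, film_runtime):
    """Return the percentage of the film's runtime occupied by closeups."""
    return 100 * to_seconds(closeup_runtime) / to_seconds(film_runtime)


films = {
    "Diamonds Are Forever": ("00:02:55", "02:00:08"),
    "Live and Let Die": ("00:05:46", "02:01:38"),
}
for title, (closeups, runtime) in films.items():
    print(f"{title}: {closeup_share(closeups, runtime):.2f}%")
# Diamonds Are Forever: 2.43%
# Live and Let Die: 4.74%
```
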
The code is ready for its closeup.

For those interested in the implementation itself, here's the code. Keep in mind that Hy is under active development, so this may only apply to version 1.0a4.

(import cv2)
(import sys)
(import torch)
(import numpy)
(import asyncio)
(import facenet_pytorch [MTCNN])
(import contextvars)

(setv out (.ContextVar contextvars "video output"))

(setv device (if (.is_available torch.cuda) "cuda" "cpu"))
(print device)
(setv mtcnn (MTCNN :device device))
(setv fourcc (.VideoWriter_fourcc cv2 "M" "J" "P" "G"))

(setv THRESHOLD .087)
(setv SCALE .25)
(setv STRIDE 0)

(defn/a async-video-writer [queue]
  (let [frame (await (.get queue))]
       (.write (.get out) frame)
       (.task_done queue)))

(defn detect-closeup [frame]
  ;; Returns the bounding boxes when the largest face covers more than
  ;; THRESHOLD of the frame, otherwise None.
  (let [(, height width channels) frame.shape
        (, bounding-boxes conf) (.detect mtcnn frame)]
       (if (is bounding-boxes None)
           None
           (let [(, x1 y1 x2 y2) (get bounding-boxes 0)
                 area (* (- x2 x1) (- y2 y1))
                 coverage (/ area (* width height))]
                (if (> coverage THRESHOLD) bounding-boxes None)))))

(defn preview-frame-with-bounds [frame bounding-boxes]
      (if (not (is bounding-boxes None))
          (let [(, height width channels) frame.shape  
               (, x1 y1 x2 y2) (get bounding-boxes 0)
               area (* (- x2 x1) (- y2 y1))
               coverage (/ area (* width height))
               exceeds-threshold? (> coverage THRESHOLD)]
               (.rectangle cv2 frame
                           (, (int x1) (int y1))
                           (, (int x2) (int y2))
                           (if exceeds-threshold? (, 0 255 0) (, 0 0 255)) 2)))
      (.imshow cv2 "Frame" frame))

(defn/a process-file [file]
  (let [cap (.VideoCapture cv2 file)
        queue (.Queue asyncio) ;; Queue that holds the workload
        tasks []
        stride-counter 0 ;; frames left to reuse active-bounds without detecting
        active-bounds None]
       ;; Properties 3 and 4 are the capture's frame width and height.
       (.set out (.VideoWriter cv2 "out.avi" fourcc 23.98
                               (, (int (.get cap 3)) (int (.get cap 4))) True))
       (while (.isOpened cap)
         (let [(, ret frame) (.read cap)]
              (if (not ret) (break)) ;; end of file or read error
              (let [lo-res-frame (.resize cv2 frame (, 0 0) :fx SCALE :fy SCALE)]
                   (if (= stride-counter 0)
                       ;; Detect; a hit is reused for the next STRIDE frames.
                       (do (setv active-bounds (detect-closeup lo-res-frame))
                           (if (not (is active-bounds None))
                               (setv stride-counter STRIDE)))
                       (-= stride-counter 1))
                   (preview-frame-with-bounds lo-res-frame active-bounds)
                   (if (not (is active-bounds None))
                       ;; Queue the full-resolution frame (still BGR, as VideoWriter expects).
                       (do (.put_nowait queue frame)
                           (.append tasks (.create_task asyncio (async-video-writer queue)))))
                   (.waitKey cv2 1)
                   (await (.sleep asyncio 0))))) ;; let queued writer tasks run
       (await (.join queue)) ;; Queue has completed
       (.release cap)
       (.release (.get out))
       (.destroyAllWindows cv2)))

(defn/a main [argv]
  (let [file (get argv 1)]
       (await (process-file file))))

(if (= __name__ "__main__")
    (.run asyncio (main sys.argv)))

I hope to implement some refinements to this app. More importantly, despite the limitations of computer vision, I remain excited and optimistic about the role AI and computer vision can play in analyzing film language.