Abstract
For patients with facial paralysis, the wait for return of facial function and the resulting vision risk from poor eye closure, difficulty speaking and eating from flaccid oral sphincter muscles, as well as the psychological morbidity from the inability to smile or express emotions through facial movement can be devastating. There are limited methods to assess ongoing facial nerve regeneration: clinicians rely on subjective descriptions, imprecise scales, and static photographs to evaluate facial functional recovery and thus facial nerve regeneration remains poorly understood. We propose a more precise evaluation of dynamic facial function through video-based machine learning analysis which would facilitate a better understanding of the sometimes subtle onset of facial nerve recovery and improve guidance for facial reanimation surgery. Specifically, we here present machine learning methods employing likelihood ratio tests, optimal transport theory, and Mahalanobis distances to: 1) assess the use of defined facial landmarks for binary classification of different types of facial palsy; 2) identify regions of asymmetry and potential paralysis during specific facial cues; and 3) determining severity of abnormal facial function when compared to a reference class of normal facial function. Our work presents promising results of utilizing videos, rather than static photographs, to provide robust quantitative analyses of dynamic properties for various facial movements without requiring manual assessment. The long-term potential of this project is to enable clinicians to have more accurate and timely information to make decisions for facial reanimation surgery which will have drastic consequences on quality of life for affected patients.
I. Introduction
There are an estimated over 500,000 cases of peripheral nerve transection requiring repair annually in the United States [1] and peripheral nerve injuries cost the US healthcare system $150 billion annually [2]. Despite this, the current understanding of the spectrum of nerve injury and regenerative outcomes is based on clinical observational data alone with unfortunately little objective data to guide surgical intervention [3]. Acquired facial paralysis occurs in over 150,000 Americans every year from a variety of causes: Bell’s Palsy, benign or malignant tumors, trauma, infections and neurologic and iatrogenic injuries [4].
While facial nerve (FN) recovery is robust in Bell’s Palsy and should be managed conservatively, for many other conditions, the prognosis is much more uncertain and may take months to years to complete. In such cases, a critical time window for successful facial reanimation surgery is passing (typically after 18-24 months of facial paralysis muscle reinnervation through a nerve transfer is unlikely to be successful due to muscle degeneration and atrophy [5]). As detecting the onset of nerve regeneration and muscle reinnervation can be subtle and difficult to assess clinically, determining prognosis for FN recovery represents a black-box scenario with limited data to guide clinicians and patients. In the setting of no overt visible facial movement, the choice to proceed with facial reanimation surgery (such as a cranial nerve V-VII transfer) is fraught with difficulty [6] [7] [8] [9].
There are limited objective methods to quantify FN function; clinicians rely on subjective descriptions, imprecise scales, and static photographs to evaluate facial functional recovery [10] [11] [12] [13] [14]. There are no current imaging modalities to detect the onset, progression (or failure) of nerve regeneration [3]. The lack of objective FN outcomes data and lengthy recovery times has hampered the field of facial reanimation surgery for some time [15]; for example, only 24% of patients requiring FN sacrifice during parotid tumor excision receive FN repair [16]. For patients with facial paralysis, the wait for return of facial function and the resulting vision risk from poor eye closure, difficulty speaking and eating from flaccid oral sphincter muscles, as well as the psychological morbidity from the inability to smile or express emotions through facial movement, can be devastating.
A. Current Methods to Quantify and Track Facial Nerve Function
The current methods for quantifying facial nerve (FN) function primarily include clinical grading systems such as House-Brackmann (HB) score and the Sunnybrook Facial Grading System, and collection of facial electrophysiology signals. While these tools have been widely used, they do have limitations in capturing the full spectrum of FN palsy.
House-Brackmann (HB) Score
The HB system was approved by the American Academy of Otolaryngology-Head and Neck Surgery FN dysfunction committee as the reference standard for grading facial palsy [17]. The system uses a six-point scale, where grade I corresponds to normal and grade VI to complete flaccid paralysis. However, it relies on subjective clinical judgment, leading to potential interobserver variability. The discrete grading may not capture subtle changes, making it less sensitive to nuanced improvements or deteriorations.
Sunnybrook Facial Grading System
The Sunnybrook facial grading system comprises a regional scale involving facial symmetry at rest, voluntary movements, and synkinesis, or unwanted contractions of muscles during attempted movement. The composite score ranges from 0 to 100, where 100 corresponds to normal facial function and 0 corresponds to complete paralysis. The Sunnybrook classification system has been reported to assess facial synkinesis, involuntary movements, better than the HB systems [18]. Like the HB score, it involves subjective clinical assessment, and the scores may not fully reflect patients’ perceptions of their facial function.
Electrophysiological methods
Objective measures that collect facial electrophysiology, such as needle or evoked electromyography (EMG), are invasive, painful, and may cause patient discomfort. EMG measurements vary based on practitioner experience by up to 20% [19], are logistically difficult to arrange and track over time, may be impacted by edema in post-surgical or trauma patients, and do not inform prognosis of recovery in if complete facial paralysis is present (HB VI). Furthermore, EMG may be effective in detecting significant changes in facial muscle activity, but there have not yet been studies that show it is sensitive enough to track subtle or gradual onset of facial palsy. Therefore, electrophysiological methods are less suitable for monitoring early stages of FN dysfunction or recovery.
B. Applications of Machine Learning in Analysis of Facial Palsy
Machine learning (ML) algorithms for facial analysis have advanced significantly in the past few decades [20] [21] but are based on training datasets of intact facial function. Application of these tools to facial paralysis patients photographs was reported to cause significant landmark inaccuracy that precludes clinical usage [22] [23]. The most advanced method of automated facial analysis in facial paralysis patients, Emotrics, is limited to static photos only and requires significant manual landmark adjustments [23]. Other attempts to quantify facial function have incorporated proprietary marketing software to estimate facial emotional expressions [24] [25] [26] [27] or were limited to a single surgical case report [28] following surgical interventions. An objective, open-source, rigorous quantification of dynamic facial function in videos has remained elusive [28] [29] [30], as there are multiple challenges to accumulating a sufficiently large training dataset to improve landmark accuracy in the facial paralysis population.
There is an unmet need to develop a more precise evaluation of dynamic facial function to facilitate a better understanding of the spectrum of facial palsy, improve guidance for and outcomes following facial reanimation surgery, and potentially detect the subtle onset of FN recovery (for which there currently is no reliable test [15]). Contrary to contemporary beliefs regarding the limitations of ML algorithms, the latest open-source computer vision and ML algorithms available through Python (OpenCV, dlib) [21] [31] on facial palsy patients recently revealed surprisingly robust landmark accuracy despite significant facial asymmetry [32]. Specifically, a recent finding demonstrated a significant reduction in landmark error rate in videos compared to photos in a standardized data set across all facial palsy severity types, in part because the volume of data available from a video of a standard FN exam is orders of magnitude larger than static photography (from 8 photos to 3600 frames for a 60 second video recorded at 60fps) [32].
Building upon recent accomplishments in reducing landmark error rates in videos compared to photos, we here present machine learning methods employing likelihood ratio tests, optimal transport theory, and Mahalanobis distances to: 1) assess the use of defined facial landmarks for binary classification of different types of facial palsy; 2) identify regions of asymmetry and potential paralysis during specific facial cues; and 3) determining severity of abnormal facial function when compared to a reference class of normal facial function. The paper concludes with a discussion of the results and future directions of the presented work.
II. Materials and Methods
A. Data Collection
Videos of normal subjects and patients clinically diagnosed with facial palsy were prospectively gathered. The Massachusetts Eye and Ear Infirmary (MEEI) Facial Palsy Photo and Video Standard Set [33] consists of videos from 50 subjects and the University of California, San Diego (UCSD) Facial Nerve Database consists of videos from 15 patients clinically diagnosed with facial palsy. Thus, a total of 65 patients diagnosed with having abnormal facial function, from the combination of these two datasets, was used in this study. Additionally, videos of 50 healthy controls were gathered at Stanford University. Here, we provide a brief description of the two databases used, collection of healthy control data, experimental setup, and tasks. Figure 1a summarizes the data used in this study.
MEEI Facial Palsy Photo and Video Standard Set
This dataset is an open-source, hosted on the Sir Charles Bell Society website, standardized set of facial photographs and videos that captures the spectrum of flaccid and nonflaccid facial palsy [33]. Normal intake protocol was followed, including a clinician assessment of facial function (eFACE) as well as a full set of photographs and a video. Exclusion criteria included prior facial reanimation surgery or extensive facial scarring, currently active chemodenervation, and bilateral facial palsy. Cases from both flaccid and nonflaccid (aberrantly regenerated or synkinetic) states were included resulting in 25 flaccid palsy subjects and 25 synkinetic palsy subjects.
Subjects who consented to enroll in the standard set were categorized by their eFACE score into the following: normal, near-normal, mild, moderate, severe and complete flaccid or nonflaccid facial palsy. The degree of palsy was quantified using eFACE, House-Brackmann (HB), and Sunnybrook scales by two expert clinicians in the field. The fifty facial palsy subjects from this database were defined as having abnormal facial function and the remaining 10 subjects defined as having normal facial function for this study.
UCSD Facial Nerve Database
IRB approval was obtained from the Office of IRB Administration at UCSD prior to beginning this study. Patients who have facial palsy greater than a HB score of 1 are currently being recruited from the UCSD Facial Nerve Clinic. A total of 15 subjects with known facial palsy were used in this study, of which 11 have flaccid palsy and 4 have synkinetic palsy.
Healthy Controls
A total of 50 healthy controls were used in this study. Ten healthy controls (5 male and 5 female) that were included in the MEEI Facial Palsy Photo and Video Standard Set were used. The remaining 40 individuals (20 male and 20 female) consists of healthy volunteers, who have no facial movement disorders, recruited.
The Facial Nerve Exam
All the subjects included in this study performed the same standard facial nerve (FN) exam, in which the subject was requested to perform the following facial cue for a few seconds before returning back to face at rest.
Face at rest: used as baseline
Raise their eyebrows
Gently close their eyes
Forcefully close their eyes
Gentle smile
Full effort smile
Pucker their lip
Show their bottom teeth
Figure 1b presents the intended facial muscle activation for each of the given cues. The eyebrow raise cue aims to target frontalis muscle activation, the eye closure cue targets orbicularis oculi activation, the smile cue targets zygomaticus major and minor, levator anguli oris, and risorius activation, the lip pucker cue targets orbicularis oris activation, and the bottom teeth cue targets depressor inferioris and depressor anguli oris activation.
B. Data Processing
Facial Landmarks Extraction
The process to extract 38 facial landmarks from a face within one input image is given as follows [34]:
Extract the number of frames in the input video.
For each frame in the video:
Convert the color input image to gray scale.
Detect the face on the image using the open-source and publicly available dlib and OpenCV Python libraries.
Store the coordinates in the selected frame.
Save coordinates for all the frames in the video as a .csv file for future data processing.
Note that the MEE shape predictor is trained to detect 68 points, but only 38 of them were used in this work. These 38 points were reorganized, as seen in Figure 2, to proceed with the computational measures.
Segmenting Data for Specific Cues
The time points of when each FN exam cue started and ended in a subject’s video was observed and recorded. These time points were converted to specific frame numbers based on the frames per second of the inputted video. With these frame points indicating the start and end of each cue, the landmark coordinates could be segmented as such and stored as a Python dataframe for future processing.
Pre-Processing Steps
In order to account for camera drift:
The x and y coordinates of each facial landmark was subtracted from the average coordinates during the face at rest cue.
The x and y coordinates of every facial landmark was then subtracted from the midpoint of the subject’s face as identified by dlib.
C. Implementing supervised learning for hypothesis testing
Feature selection
In order to choose features that can probe aspects of facial symmetry, we extracted the x,y positional coordinates from the 38 landmarks and computed their difference relative to the midline of the face. An analogous procedure was done for the left and right y-position coordinates. Therefore, each patient would have a Y ∈ ℝn×d, where n = 38 and d represents the number of frames in the patient video. We observe statistically significant differences (p < 0.001) in some key facial regions, such as the lip corner (oral commissure) and outer and inner lip features, between the average y-position difference of the abnormal patient class and the healthy controls, as shown in Figure 3a.
A similar process can be applied to the landmark coordinate data segmented at selected cues, resulting in . During each cue, we observe statistically significant differences in varying facial regions, with highest significance in context-specific regions of interest. For example, during the eyebrow raise cue, there is a significant difference between the two subject groups for the eyebrow features (p < 0.01) and upper eye features (p < 0.05) (Figure 3b). Whereas, during the smile cue, there is a significant difference between the patient groups for the outer and inner lip features (p < 0.001) and lip corner (oral commissure) features (p < 0.001) (Figure 3c).
1) Setup for binary classification
Our setup for binary classification can be described as having labeled data (Y (1),…, Y (k)), which are our arbitrary feature vectors for every subject in ℝk, and (H1, …, Hk) such that Hi ∈ {0, 1}, where 0 is our label for normal facial function and 1 for abnormal. The goal of binary classification is to find a decision function that allows us to use the input features Y (i) that provide the corresponding class labels Hi. This can be viewed as a hypothesis testing problem, where it is assumed that under H0, Y has a joint density fY (y; θ0) and under H1, Y has a joint density fY (y; θ1).
Both hypothesis classes were modeled as a multivariate Gaussian distribution, with mean μ and covariance matrix Σ, namely Y ∼ 𝒩(μ, Σ). Then, note that its density is given by We can utilize supervised learning to partition the samples for each class, where we use 70% of the dataset for training and the remaining 30% for testing. The training dataset has the labels for their associated classes and so we can calculate distribution parameters, θi, for every training sample under H0 and under H1.
2) Clustering probability distributions using Wasserstein barycenters
Optimal transport theory is a geometrically meaningful way of measuring distances between probability distributions, and it has recently become important in applications in data science and machine learning [35]. In this work, we utilize the concept of barycenters to find the “center of mass” of a collection of probability measures, where distances between each are determined by the optimal transport Wasserstein distance.
Consider a set P1, …, Pk of elements, with their associated weights λ1, …, λk, satisfying λi > 0 and , belonging to a metric space M. The barycenter P * on this metric space (M, dM) is defined as [36]: If we were to consider the set P1, …, Pk of elements to be probability distributions, where the metric space (M, dM) corresponds to the 2-Wasserstein distance, then P * becomes the Wasserstein barycenter [37]. when P1, …, Pk are all multivariate Gaussian distributions with parameters (μi, Σi), then the mean of the Wasserstein barycenter is simply and the covariance Σ* is the only positive definite matrix Σ satisfying the equation [38]: Using the probability distributions of each training samples in either hypothesis class, the Wasserstein barycenter was calculated to define distribution parameters, and , with for each distribution. Then, with the test dataset, where the label of the subject was unknown, a likelihood ratio test was performed on each sample, using the learned parameters, and from the training dataset. A likelihood ratio test was performed under the two Gaussian Wasserstein barycenters: This supervised learning method for hypothesis testing was used to classify between: 1) normal and abnormal subjects, 2) normal and synkinetic subjects, and 3) normal and flaccid subjects.
D. Identifying regions of dynamic facial asymmetry using specific facial cues
Pearson correlation coefficient to define facial symmetry
Pearson correlation coefficients are well-suited for capturing linear relationships between facial landmarks. The coefficients quantify the degree of linear dependence between two variables, making them effective in identifying proportional changes in facial features during dynamic expressions. Furthermore, the magnitude of the Pearson coefficient reflects the strength of the correlation, allowing for a quantitative assessment of how facial features move in tandem. The sign of the coefficient indicates the direction of the relationship, where a positive correlation implies a simultaneous increase or decrease in facial features, and a negative correlation suggests an inverse relationship.
From the 38 facial landmarks described in Figure 2 Pearson correlation coefficients were calculated for each of the corresponding right and left landmark, resulting in a coefficient vector, v ∈ ℝ19. These coefficient vectors can either be calculated to capture correlation across features for the entire duration of the FN exam or for the duration of a specific facial cue that was executed during the exam.
2) Mahalanobis distance as a metric to determine context specific asymmetry
To compute how much a patient deviates from normal facial function, the normalized Mahalanobis distance (MD) were calculated for each FN exam cue. This metric identifies which of the cues has the most significant dynamic asymmetry when compared to normal function and gives insight into the specific region and context, or facial cue, of asymmetry.
Though seven cues were provided in the FN exam, the two eye closure cues and the two smile cues were grouped together. Therefore, the five facial cues of interest include eyebrow raise, eye closure, smile, lip pucker, and showing bottom teeth. For each facial cue, the MD was used to measure the degree of error from normal facial function based on the 19 Pearson coefficients of the symmetric facial landmarks. The MD takes into account the variability and correlation structure of the data, making it suitable for multivariate analysis, since the Pearson coefficient vectors, v ∈ ℝ19. The following is how MD was computed for each cue:
The Pearson correlation coefficient vectors was calculated for all 50 normal subjects, V ∈ ℝ19×50
The mean of the Pearson coefficients for all normal subjects was calculated, μ ∈ ℝ19. This represents the average correlation values across all facial regions.
The covariance matrix of the Pearson coefficients for all normal subjects was determined, Σ ∈ ℝ19×19. This describes the relationships and variability among the different facial regions.
For the patient of interest, the MD was calculated at each of the 5 cues:
where X is the vector of Pearson coefficients for the patient of interest, μ is the mean vector of all the normal subjects, and Σ is the covariance matrix of all the normal subjects.
Once all the distances were calculated for every cue D = [Deyebrow−raise, Dsmile, …, Dpucker], the five distances were normalized allowing us to determine which cue resulted in maximum deviation from normal function.
The MD will provide a measure of how far away the abnormal patient’s data is from the average normal facial symmetry, considering the multivariate distribution of the normal subjects Pearson coefficients. A higher MD indicates that the abnormal patient’s facial function is more divergent from the normal range.
E. Using multivariate outlier detection to identify abnormal facial function
Wasserstein barycenter distance as metric to define abnormal facial function: The Wasserstein distance quantifies the minimum “work” required to transform one probability distribution into another. As it defines a metric space within the space of probability measures, it can be applied to compare the distribution of facial feature positions during expressions in individuals with varying degrees of facial palsy to a reference distribution representing normal facial function. To calculate the Wasserstein distance of one multivariate Gaussian distribution to another Gaussian, d := W2(𝒩(μ1, Σ1); 𝒩(μ2, Σ2)), the following closed-form equation can be used [35, eq 2.41]:
To extend the notion of the MD that was used for the v ∈ ℝ19 of Pearson correlation coefficients to the patient specific y-position coordinates, Wasserstein distances were calculated. This distance to the cue-specific barycenter of all healthy controls was calculated for each subject in this study (n = 115):
The cue in which a subject deviated most from the average normal function was identified by using normalized MDs. The cue with the largest MD was identified as having maximum deviation.
For this cue, the Wasserstein barycenter that summarizes the Gaussian distribution for all healthy controls was calculated. See (2), (3).
The Wasserstein distance between the individual subject’s Gaussian distribution at the cue and the Wasserstein barycenter for all normal subjects was calculated.
This distance serves as a representative measure of the dissimilarity between an individual’s facial movements and the typical facial expressions of the healthy population. We hypothesized that patients with abnormal facial function would have noticeably larger Wasserstein distances from the averaged normal barycenter at that cue. As such, the Wasserstein barycenter distance can be useful for outlier detection because it provides a measure of how far a particular distribution is from the central tendency represented by the barycenter.
To account for outliers in the healthy controls, Wasserstein distances that were above two standard deviations from the mean were not used for the following linear regression models. As such, seven healthy controls were removed for the model selections to follow.
2) Linear regression to predict clinical scores of facial palsy
Linear regression can be employed as a valuable tool for predicting clinical scores related to facial palsy, such as the HB and Sunnybrook scales. In this context, the Wasserstein distance from the barycenter of normal facial function serves as a unique and innovative feature that captures the dissimilarity between an individual’s facial movements and a reference distribution of normal facial expressions. Both the MEEI Facial Palsy Photo and Video Standard Set dataset and the UCSD Facial Nerve Database includes clinical scores, specifically the HB score, for individuals with facial palsy. The healthy controls were assigned a HB score of I since they exhibit normal facial function in all areas. As such, the calculated Wasserstein distances were used as the independent variable and the clinical scores are the dependent variable.
III. Results
A. Performance of binary classification of facial palsy
The receiver operating characteristic (ROC) curve offers one way to measure effectiveness of predicting by calculating the area under the curve (AUC), which can be interpreted as the probability that the test result from a randomly chosen abnormal individual is more indicative of paralysis than that from a randomly chosen normal individual [39]. To determine the reliability and performance of the classification, bootstrapping (n = 500) was used to provide a range of ROC curves, helping us understand how robust the model evaluation is to variations in the dataset. The average AUC, with a 95% confidence interval, was calculated for the three binary classification scenarios described above. Figure 4 presents the ROC curves along with the average AUC values for each scenario: 1) normal and abnormal subjects: 0.9262, 2) normal and synkinetic subjects: 0.9192 and 3) normal and flaccid subjects: 0.9419.
B. Patient-specific assessment of dynamic facial asymmetry is more robust during specific contexts
The Pearson coefficients were calculated at the five facial cues of interest for all 50 healthy controls, and the average values of all 19 coefficients that correspond to the y-position of the landmarks in Figure 2 are presented in Figure 5a. It is observed that all 19 features reveal high positive correlations between the right and left side y-coordinates during the full duration of the FN exam. Heatmaps were created to spatially localize levels of synchrony for the 19 pairs of facial landmarks across the face for each of the five cues executed. The average Pearson coefficients for all healthy controls during the five cues are presented in Figure 5b, where dark blue corresponds to a positive correlation of 1 and dark red corresponds to a negative correlation of -1. During all the cues, there are strong positive correlations, above 0.90 for each cue’s regions of interest: ry−eyebrow = 0.911 ± 0.052 during the eyebrow cue, ry−eyeclosure = 0.952 ± 0.022 during the eye closure cue, ry−lipregion = 0.934 ± 0.056 during the smile cue, ry−lipregion = 0.900 ± 0.043 during the lip pucker cue, and ry−lipregion = 0.911 ± 0.066 during the show bottom teeth cue.
Figure 6a presents four sample subjects from the MEEI Facial Palsy Photo and Video Standard Set with varying types of facial palsy and their corresponding heatmaps during specific FN exam cues (please view section V regarding images of patients in figures). For example, the complete synkinetic palsy patient (SP1) (left in Figure 6a) presents palsy in the eyebrow region and upper lip vermilion and oral commissure, and this is captured in both the magnitude and direction of the Pearson coefficients in these regions: ry−eyebrow = −0.280 ± 0.187, ry−uppervermilion = − 0.228 ± 0.327, and ry−oralcommissure = 0.55. Similarly, the complete flaccid palsy patient (FP1) (second from right in Figure 6a) presents palsy in the oral commissure and upper lip vermilion, and this is captured in the Pearson coefficients of these regions: ry−oralcommissure = −0.54 and ry−uppervermilion = −0.525 ± 0.009.
When comparing the above results to the Pearson coefficients of the whole time series, rather than specific cues, we lose some of the spatial specificity and sensitivity of asymmetry but the algorithm can still identify key regions of abnormal function. Figure 6b presents the heatmaps of the same two patients described above but for the full duration of the FN exam. Over the whole time course, SP1 has the following Pearson coefficients for the initially identified regions of asymmetry: ry−eyebrow = 0.032 ± 0.0488, ry−uppervermilion = 0.685 ± 0.007, and ry−oralcommissure = 0.67. While FP1 has the following Pearson coefficients for the initially identified regions of asymmetry: ry−oralcommissure = −0.58 and ry−uppervermilion = −0.515 ± 0.077.
In the case of FP1 the regions of asymmetry are apparent during the duration of the whole FN exam, but as evident in SP1, there are specific contexts or facial cues in which their abnormal motor movement is most evident. To better determine under which context a patient’s palsy is most noticeable in, the MD was used to measure the degree of error or how off a patient is from normal facial function. Figure 6c presents a patient with moderate flaccid palsy (FP2) performing all the five facial cues, with the normalized MD values for each 19 symmetric landmarks spatially plotted on top. It can be observed that in some cues, such as the smile and lip pucker cue, FP2’s palsy is not as evident as it is during the eye closure and eyebrow raise cue. A higher MD indicates that landmarks at that specific cue are more divergent from the normal range, allowing us to identify at which context patients differ most from the norm. In FP2’s case, the highest MD is attributed to the eye closure cue, indicating that this is when the patient is the most abnormal from the normal function of all healthy controls defined in Figure 5b.
By creating separate heatmaps for each facial feature, clinicians can focus on specific regions of interest. This feature-specific analysis helps identify asymmetries or patterns related to particular facial landmarks, aiding in the assessment of facial expressions or cues.
C. Performance of using Wasserstein distances to predict clinical scores
While the heatmaps provide insight to visualize specific regions of palsy, we wanted to associate the facial landmarks to clinically validated scores used in the field of Otolaryngology–Head and Neck Surgery, such as the HouseBrackmann and Sunnybrook scale.
Wasserstein distances were calculated for all the subjects (n= 108 after 7 healthy control outliers removed). To determine the effectiveness of using Wasserstein distances to distinguish between normal and abnormal subjects, an ROC curve was plotted for 1000 iterations using 80% as the training data and 20% as testing. The mean AUC from all the iterations was 0.9152 (Figure 7a). Linear regression models between a patient’s Wasserstein distance and given HB score was determined, with a regression coefficient of r = − 0.64. A similar process was executed on a patient’s Wasserstein distance and given Sunnybrook score, with a regression coefficient of r = 0.68. As seen in Figure 7b, c, we observe similar performances in models between the two clinician scales and the calculated distance.
To determine the efficacy of using Wasserstein distances as an objective metric to classify patients into their respective clinician scores, a linear regression model was trained under a subset of data and evaluated. The data was split among each of the HB score categories using a 80%-20% stratified train-test split to preserve the proportions of subjects in each categeory. The linear regression model was trained, where the Wasserstein distance was the independent variable and the HB scores were the dependent variable. Chance-level accuracy for this six-class classification was 1/6 = 16.67%, and when all distances were trained and tested accordingly for 1000 iterations, the mean accuracy was 20.55%. As seen in Figure 8a, the mean true positive values for HB I was 0.0. To observe the performance of a linear regression model trained on only the HB I and HB II subjects, a similar model was trained using a 80%-20% stratified train-test split, resulting in a mean accuracy of 85.64% for 1000 iterations (Figure 8b).
From this, a nested regression model was implemented, such that the subset of predictor variables that were initially given a score of HB I or II from the six-class classification model underwent another two-class classification to better separate the two categories. As a result, the mean accuracy for this six-class classification nested regression model increased approximately 3 times to 59.67% for 1000 iterations (Figure 8c).
D. Application to Surgical Patient Cases
Due to the severe morbidity of chronic facial palsy, decisions in facial nerve (FN) management can carry high stakes, particularly when the degree of injury to the FN is uncertain, as a stretch or compression injury can mimic a complete transection in the early stages of recovery. Here we present two cases of facial palsy, the first due to an iatrogenic injury and the second due to head and neck cancer, that was tracked over the course of time. We present how our proposed algorithms in these two cases could help inform clinicians of potential onsets of FN recovery or worsening paralysis, that can guide their decisions to improve patient outcomes.
1) Case 1: Tracking of subtle onset facial nerve recovery for an iatrogenic patient
Iatrogenic FN injury has a reported incidence of 11% to 40% [40], where rates can vary depending on the type of surgical procedure done and expertise of the surgeon themselves. There is consensus amongst peripheral nerve surgeons that the critical window following acute injury during which nerve repair is feasible before scar formation is up to 72 hours—although with contemporary instruments, this timeline may be extended several weeks [41]. It is recommended that reconstruction of FN damage be done during this acute phase to avoid atrophy of the target facial muscles [41]. However, subtle changes in FN performance are difficult to evaluate both from an observational manner and from current measurements, such as scoring scales like HB or EMG, which only yields 33% accuracy in traumatic injury patients [42].
This case presents an iatrogenic patient (IP), Figure 9, who underwent a tumor excision where the surgeon was not certain whether a cranial nerve was damaged or not. A FN exam was performed the week following injury and then again 3.5 weeks after (please view section V regarding images of patients in figures). Figure 9 present the Pearson correlation coefficients during both trials, Figure 9a at 0 weeks and Figure 9b at 3.5 weeks. Differences in Pearson coefficient values are observed specifically during the smile cue and the lip pucker cue; the increased magnitudes of the eyebrow (p < 0.01) and eye (p < 0.05) features increase after 3.5 weeks are statistically significant (Figure 9c), indicating increased facial dynamic symmetry. However, since the magnitudes of coefficients for the lip region features significantly decreased (p < 0.05), IP still has not complete normal facial function. Furthermore, Figure 9d presents significant increases in the eyebrow feature coefficients (p < 0.01) during both the lip pucker and smile cues.
The MD was used to identify the context under which palsy was most apparent for IP, and Figure 10a presents the smile cue being the most deviation from normal function. With this context, the Wasserstein distances were calculated for IP and the linear regression model was used to predict HB scores. The model predicts IP to have a HB score of V at week 0 and a HB score of III at week 3.5, as seen in Figure 10b. These predicted scores match the clinician assigned scores, which were HB V at week 0 and HB III at week 3.5, with a mean square error (MSE) of 0.5 for the two predictions.
Due to the signs of improvement over this time duration, IP’s clinician suggested to hold off on surgical intervention and continue monitoring her recovery over the next few months. The proposed visualization algorithm and metrics are able to guide clinicians of IP’s overall FN recovery over the 3.5 weeks. Therefore, these tools can be used to supplement a clinician’s decision on whether reanimation surgery is necessary or not, proving to be useful when a time-sensitive decision needs to be made.
2) Case 2: Tracking facial nerve function preand post-facial nerve sacrifice and repair of a head and neck cancer patient
Subtle, progressive and non-resolving facial palsy can be the sole initial clinical manifestation of perineural spread of head and neck cancer occurring along the facial nerve (CN VII), and be misdiagnosed as a benign process such as Bell’s Palsy in cancer survivors, particularly in the absence of a discrete tumor. Facial paralysis continues to pose a significant concern during cancer resection and facial nerve interventions are often not prioritized amidst urgent oncologic treatments [43]. Detection of subtle changes in facial nerve function preoperatively could greatly aid surgical planning as the optimal time for facial nerve reconstruction is at the time of tumor resection. For patients who develop subtle facial weakness following cancer treatments with an intact facial nerve, additional metrics tracking function could potentially detect cancer recurrence at an earlier date and lead to more effective, earlier treatments. Monitoring outcomes after facial reanimation surgery could also greatly advance with more robust, objective metrics that what are currently utilized.
This case presents a patient with a right parotid malignant tumor (CP), Figure 11, who underwent a radical right parotidectomy with a FN sacrifice, lateral temporal bone resection, and neck dissection to remove the tumor. Other reanimation procedures were performed on CP such as a fascia lata and suture static suspension to the nasolabial fold, nerve harvests and grafting for FN repair, and placement of a eyelid weight in their right eyelid (please view section V regarding images of patients in figures). Videos of CP performing the FN exam were taking 3 weeks prior to surgery when they started developing FN weakness, 8 months after surgery to remove the tumor and repair FNs, and 2 years after the surgery.
Figure 11 tracks CP’s initial onset of FN weakness and FN function 8 months and 2 years after surgery. The initial FN weakness was initially most apparent under the eye closure cue, where slight weakness is observed in the lower eye regions and oral commissure areas (Figure 11a). Figures 11b, c present worsening paralysis in the eyebrow region after surgery with the maximum deviation contexts varying from the smile cue at 8 months to the bottom teeth cue at 2 years.
With the maximum deviation contexts determined by the MD algorithm (bottom of Figure 11), CP’s Wasserstein distances were calculated at each time point. The linear regression model was used to predict HB scores both before and after surgery. The model predicts an initially HB score of III when they first develop FN weakness 3 weeks prior to surgery (Figure 12a). Once CP had their FN sacrifice surgery, CP developed complete flaccid paralysis and was diagnosed with a HB score of VI after surgery. CP visited the clinic at 4 months post-surgery, and static photographs of the FN exam were taken at to capture the extent of CP’s flaccid paralysis. However, the study’s analyses could not be applied as a video, rather than photos are required. The model predicts a HB score of IV 8 months post-surgery, when CP developed mild synkinesis. (middle of Figure 12). the score increases to a VI 2 years after surgery (right of Figure 12). These predicted match the clinician assigned scores, which were HB III 3 weeks post-surgery, HB IV 8 months post-surgery, and HB III 2 years post-surgery, with a MSE of 1.0 for the three predictions.
This case successfully captures the trends of CP’s initial FN weakness before surgery and the worsening of paralysis after the tumor was removed, and the regression model matches clincian HB scores fairly well. These visualization algorithms and metrics can therefore be applied to various causes that may lead to onsets of FN weakness. Furthermore, implementation of this algorithm in similar cases could provide guidance to surgeons on whether tumor resection needs to be performed earlier, leading to better patient outcomes post-surgery.
IV. Discussion
Current methods of facial nerve (FN) assessment rely on clinician experience and subjective scales such as HouseBrackman (HB) or Sunnybrook [44], [45] which can miss subtle changes to a patient’s FN exam such as interval asymmetry or segmental weakness. Clinician-based evaluation of facial function is inherently subjective with a reported interobserver variability up to 64% for severe facial palsy (HB III or worse) [46]. Machine learning (ML) algorithms for facial analysis have advanced significantly in the past few decades [20] [21] but are based on training datasets of intact facial function; application of these tools to facial paralysis patients photographs was reported to cause significant landmark inaccuracy that precludes clinical usage [22] [23]. An objective, rigorous quantification of dynamic facial function in videos has remained elusive [29] [30] [28] as there are multiple challenges to accumulating a sufficiently large training dataset to improve landmark accuracy in the facial paralysis population. Recent work has demonstrated that the latest the open-source computer vision and machine learning facial analysis [34] on videos of facial paralysis patients reduced landmark errors significantly compared to static image analysis [32]. In this study we report for the first time a novel technique of using barycenters and wasserstein distributions to develop robust ML techniques to classify between normal and abnormal facial function in videos of facial palsy patients and one of the few papers that uses robust metrics to predict commonly used clinical grading scales like HB and Sunnybrook.
Although our ability to read and interpret facial expressions is an evolutionary trait learned at a young age, our understanding of normal facial movements [47] and ability to describe it mathematically is limited, particularly in the asymmetric and asynchronous movements of facial palsy patients. In our study, we demonstrate Pearson correlation coefficients as a valuable metric for quantifying the degree of symmetry in facial movement over the duration of a video. By leveraging the coordinates of facial landmarks across multiple frames for certain cues, these correlation coefficients provide a robust statistical measure of the linear relationship between facial movements, while also allowing us to identify regions of asymmetry. This video ML analysis could for the first time characterize the dynamic properties of various facial movements, like smiling versus raising your eyebrows, without requiring manual assessment, and significantly advance our understanding of facial function. This context specific facial function is what drove the final analyses of determining the Wasserstein distances that aided in developing prediction models to map back to the clinical HB score. Focusing on specific facial cues where paralysis is most pronounced in palsy patients enhances the accuracy of ML models by reducing generalizability and honing in on critical regions of dysfunction. Additionally, it offers clinicians valuable context regarding which facial regions require particular attention, optimizing their diagnostic and surgical strategies for patients.
There are several limitations of this work. Firstly, the gold standard for facial analysis remains the human eye and landmark errors still occur during videos of facial palsy patients. We have found particular facial features and certain cues that decrease landmark accuracy such as eyelids during forceful blinking, as well as facial rhytids, facial hair, increased BMI, and patient movement are some of the suspected sources of error. Patient effort can also affect the range of facial movement, particularly in the facial palsy patient population, who often consciously or subconsciously avoid large movements like smiles to mask their asymmetry. Other errors in our study could be attributed to human error during the FN exam resulting from unstable camera setups and other noise artifacts. We hope to continue developing stronger pre-processing steps after facial landmark extraction to increase the accuracy of our presented clinical score prediction models.
Uncertainty analysis has been previously investigated in sleep tests for obstructive sleep apnea with the primary advantage of automated identification identification of difficult to analyze data that is flagged for “clinician-in-theloop” review [48]. One important finding that is likely reflected in other clinical datasets analyzed by ML tools, is the notable lack of perfect concordance among the “gold-standard” clinician review. The elements of doubt, uncertainty or gray areas are well recognized as part of the art of medicine; providing uncertainty metrics allows for this information to be automatically reported for additional review. We hope to conduct future investigations in developing insight on uncertainty metrics that could trigger additional clinician review and augment the accuracy of our facial functional rating.
Furthermore, increasing the database of both healthy controls and facial palsy subjects in a controlled setting will allows us to better understand the spectrum of normal dynamic facial function, particularly with the limitations of available facial palsy datasets. In the case of the surgical head and neck cancer patient (Figure 12), the prediction model had difficulties identifying the patient’s transition from complete flaccid paralysis to moderate synkinesis post-surgery. Increasing facial palsy databases will facilitate a comprehensive analysis of facial dysfunction, enabling a detailed comparison between synkinetic and flaccid patients. By discerning the distinctive patterns and nuances in each type of palsy, we can develop more accurate prediction models tailored to differentiate between them, thus advancing personalized treatment strategies and improving outcomes for individuals affected by facial paralysis.
Future applications of dynamic facial analysis are manifold. Recent advances in telemedicine technology have enabled secure, patient-driven, HIPAA compliant, iOS telemedicine applications that have applied advanced computer vision and artificial intelligence techniques for diverse purposes such as tracking rehabilitation outcomes after knee arthroplasty [49], epilepsy monitoring [50], to assessing liver steatosis from donors [51]. There are currently half a million survivors of Head and Neck Cancer in the US. Development of this technology could empower patients to track their own facial function after Head and Neck cancer therapy with a telemedicine tool that employs Machine-Learning algorithms for facial analysis. Providing patients with this tool to monitor their facial function remotely could potentially reduce diagnostic delays, facilitate objective data that can be shared with clinicians through telemedicine, and empower patients with ownership of their own data. Most critically, this tool will be available across socioeconomic and insurance statuses as the vast majority of Americans (97%) own a cellphone and nine-in-ten own a smartphone [52]. Advances in technology and a shift towards more patient-centered assessments may contribute to improving the accuracy and sensitivity of FN function evaluations and facilitate an understanding of dynamic facial function in a manner not previously possible.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
V. Addendum
Due to pre-print screening requirements, images of the facial palsy subjects discussed in this study will only be released in the final publication. Readers may contact the corresponding author prior to final publication to request access to these materials.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵