Facial Gestures and Eye Tracking as Computer Inputs

September 3, 2017

There’s a lot of talk of voice becoming a major computer interface in the coming years. I agree with this, and I talked about it in my book.

But there are some major limitations to voice.

It’s audible (that’s the whole point) so it’s not great for use in extremely quiet places like libraries, or extremely loud places where the system won’t be able to hear you.
Compared to other input methods like pressing a button or making a subtle gesture, voice is slow. It takes time to form a response, and it takes time to vocalize that response.

Apple is supposedly releasing an iPhone that uses facial recognition for its authentication system. It’s evidently using some sort of 3D facial scanning technology that is faster and more accurate than Touch ID.

While using this for authentication is interesting, I think the idea of using it as an input system is far more so.

If the tech is good enough, then when you’re looking at your phone you could simply do the following to control your phone.

Smile (like)
Frown (dislike)
Snarl (dislike)
Blink your eyes while looking at something (select)
Look bored (meh)
Have your pupils dilate (involuntary) (love)
Roll your eyes (bored) (meh)
Tilt your head slightly one way or another (control interface)
Slightly nod (accept)
Slightly shake the head (decline)

Etc.
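
As a rough sketch of what that list amounts to, here is a lookup table from detected expressions to commands. Everything in it is an assumption for illustration: the gesture labels, the command names, and the idea that some face-tracking layer hands you a label plus a confidence score. It is not how any real phone does it.

```python
from typing import Optional

# Hypothetical gesture labels mapped to the commands from the list above.
GESTURE_COMMANDS = {
    "smile": "like",
    "frown": "dislike",
    "snarl": "dislike",
    "blink": "select",
    "look_bored": "meh",
    "pupil_dilation": "love",
    "eye_roll": "meh",
    "head_tilt": "control_interface",
    "nod": "accept",
    "head_shake": "decline",
}


def command_for(gesture: str, confidence: float, threshold: float = 0.8) -> Optional[str]:
    """Return the command for a detected gesture, or None if unsure or unknown.

    `confidence` is assumed to come from a face-tracking model (0.0 to 1.0).
    The 0.8 threshold is an arbitrary placeholder, not a tuned value.
    """
    if confidence < threshold:
        return None
    return GESTURE_COMMANDS.get(gesture)
```

The threshold is the interesting knob: set it too low and every twitch becomes a command, too high and you are back to exaggerated expressions.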

The key thing is that this wouldn’t be the entire interface. You’re also holding the phone, so you could use your thumbs to do some of this, e.g., swiping side to side with the thumb to go backward and forward.

And you’re also using voice at the same time.

So imagine a system where you’re holding your device.

You say "Show me things about the new AirPods."
It brings you results instantly.
You look at a result and blink quickly, and that link opens.
You very subtly frown, snarl, shake your head, or thumb swipe backward to go back to your results.

ANY of those. Or all of them.
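
Here is one way the "any of those, or all of them" part could look in code: several signals from different input channels collapse into a single action. The `InputEvent` type, the signal names, and the rule that "go back" wins over "open" are all made up for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class InputEvent:
    modality: str  # assumed channels: "voice", "face", or "touch"
    signal: str    # hypothetical label for what was detected


# Any one of these, from any channel, means "go back to the results".
BACK_SIGNALS = {
    ("face", "frown"),
    ("face", "snarl"),
    ("face", "head_shake"),
    ("touch", "swipe_back"),
}


def resolve(events: Iterable[InputEvent]) -> Optional[str]:
    """Collapse a batch of near-simultaneous events into one action."""
    signals = {(e.modality, e.signal) for e in events}
    if signals & BACK_SIGNALS:
        # One signal or several at once: the result is still a single "go_back".
        return "go_back"
    if ("face", "blink") in signals:
        return "open_link"
    return None
```

So a frown plus a thumb swipe in the same instant does not go back twice; the system treats them as one intent expressed redundantly.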

The insane part is how little you’ll have to contort your face. It’s not like you’ll have to make these exaggerated expressions or head motions.

No.

Over a short training period, combined with billions of human input cases that continually train the machine learning models, you’ll soon be using the system extremely naturally. You’ll feel as if you’re just looking at your phone, and your NORMAL reactions to content and interfaces will be enough to send the commands that you wanted to send.

And of course, the system will also be learning your individual preferences for the mixture of these inputs you prefer, i.e., how much emoting, how much voice, how much touch, etc.

The power here, and the potential for misuse, will be unbelievable.

Apps like Facebook that show you content and want to know how much you liked it will make sure you’re incentivized to enable all the "EUIs" (emoting-based user interfaces) on top of the standard touch and voice options.

Why?

Because they’re going to (with your permission of course, which everyone will give) AUTOMATICALLY record likes, dislikes, loves, etc. So when your pupils dilate, or you focus on something for a long period of time, or you tear up at something sweet, the system will capture that response—no matter how subtle it is—and will do something with it.

At the shady level, companies like Facebook will be able to know exactly what makes people angry, sad, willing to purchase, inspired to act, etc., and they will use that data to serve content that gets more reactions.

You’ll like it because the system will "just know" what you wanted to do anyway, in the vast majority of cases. It’ll start by prompting you.

Like this content?

Because it read your face and it knew you did.

The reason that it’s going to require some interaction, though, is because you might see someone’s new boyfriend, or their best piece of artwork, and your natural reaction might be:

Ewww…

(with the face that comes with it)

Imagine automatically sending EWWW to all your friends’ content whenever you genuinely felt that way about something they shared.

It would hurt a lot of friendships.

Ditto for "Should I send LUST?" for pictures of your buddy’s new wife or husband.

Hey Karen, I just noticed you had EUIs turned on and you got all hot and bothered looking at my husband in my wedding video I just posted. Consider yourself uninvited to the ski trip this weekend.
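
To avoid the Karen scenario, the prompt-before-sending idea could be sketched like this. The reaction labels, the `SENSITIVE` set, and the opt-in `trusted` set are assumptions of mine, not anything a real app exposes. The `confirm` callback stands in for whatever "Like this content?" prompt the interface shows.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical labels for reactions that should never be sent without asking.
SENSITIVE = {"ewww", "lust"}


def maybe_send(
    reaction: str,
    confirm: Callable[[str], bool],
    trusted: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return the reaction to send, or None if nothing should be sent.

    By default every detected reaction triggers a prompt. Reactions the user
    has opted into via `trusted` skip the prompt, unless they are sensitive.
    """
    if reaction in trusted and reaction not in SENSITIVE:
        return reaction
    if confirm(f"Send {reaction.upper()}?"):
        return reaction
    return None
```

For example, `maybe_send("ewww", ask_user)` only sends if `ask_user` returns True, even if someone has marked everything as trusted.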

Technology giveth, and technology taketh away. In this case it’s giving you better responsiveness and prediction and taking away some relationships that are based on polite social deception (see: lots).

The power (and danger) of applying cameras and machine learning to the human face is that it can tell a computer not just what we want to tell it, but also what we want that we didn’t know we wanted, and that’s one very significant step towards reading our thoughts.

When the subconscious controls our expressions, and our expressions can be read and interpreted by computers, this type of interface becomes a window to our actual self.

And our actual self is scary. So much of interacting with people and society is built upon not showing the actual self, but instead the self that you’re actively projecting for a purpose. And cameras + ML will cut right through that for all but a few who know how to control it.

And as I wrote about in my Lifecasting piece in 2008, what kind of society will it be when everyone knows that cameras are watching? Especially when those cameras are basically ML algorithms with eyes.

One result will be obvious: more and more people will become good at not emoting, not saying anything controversial, or—in other words—not being themselves.

You go into public and you become stoic. People will probably wear masks to keep the algorithms from reading their expressions as they are presented content on the street, or as they see people around them. They wouldn’t want to be considered rude for the look of disapproval that their face accidentally sent when they saw someone.

Anyway. Sorry to head down the downside path. I’m a security guy, so it happens.

Expect to see this hybrid type of interface sooner rather than later—especially now that we appear to have the beginnings of the tech that can enable it.

(Blink twice if you liked this article.)