I’ve thought for a while now that the next big leap in consumer electronics will be personal sensors, meaning—most importantly—video and audio.
If people were just recording video and audio while they walked around, that might have some sort of utility, but probably not much. It’d be more effort than it was worth to sift through the content and find something interesting.
The actual breakthrough will be when people are wearing these sensors and the data from the sensors are being parsed by computers.
Imagine that you’re able to constantly record just video and audio, in all directions, as you walk around during your day. Forget how that’ll happen—it’ll likely be some sort of button or sticker, or other wearable item—but just assume that we can capture the data.
Now imagine a decent set of machine learning algorithms on the other side of the sensor, both at the edge and in the cloud, that are processing the data in realtime. What kinds of functionality will come from this?
- Realtime language translation.
- Notifications if someone is looking at you or talking about you.
- Notifications when someone around you is talking about something dangerous.
- An alert if there is an incoming car or bicycle you might not see.
- An alert if someone has been following you for blocks without you knowing.
- A ping if a friend’s voice is heard, or their face is seen, in a crowd.
- A ping if someone in the coffee shop is famous or popular in some way, or if they share common interests with you.
- Various statistics for a given time period, e.g., how many men vs. women you passed on the street, how many cars went by, how many Android vs. iPhone devices are in this coffee shop.
- How many calories you ate (it can see all your food).
- How many calories you burned (it can watch you exercise).
- How much sleep you got (when you turned it off and back on, plus other measurements).
- How many times you talked to your friends and family vs. strangers, etc.
- Pictures of any scene, taken with a voice command.
- 360-degree images of places.
- Based on your excitement level (accelerometer data, heart rate, etc.), automatically capture video or audio and pipe it to certain people, companies, or organizations. Imagine crashes, or robberies, or health issues.
- People cutting you off in traffic, littering, public fights, documentation of violence of all kinds.
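To make the shape of this concrete, here's a minimal sketch of how such a system might be wired: sensor frames fan out to a set of independent detectors, each of which may emit an alert. Everything here is hypothetical (the `Frame` fields, the detector names, the stand-in matching logic); a real detector would run a speaker-ID or vision model, not a byte check.

```python
from dataclasses import dataclass
from typing import Callable, Optional, List

@dataclass
class Frame:
    """One slice of wearable sensor data (hypothetical fields)."""
    audio: bytes
    video: bytes

# A detector inspects a frame and may return an alert string.
Detector = Callable[[Frame], Optional[str]]

def friend_voice_detector(frame: Frame) -> Optional[str]:
    # Placeholder: a real implementation would run a speaker-ID model here.
    return "Friend's voice nearby" if b"friend" in frame.audio else None

def run_pipeline(frame: Frame, detectors: List[Detector]) -> List[str]:
    """Fan one frame out to every detector and collect the alerts."""
    alerts = []
    for detect in detectors:
        alert = detect(frame)
        if alert is not None:
            alerts.append(alert)
    return alerts

alerts = run_pipeline(Frame(audio=b"friend talking", video=b""),
                      [friend_voice_detector])
print(alerts)  # ["Friend's voice nearby"]
```

The point of the fan-out design is that each use case above is just another detector plugged into the same stream, which is why many companies could compete on individual detectors.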
These are just a few I thought up in a minute or two. There will be thousands of use cases, created by hundreds of companies. Some companies will be great at identifying objects. Others will be good at voice. Others will be good at situations and scenes, etc.
And all of this will filter through your personal operating system so that the insights that are gained are passed on to the rest of your personal platform—at home, in the car, and at work.
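One way to picture that filtering layer is as a small publish/subscribe hub inside the personal OS: detectors publish insights on topics, and each surface (home, car, work) subscribes to the topics it cares about. This is only an illustrative sketch; the class and topic names are invented.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class InsightRouter:
    """Hypothetical personal-OS hub: detectors publish, surfaces subscribe."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, insight: str) -> None:
        # Deliver the insight to every surface listening on this topic.
        for handler in self._subs[topic]:
            handler(insight)

car_display: List[str] = []
router = InsightRouter()
router.subscribe("safety", car_display.append)  # e.g., the car surface
router.publish("safety", "cyclist approaching from the left")
print(car_display)  # ["cyclist approaching from the left"]
```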
Sensors connected to machine learning algorithms are powerful. Especially when you have video and audio. But the true power of this combination will come when we’re wearing those sensors and the observations, notifications, and alerts produced by the algorithms are uniquely valuable to us as individuals.
That’s when the market will explode.
Then there’s a whole separate industry that will rise up around the interfaces for presenting this data to the person. It’ll start with AirPods and Buds, i.e., voice prompts because they’re the first hands-free interface we’ll have. The next breakthrough will be a usable visual interface that is hands-free, meaning glasses.
Lots of other tech is interesting right now, but this is the major bump I’m waiting for.
- I wrote about lifecasting back in 2008, but I had no idea then about computers parsing the input from the stream. It’s quite entertaining to read that piece now. There’s still lots of good stuff in there, but the assumption then was that humans, not algorithms, would parse the content.
- By the way, I’m not saying this is “next” as in coming immediately. I mean it’s the next big thing, at least that I care about.
- There are a lot of similarities between this and Universal Daemonization, which I wrote about in my book. The difference is that this parses light and sound, whereas UD requires the daemon infrastructure.