Worse, even if you do create hundreds or thousands of such queries (which Amazon is trying to do with Alexa Skills), you haven’t solved the problem, since there is no way for the user to know what they can ask, nor remember what skills Alexa does and does not have. The ideal number of skills for such a system is either 3 or infinity, but not 50 or 5000.
This means voice can work very well in narrow domains where you know what people might ask and, crucially, where the user knows what they can and cannot ask, but it does not work if you place it in a general context. That, in turn, means I see these devices as, well, accessories. They cannot replace a smartphone, tablet or PC as your primary device.
If you’re not following Benedict Evans’ work, you should be. His annual presentation on the state of tech is among the top three such reports in the world in my opinion.
But I do occasionally disagree with him. When we talked in person he dismissed the idea of understanding a long-term strategy for consumer IoT, basically saying that everyone knows the grand strategy, and that the only thing that matters is the next steps. I disagree. I think companies like Google, Apple, and Microsoft should be talking a lot more—even this early—about the overall lifestyle integration play that we’re all moving towards.
But more tactically, I think Benedict is wrong about voice interfaces. He’s made points like this in many places:
> ML means we can use voice to fill in dialogue boxes, but the dialogue boxes still need to be created, one at a time, by a programmer in a cubicle somewhere. That is, voice is an IVR – a tree. We can now match a spoken, natural language request to the right branch on the tree perfectly, but we have no way to add more branches except by writing them one at a time by hand.
I think the solution here is fairly straightforward, although not trivial.
The voice platforms simply need to capture enough ways to say the same thing that they reach a certain confidence level across the population as a whole.
People don’t need perfection from their lifestyle tech, but they need a high confidence rate. I don’t study this, so I’ll just say 9/10, or “usually”, or “the vast majority of the time” is the standard we’re shooting for.
With a voice interface, there are only so many phrasings we need to handle well before a particular use case (say, asking about the weather) is considered solved.
I’m really just guessing on these numbers, but I think they’re in the right magnitude.
There are probably a dozen common ways to ask about the weather (who knows, it could be more), but as you get into the second dozen those phrasings become dramatically less likely. And at, say, 36 different ways, you’ve probably covered 99.9% of requests.
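That intuition can be sanity-checked with a back-of-the-envelope sketch. This assumes phrasing frequency follows a Zipf-like distribution, and the vocabulary size and exponent here are made-up parameters, not measured data:

```python
# Rough check of the claim that a few dozen phrasings cover nearly
# all requests for one use case. Assumes a Zipf-like frequency
# distribution; vocab_size and s are illustrative assumptions.

def zipf_coverage(top_n: int, vocab_size: int = 1000, s: float = 2.0) -> float:
    """Fraction of requests covered by the top_n most common phrasings,
    if the i-th most common phrasing occurs with weight 1 / i**s."""
    weights = [1 / i**s for i in range(1, vocab_size + 1)]
    return sum(weights[:top_n]) / sum(weights)

for n in (12, 24, 36):
    print(n, round(zipf_coverage(n), 3))
```

With these parameters the first dozen phrasings already cover roughly 95% of requests, and the curve flattens quickly after that. The exact ceiling depends heavily on the exponent, so treat this as shape, not proof.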
Now it’s a matter of collecting use cases:
- News
- Weather
- Sports
- Calendar
- Communication with friends
- Reminders
- Home entertainment
- Timers
- Math calculations
- Recipes
- Trivia
- Questions about the assistant itself
- Swearing at the assistant
- Meaning of life questions
- Etc.
The list will be large, but humans are (usually) remarkably simple and predictable beings. We wake up, we want the news, we make coffee, we eat breakfast, we go to work, we sit in traffic, we talk to our friends, we sit in traffic, we come home, we watch television, we get ready for the next day, we go to bed. And we do this a number of times before dying.
Those are use cases—all of which need their own potential invocation options mapped. But they don’t need to be perfect. They only need to hit that magical number of “very high” confidence.
By definition, most people will use the most common ways of asking a given question, and once you map the space around those most common methods adequately I think we will be able to reach “good enough” fairly easily. It’ll be hard work initially, but it’ll level off quickly because human speech doesn’t evolve quickly enough to present this as a problem.
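In code, that mapping is nothing exotic: a curated set of phrasings per intent, light normalization to collapse trivial variants, and a fallback for everything else. This is a minimal sketch of the idea; the phrasing list and intent names are illustrative, not any real assistant’s API:

```python
# Minimal sketch of "map the space around the most common phrasings".
# The phrasings and intent names here are hypothetical examples.

import re

# A hand-curated set of common ways to ask one question ("weather").
WEATHER_PHRASINGS = {
    "what's the weather",
    "what is the weather like",
    "how's the weather today",
    "will it rain today",
    "do i need an umbrella",
    "what's the forecast",
}

def normalize(utterance: str) -> str:
    """Lowercase and strip punctuation so trivial variants collapse."""
    return re.sub(r"[^\w\s']", "", utterance.lower()).strip()

def match_intent(utterance: str) -> str:
    """Return 'weather' for a known phrasing, else 'unknown'."""
    if normalize(utterance) in WEATHER_PHRASINGS:
        return "weather"
    return "unknown"

print(match_intent("What's the forecast?"))  # -> weather
print(match_intent("Lux go now"))            # -> unknown
```

Real systems fuzzy-match rather than demand exact strings, but the economics are the same: each phrasing added covers a shrinking slice of remaining requests, so the curated set levels off quickly.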
Nobody’s going to come home one day and say, “Illumination request initiated” or “Lux go now.” And even if they did, they shouldn’t really expect it to work, and they wouldn’t be upset that it didn’t.
In short, I think Benedict is overestimating the number of combinations that need to be mastered to hit the feeling of “Minimum Necessary Confidence” that’s required to transition voice interfaces from novelty to everyday infrastructure.
To his credit, he’s also said he could be wrong about this. And I’ll say that as well. It could be that I’m wrong about how many combinations there actually are for each use case, making the mappings too numerous to reach Minimum Necessary Confidence.
We’ll have to see, but I’m betting this will just take 2-3 years to get us there for most of the use cases I’ve listed above.