To a casual observer, it might appear that "voice interfaces" to computers (like Siri or Alexa) are a single technology space. In fact, it’s useful to think of them as two problems combined.
First, the computer needs to determine exactly what you said. That means deciphering mumbling, removing background noise, handling different voices and accents, etc. That’s difficult, but we’re getting better at it.
Second, the computer needs to understand what you meant to do. This is difficult because it means mapping that input to commands the system already supports, and then executing them.
These are very different problems. The first one is called Voice Recognition (more formally, speech recognition), and the second is called Natural Language Processing.
[ NOTE: Natural Language Processing and Neuro-linguistic Programming share the NLP acronym, but they’re quite different. Most notably, Natural Language Processing is a real and developing science and Neuro-linguistic Programming is mostly debunked pseudoscience. ]
You can have a system that’s great at figuring out exactly what you said, even if you mumbled, have a thick accent, or were speaking quietly on a subway, but that has no idea how to turn your sentence into actions it can perform.
As an example, you might mumble:
Find a better song than this garbage.
If this system is limited to a few hardcoded commands, such as PLAY $ARTISTNAME, then it will respond with an error, or a request for clarification, because it didn’t hear the keyword PLAY.
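To make that failure concrete, here is a minimal Python sketch of such a keyword-only system. The pattern and the `interpret` function are hypothetical, invented purely for illustration, and the sketch assumes the recognition side has already handed over a perfect transcript.

```python
import re

# Hypothetical command grammar: the only thing this toy system understands
# is the word "play" followed by an artist name.
PLAY_PATTERN = re.compile(r"^play\s+(?P<artist>.+)$", re.IGNORECASE)


def interpret(transcript: str) -> str:
    """Map a transcribed sentence to an action, or ask for clarification."""
    match = PLAY_PATTERN.match(transcript.strip())
    if match is None:
        # The words were heard correctly, but they fit no known command.
        return "Sorry, I didn't understand that."
    return f"Playing {match.group('artist')}"


print(interpret("Play Radiohead"))
print(interpret("Find a better song than this garbage."))
```

The second call falls through to the clarification message even though every word was transcribed correctly, which is exactly the situation described above.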
Conversely, you could have a system that would understand that sentence perfectly, except that when you say it, even in a relatively quiet setting, it instead hears:
Fire the buttress log on the garage.
Again, one side of the system let down the other side, and the system as a whole responds with an error or an additional prompt.
The key point here is that the system is generally only as good as the weaker side of this equation. Voice interfaces continue to become more usable because they’re advancing in both of these areas simultaneously, and they’re incorporating the improvements of each into new iterations.
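One way to picture this is as a two-stage pipeline in which a failure in either stage sinks the whole request. The sketch below is a toy illustration in Python, not how any real assistant works: both recognizers and both interpreters are made-up stand-ins, and the "audio" is simply the spoken sentence as a string.

```python
from typing import Callable, Optional

# Stage 1: audio -> transcript. Stage 2: transcript -> action (None if not understood).
Recognizer = Callable[[str], str]
Interpreter = Callable[[str], Optional[str]]


def good_recognizer(spoken: str) -> str:
    """Stand-in for strong recognition: returns exactly what was said."""
    return spoken


def bad_recognizer(spoken: str) -> str:
    """Stand-in for weak recognition: always mishears the sentence."""
    return "Fire the buttress log on the garage."


def keyword_interpreter(transcript: str) -> Optional[str]:
    """Stand-in for weak understanding: only knows 'play <artist>'."""
    text = transcript.strip()
    if text.lower().startswith("play "):
        return "PLAY " + text[5:].rstrip(".")
    return None


def flexible_interpreter(transcript: str) -> Optional[str]:
    """Stand-in for stronger understanding: also handles one loose phrasing."""
    text = transcript.lower()
    if "better song" in text:
        return "SKIP_TRACK"
    return keyword_interpreter(transcript)


def voice_interface(spoken: str, recognize: Recognizer, interpret: Interpreter) -> str:
    """Run both stages; the request succeeds only if both stages succeed."""
    action = interpret(recognize(spoken))
    return action if action is not None else "ERROR"


spoken = "Find a better song than this garbage."
print(voice_interface(spoken, good_recognizer, keyword_interpreter))   # ERROR
print(voice_interface(spoken, bad_recognizer, flexible_interpreter))   # ERROR
print(voice_interface(spoken, good_recognizer, flexible_interpreter))  # SKIP_TRACK
```

Only the combination where both stages are strong produces an action. As a rough rule of thumb, if you assume the two stages succeed independently, the chance that both succeed is the product of their individual success rates, which can never be higher than the rate of the weaker stage.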
Voice interfaces to computers require both voice recognition and NLP.
These are quite separate and it’s possible to be good at one and bad at the other.
The system overall can only be as good as the worse of the two.
We’re seeing improvements in voice interfaces because both sides are improving simultaneously.
The next time you interact with a voice system and it fails, think about which of these two components was responsible.
I’m not an expert in this field, but I am willing to wager that each of these two categories (Voice Recognition and NLP) likely breaks down into many others. I think it’s useful, however, to think about them as two components in many contexts.