IBM speech recognition is now nearly as accurate as humans

Fariha Khan
March 9, 2017

According to IBM, humans mishear or misunderstand roughly 5 to 10% of the words they hear in everyday conversation. That may seem like a lot, but our minds compensate so well that we rarely notice. Computers face a similar problem of misrecognizing words, which makes their task more challenging.

IBM has just published a blog post announcing a new milestone in its pursuit of accurate conversational speech recognition. The firm has built a system that has “reached a new industry record of 5.5 percent” for word error rate, the share of words in a conversation that the software gets wrong. The company's previous breakthrough in this field came in 2016, with a system that reached a word error rate of 6.9%. The latest figure was obtained on the SWITCHBOARD linguistic corpus. It brings IBM closer than ever to what it takes to be the human error rate, 5.1%.
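The article does not spell out how word error rate is computed. As a rough illustration, assuming the standard definition (substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words), a minimal Python sketch might look like this. The function name and the example sentences are made up for illustration, and this is not IBM's scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from the hypothesis
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

Here two of the nine reference words are wrong, so the sketch reports a word error rate of about 22%. On the same scale, a 5.5% rate corresponds to roughly one error in every 18 words.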

According to a blog post by George Saon, Principal Research Scientist at IBM:

“To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short Term Memory) and WaveNet language models with three strong acoustic models. Within the acoustic models used, the first two were six-layer bidirectional LSTMs. One of these has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples - so it gets smarter as it goes and performs better where similar speech patterns are repeated.”

Julia Hirschberg, professor and Chair of the Department of Computer Science at Columbia University, said:

“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated.”

Working with its partner Appen to reproduce human-level results, IBM determined that human performance, at 5.1%, is still slightly better than the machine's. IBM and others in the field have been chasing this target for quite a while, and some have recently claimed to have reached human parity, putting it at 5.9%. According to IBM, while its 5.5% result is a major advance, the 5.1% human figure shows that there is much more to do before anyone can say the technology is truly as good as humans. The company adds that settling on an industry-wide standard measure of human parity is more complex than it looks, and that the highest standards of accuracy must be upheld when measuring it.