2017-08-21

MICROSOFT HAS announced a new milestone in its quest for parity with humans on speech recognition.

An announcement from Xuedong Huang, a Microsoft Technical Fellow, reveals that the firm's latest tests show a word error rate of 5.1 per cent, the level it counts as human parity. That is an improvement on the 5.9 per cent previously announced, which was already better than that of a regular, casual human conversation and twice that of a Loose Women panellist.

The results are based on the standard Switchboard test, a benchmark built on a collection of recorded telephone conversations which the machine being tested then has to transcribe.
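For anyone wondering what the headline number actually measures, a word error rate is conventionally the count of substituted, deleted and inserted words needed to turn the machine's transcript into the reference one, divided by the number of words in the reference. The snippet below is a minimal sketch of that textbook calculation, not Microsoft's scoring tool; the function name `word_error_rate` and the whitespace tokenisation are our own assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real benchmark scoring also normalises case, punctuation and so on.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    # One substitution plus one deletion over six reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

On that basis a 5.1 per cent score means roughly one word in twenty comes out wrong, missing or invented.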

Here comes the science bit: "We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modelling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels."
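To give a flavour of what combining predictions "at the frame level" could look like, here is a rough sketch in which each acoustic model outputs a probability distribution over classes (senones, in the quote) for every audio frame, and the distributions are blended by weighted averaging. The quote does not say how Microsoft does the combination, so the averaging scheme, the function name `combine_frame_posteriors` and the toy numbers are all assumptions for illustration only.

```python
from typing import List, Optional, Sequence

# One model's output: for each frame, a probability per class.
Posteriors = Sequence[Sequence[float]]


def combine_frame_posteriors(
    model_posteriors: Sequence[Posteriors],
    weights: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Blend per-frame class probabilities from several models.

    Assumption: plain weighted averaging, with every model covering the
    same frames and the same classes. This is a hypothetical scheme.
    """
    n_models = len(model_posteriors)
    if n_models == 0:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")

    total = sum(weights)
    n_frames = len(model_posteriors[0])
    combined: List[List[float]] = []
    for t in range(n_frames):
        n_classes = len(model_posteriors[0][t])
        frame = [0.0] * n_classes
        for weight, posteriors in zip(weights, model_posteriors):
            for k in range(n_classes):
                # Normalise the weights so each frame still sums to 1.
                frame[k] += (weight / total) * posteriors[t][k]
        combined.append(frame)
    return combined


if __name__ == "__main__":
    # Two hypothetical models, two frames, two classes.
    model_a = [[0.8, 0.2], [0.3, 0.7]]
    model_b = [[0.4, 0.6], [0.5, 0.5]]
    for frame in combine_frame_posteriors([model_a, model_b]):
        print([round(p, 3) for p in frame])
```

Word-level combination, the other half of the quote, works on whole transcripts rather than frames and is not shown here.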

Microsoft goes on to explain that the system can now tune its language model to what has already been said in a conversation, so it can better predict what is likely to come next and keep hold of topic and context - something that its current public bots are doing less well at.

The post warns that the community still has much to do: noisy surroundings, far-away mics, weird accents and, more fundamentally, speaking styles and languages for which there is limited training data.

In a week when the BBC has announced a dedicated Pidgin English language radio service, how likely is it that AI will actually be able to speak it or even interpret it as a different, valid language from British English?

There is also the issue of whether computers actually understand the words, and can contextualise them and add meaning. An error rate is one thing, but to meaningfully interpret speech, computers will have to learn to better understand it, with Huang adding: "Moving from recognising to understanding speech is the next major frontier for speech technology."

Mozilla is in the midst of publicly collating new data to make its speech library more accurate. The company then plans to release the audio data as a library to make it easier for smaller businesses and individual coders to incorporate speech data. µ