Google’s AI is getting smarter every day. It has helped reduce energy consumption and increase efficiency at Google’s data centers, encrypted conversations between two AIs, and now it can add lip reading to that list. Researchers from Google’s AI division (DeepMind) and Oxford University have created software that analyzes lip movements and reads lips better than humans can. It is, apparently, the most accurate lip-reading software ever created.
This software was created by analyzing thousands of hours of TV footage from the BBC. Scientists trained a neural network to read the lips of newscasters and guests alike and annotate the video. It achieved an accuracy of 46.8%. That may not seem like much, since it wouldn’t even qualify as a pass. However, when human professionals were given the same footage to analyze, they only managed an accuracy of about 12.4%. That’s major. If you consider how humans are technically light years ahead of artificial intelligence at recognizing faces, understanding the meaning behind speech (no matter how cryptic or sarcastic) and reading facial expressions, this is phenomenal.
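To give a sense of what an accuracy figure like 46.8% means in practice, here is a minimal sketch of a word-level accuracy check. Note this is an illustrative simplification: the actual research uses more sophisticated alignment-based metrics (such as edit distance over words or characters), and the function name and scoring rule here are assumptions, not the paper’s method.

```python
def word_accuracy(reference: str, predicted: str) -> float:
    """Fraction of reference words the prediction got right, compared
    position by position. A simplistic stand-in for the alignment-based
    metrics real lip-reading papers use."""
    ref = reference.lower().split()
    pred = predicted.lower().split()
    if not ref:
        return 0.0
    # Count positions where the predicted word matches the reference word.
    matches = sum(1 for r, p in zip(ref, pred) if r == p)
    return matches / len(ref)

# A transcript that gets 3 of 4 words right scores 0.75 (75%).
score = word_accuracy("good evening and welcome",
                      "good morning and welcome")
print(score)  # → 0.75
```

Under a metric like this, the software transcribing roughly every other word correctly corresponds to its ~46.8% score, while the human lip readers' ~12.4% means getting barely one word in eight.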
An Oxford University research team did something similar earlier this month. It created software called LipNet that reads lips using similar techniques, achieving 93.4% accuracy in testing compared to a human accuracy of 52.3%. The key difference between the two projects was the test material: LipNet used specialized recordings of volunteers speaking “formulaic sentences”.
The DeepMind research, by contrast, went to town on random footage involving natural, unscripted conversation, full of varied sentences, phrases, colloquialisms and accents. DeepMind’s software was known as “Watch, Listen, Attend and Spell”. The BBC footage analysed included clips of Newsnight, Question Time and The World Today. All in all, there were 118,000 different sentences and 17,500 unique words. By comparison, LipNet’s test material contained only 51 unique words.
This software could help deaf people understand conversations, or improve digital assistants like Alexa, Siri and Google Assistant so they can read you mouthing words; the latter could be handy in noisy public places. Other cool uses could include annotating silent films or analyzing security footage at crime scenes. Bear in mind, though, that this is only preliminary work, and the software still can’t recognize lip movements more than half the time. So it’s a long road to catching someone threatening a couple on CCTV footage that has fewer pixels than a 144p video on YouTube.