
Lip-reading automatons

Multimodal speech recognition


Marion de Boo

Just imagine that you are standing in the concourse of Rotterdam Central Station, and you can speak into a machine to ask it the time of the next train to Amsterdam, and an electronic voice will instantly tell you the answer, including the platform number. The TU Delft Mediamatics department has been collaborating for some years with OVR (Openbaar Vervoer Reisinformatie), a company that provides public transport information, to create systems for automatic speech recognition.

[See also Delft Outlook 2002.1.]

So far, the results have been nothing to write home about, certainly not when the information was requested from noisy places like station platforms. If the voice of the passenger on the platform is drowned in ambient noise, with its mixture of announcements about delayed trains and the like, the computer gets confused. It is an established fact that other people are much easier to understand if you can see as well as hear them talk. It is not just the deaf who use lip-reading: people with normal hearing also resort to watching the speaker’s mouth as the level of ambient noise increases. This has led to the idea of supporting automated speech recognition systems with software for automatic lip-reading. The system could also come in useful for hands-free phone calls in cars: a small camera could be pointed at the mouth of the speaker and a processor could analyse the video images in real time. Polish IT engineer Jacek Wojdel has developed a working prototype.

Automatic speech recognition has been the focus of worldwide interest for over two decades. International companies have large research departments working on it; at Philips in Aachen, Germany, alone, some 150 researchers are active in the field. IBM has developed the ViaVoice speech system, and the Belgian company Lernout & Hauspie, which recently went bankrupt, was also a major player.

‘I know, it’s kind of ambitious to think that a single doctoral student working on the subject for four years would be able to make much of a contribution,’ Wojdel’s supervisor, Dr. Leon J.M. Rothkrantz of the Mediamatics department, chuckles, ‘but this is an interesting field of research, so we decided to tackle it nevertheless.’

The possible applications of automatic speech recognition supported by lip-reading are numerous. To start with, there are the deaf and hard-of-hearing who could benefit, and it could also be used as a support system in speech therapy. The speaking computer could also help cut staff costs in a number of information systems, for example when booking train tickets, flights, or hotels. You could also ask the system to dial numbers on your car phone without having to use your hands. And lip-reading could help increase the reliability of speech recognition in video-phone systems.

As part of the MASK project, an experimental electronic information kiosk at the Paris St. Lazare station uses touch-screen and speech recognition technology. The idea is to inform passengers in the most efficient and comfortable way about train departures and other public transport connections. Background noise remains a major disruptive factor for the speech recognition system when it tries to make sense of what people are saying.

The phone desks of the ANWB (Dutch Automobile Association) are currently manned by operators who record relatively simple information from stranded drivers in order to direct the motorway patrol to the right location, but they also sometimes receive reports of accidents. If the ANWB were to decide to switch to computers with speech recognition systems to service the motorway emergency phone network, a way would have to be found to eliminate the plethora of background noise.


Platform info kiosk

OVR employs some 350 telephone operators, but during peak hours the large numbers of callers sometimes have to wait a long time before they can be served. It would be nice if an automatic speech recognition system could handle a simple request such as «when does my train leave», so that the operators could concentrate on the more complicated requests. The system could also prove valuable during the night, when OVR cannot be contacted but the trains are still running.

Rothkrantz: ‘Five years ago, we experimented a bit with a speaking computer and speech recognition software developed by Philips. It worked, but it was hardly perfect. Train passengers gave our speech recognition system a score of sixty percent, while the telephone operator got eighty.’

Nevertheless, even the telephone operator did not get a perfect score. In some cases she would be rather impatient and terse when passengers were slow in formulating their question. Some of the test subjects said the advantage of the «speaking computer» was that it could be contacted day and night with any request. There are no stupid questions for computers; they are always patient and anonymous. At the moment an automated telephone information system on public transport is up and running.

The next idea was to provide the automatic travel information not only by telephone, but also through automatic kiosks in stations. Five years ago, experiments were conducted using these systems in Rotterdam and in Utrecht. However, the running costs of the kiosks were a bit of a setback at over €1.10 per minute.

Rothkrantz: ‘What you want is a system that will listen to speech, and at the same time block out any background noise. The performance of the system could be improved by fitting a transmit button, but in that case the passengers would have to know how to use it.’ The SNCF, the French railways, ran an experiment in which the passenger had to stand on a small platform. The kiosk worked only if the person weighed more than 50 kilogrammes, to prevent the system being used by playing children.

Lip-reading helps

Rothkrantz: ‘We think that you can get better results by integrating speech recognition and lip-reading. This is why we came up with the plan to fit a kiosk with an unobtrusive camera that would look at the speaker’s face to detect the mouth. As soon as the mouth starts to move, the system is activated and starts to listen. When the speaker stops moving his mouth, the system stops analysing, even though the background noise goes on.’ This was the brief for doctoral student Jacek Wojdel when he started his research in 1998.

At secondary school, Wojdel represented Poland at the International Physics Olympiad in Helsinki, and he went on to study Applied Physics and Information Technology at Lodz University. He specialised in artificial neural networks, fuzzy systems, and artificial intelligence. In 1996/1997 he was in Delft to complete his Master of Science thesis on the recognition of mouth shapes in video sequences, for which he received a first with honours. In 1998 Wojdel, who by then had mastered the Dutch language, started the research for his doctorate at TU Delft.

For some years now, tourist destinations in Austria have featured electronic kiosks providing information on such subjects as hotel accommodation, ski slopes, and public transport. The kiosks feature a keyboard, touch screen, and speech recognition, and are connected to the Internet.

OVR, the Dutch public transport information company, employs some 350 telephone operators who handle over ten million calls each year. As an additional feature, the system now offers automated services based on speech recognition. Research has shown that the phone operators are much better than speech recognition systems at making sense of calls with lots of background noise.

Recognition percentages of an automatic speech recognition system, an automatic lip-reader, and a combination of both systems at various signal-to-noise ratios. The data were obtained from an experimental system using only separate words from a known word list. The use of lip-reading dramatically increases the «intelligibility»: even at -10 dB, the combined method scores a 65% success rate.



The research focused on two questions. The first was: how does a camera recognise when a person starts or stops speaking? This is known as the onset/offset problem. The second was: is it possible to read a speaking mouth even without sound? For multimodal applications, the two data streams, sound and images, have to be linked. A second doctoral student, Pascal Wiggers, has taken up that challenge. A major problem is that the image and sound streams are not completely synchronous. For example, if a person utters the letter P, you will first see his mouth close, and hear the P afterwards.
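To give an impression of what ‘linking’ the two data streams involves, here is a minimal sketch that resamples the slower stream of video features onto the frame grid of the audio features and compensates for a fixed lag (40 milliseconds in the example discussed in one of the captions below). The 10-millisecond audio step and the feature dimensions are illustrative assumptions, not figures from the Delft research; only the 25 video frames per second is standard.

```python
import numpy as np

def align_streams(audio_feats, video_feats,
                  audio_step_ms=10.0, video_step_ms=40.0, audio_lag_ms=40.0):
    """Resample video features onto the audio frame grid, compensating a fixed lag.

    audio_feats : (Ta, Da) array, one row per (assumed) 10 ms audio frame.
    video_feats : (Tv, Dv) array, one row per video frame at 25 frames/second.
    audio_lag_ms: assumed amount by which the sound lags behind the images.
    Returns an (Ta, Da + Dv) array of concatenated, roughly synchronised features.
    """
    n_audio = len(audio_feats)
    # Time stamp (ms) of each audio frame, shifted back by the assumed lag
    audio_times = np.arange(n_audio) * audio_step_ms - audio_lag_ms
    # Index of the video frame that was visible at that corrected moment
    video_idx = np.clip((audio_times // video_step_ms).astype(int),
                        0, len(video_feats) - 1)
    return np.hstack([audio_feats, video_feats[video_idx]])

# Toy example: one second of data, 100 audio frames (13-dim) and 25 video frames (6-dim)
audio = np.random.randn(100, 13)
video = np.random.randn(25, 6)
print(align_streams(audio, video).shape)   # (100, 19)
```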

Fridge conversations

‘Lip-reading forms part of day-to-day human communication,’ Wojdel says. ‘The aim is to create computer systems that we can communicate with just as easily and naturally as we can with our colleagues.’ In the past few decades, computers have improved mainly in the technical sense: they have become powerful, extremely fast, cheap, and energy-efficient. The user interface, on the other hand, is still in its infancy. The fridge that you can simply talk to in order to tell it what to order from the grocer’s [see D.O. 2001.4] is not for sale yet. Multimodal computer applications are slow in coming. The Ubiquitous Computing (Ubicom) programme, led by Prof. Dr. Inald Lagendijk, emphasised the omnipresent computer, a system that you would always carry with you, and which you could use to check your e-mail messages on the road, or to ask for directions when you get lost. But it would be very awkward, if it were raining for instance, to have to operate a keyboard instead of simply talking to the system. The successor to Ubicom, the Cactus programme, focuses more on the interface side of computer use, and the most natural interface is human speech. Multimodal speech recognition is a highly promising field of research.

Wojdel: ‘My experiments showed that in a quiet environment lip-reading, i.e. the automatic analysis of video images of the mouth, did not contribute much to speech recognition, but as background noise increases, so does the contribution made by the video information. The thing is that speech signals degrade very rapidly, and that is when lip-reading comes in handy. When you’re at a party with loud music, you automatically start to use lip-reading.’ As the background noise increases, lip-reading becomes more important in helping you understand another person.

The geometry of the mouth

How does one teach a computer to read lips? ‘In the first place, it is a question of lots and lots of signal processing’, Wojdel says. ‘In this kind of research, you soon end up with a massive flow of data, an endless stream of numbers, for which no processing methods were previously available. So we had to start by finding a smart method that could extract the lip image data from the video images.’

Combining speech recognition and lip-reading introduces the problem of lack of synchronisation between the audio and the video signals. In this example, the sound signal lags 40 milliseconds behind the video track. The problem is caused by what is known as co-articulation, which results in the «n» sound being pronounced differently when it is next to an «a» sound than when it is next to «oo».



A video image consists of 350 x 280 pixels. At 25 such images per second, the data stream of a video recording quickly takes on massive proportions.
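A back-of-the-envelope calculation gives a feel for those proportions. The frame size and frame rate come from the article; the assumption of three 8-bit colour channels per pixel is added for the example.

```python
width, height = 350, 280          # pixels per video image, as in the article
frames_per_second = 25
bytes_per_pixel = 3               # assumption: 8-bit red, green and blue channels

bytes_per_frame = width * height * bytes_per_pixel
bytes_per_second = bytes_per_frame * frames_per_second
print(f"{bytes_per_frame:,} bytes per frame")                      # 294,000
print(f"{bytes_per_second / 1e6:.2f} MB of raw video per second")  # about 7.35 MB/s
```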

Wojdel: ‘Any two-year-old looking at the video can tell you at a glance whether the speaker has opened or closed his mouth, as simple as that. But to a computerised system, recognising this type of on/off signal is a major mathematical operation.’

A classic method of tackling this problem is to use the mouth’s geometrical data. You could measure, for example, the height and width of the mouth’s contours in a video image to find out whether it is open or closed. ‘However, for this to work, you first need a geometrical model that closely fits the image’, Wojdel says. ‘To start with, you could collect a set of 25 template contours of a speaker’s open and closed mouth producing «oh» and «mmm» sounds, and then try to match your video image of an arbitrary mouth configuration to one of those. This method could be summarised as «fit the model to the image as closely as possible, and measure the model».’
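As an illustration of the ‘measure the model’ step, the sketch below takes an already fitted lip contour and reads off its height and width to decide whether the mouth is open or closed. The contour points and the open/closed threshold are invented for the example and are not Wojdel’s templates.

```python
import numpy as np

def measure_mouth(contour_points, open_ratio=0.35):
    """Measure a fitted mouth contour and classify it as open or closed.

    contour_points : (N, 2) array of (x, y) points on the fitted lip contour.
    open_ratio     : hypothetical threshold on height/width above which the
                     mouth is called "open".
    """
    xs, ys = contour_points[:, 0], contour_points[:, 1]
    width = xs.max() - xs.min()
    height = ys.max() - ys.min()
    ratio = height / width
    return {"width": width, "height": height,
            "state": "open" if ratio > open_ratio else "closed"}

# Toy contours: a flat (closed) mouth and a rounder (open) one
closed = np.array([[x, 2.0 * np.sin(np.pi * x / 60)] for x in range(61)])
opened = np.array([[30 + 25 * np.cos(t), 18 * np.sin(t)]
                   for t in np.linspace(0, 2 * np.pi, 60)])
print(measure_mouth(closed)["state"])   # closed
print(measure_mouth(opened)["state"])   # open
```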

An alternative solution uses statistical operations on the data streams. In this approach, a video image is represented as a matrix of numbers, each number giving the grey value of a pixel: the higher the number, the brighter that part of the image. Wojdel: ‘The statistical approach is used in medical research that compares enormous data sequences in order to find out what causes cancer. The use of statistical analyses on a huge pool of data can help you discover that smoking increases the probability of a person contracting cancer. The method has proved its worth, but you need a huge amount of data, in this case video recordings of the speaking mouth, to be able to use it successfully.’

Wojdel managed to strike a happy medium.

‘After all, I’m not really interested in the contours of the mouth,’ he explains, ‘since these contours vary too much according to the person they belong to, which makes the method much too sensitive to noise. The purely geometrical approach is too abstract for my liking, but to process the data as an undefined series of numbers also seemed a bad idea to me.’

Wojdel’s approach was to represent the geometry of the mouth by an estimate of some of its statistical properties. The first step in the process of extracting data from the video sequences is colour filtering. The lips are redder than the rest of the face, so if you use a filter with a high response to the colour red, you can determine which of the image’s pixels form part of the mouth. The image obtained through filtering is then treated as a statistical probability distribution, and the centre of that distribution is taken to be the centre of the mouth. You can then compose formulas that capture the distance of the lips from the centre and the visible thickness of the lips. The resulting values form a vector that acts as the input for a neural network.
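The sketch below follows that recipe loosely: a simple redness filter, the filtered image treated as a probability distribution whose mean marks the centre of the mouth, and per-sector statistics of lip distance and thickness collected into a feature vector. The particular filter and formulas are illustrative stand-ins, not the exact ones used in the thesis.

```python
import numpy as np

def mouth_feature_vector(rgb_image, n_sectors=8):
    """Rough sketch of a person-independent mouth descriptor.

    1. A simple redness filter turns the RGB image into non-negative weights.
    2. The weights are treated as a probability distribution; its mean gives
       the mouth centre.
    3. Per angular sector, the weighted mean distance of 'lip' pixels to the
       centre and its spread (a crude thickness measure) form the features.
    """
    img = rgb_image.astype(float)
    redness = np.clip(img[..., 0] - 0.5 * (img[..., 1] + img[..., 2]), 0, None)
    weights = redness / (redness.sum() + 1e-9)

    h, w = weights.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (weights * ys).sum(), (weights * xs).sum()   # centre of the distribution

    dy, dx = ys - cy, xs - cx
    radius = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)                            # -pi .. pi
    sector = ((angle + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors

    features = []
    for s in range(n_sectors):
        wsec = weights * (sector == s)
        total = wsec.sum() + 1e-9
        mean_r = (wsec * radius).sum() / total            # lip distance from centre
        spread = np.sqrt((wsec * (radius - mean_r) ** 2).sum() / total)  # ~thickness
        features += [mean_r, spread]
    return np.array(features)                             # input vector for the ANN

# Toy image: a reddish ellipse ("lips") on a grey background
img = np.full((120, 160, 3), 120, dtype=np.uint8)
ys, xs = np.mgrid[0:120, 0:160]
lips = ((xs - 80) / 40) ** 2 + ((ys - 60) / 15) ** 2 < 1
img[lips] = (200, 80, 80)
print(mouth_feature_vector(img).round(1))
```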

Wojdel: ‘This approach has given me a very robust method that is highly insensitive to image noise and largely independent of the personal characteristics of a test subject.’

Image noise can be caused by a number of factors: the image colours may be shaded, surrounding light sources may cause reflections, and noise also results if the test subject’s head moves while the images are being acquired. All these factors make it difficult to recognise the speech signal. The personal characteristics of the test subject, on the other hand, must not hinder the recognition process. As a prerequisite for the development of a multimodal speech recognition system, the movements of the lips must therefore be represented in a way that is independent of the individual subject.

Impersonal

Research in the field of man/machine interactions follows two main tracks. One searches for unique personal characteristics that can be used to unambiguously determine a person’s identity, to allow identification at, say, a cash machine or an airport check-in desk. The other looks at systems that can be used by anybody, in which case the user’s personal characteristics must not cause any interference.

Rothkrantz: ‘An automated system for lip-reading must always be capable of extracting the correct information from the video images and of interpreting it. The user’s appearance should not matter at all. Otherwise, you would have to train the system for each new person, for males or females, large or small mouths, thick or thin lips, etcetera, the same way you have to train speech recognition software to get used to your own voice and pronunciation.’

One personal characteristic is the redness of a person’s lips. One person has pale lips, the next uses a bright red lipstick or some other colour.

Wojdel: ‘To process the data, you could use a simple colour filter that looks only at a limited range of shades of red and discards the rest. On second thoughts, however, I would prefer a black-box approach, in which you use the first image to calibrate your observations for the test subject. You can then use that to train an artificial neural network (ANN) to indicate which parts belong to the mouth and which do not. This works well in practice.’
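As an illustration of such a calibration step, the sketch below trains a single logistic unit, a minimal stand-in for the small ANN mentioned above, on the pixels of one labelled calibration frame and then uses it to estimate, for each pixel of a frame, the probability that it belongs to the mouth. The toy frame, its labelling and the training settings are all assumptions made for the example.

```python
import numpy as np

def train_pixel_classifier(calib_rgb, mouth_mask, epochs=300, lr=1.0):
    """Calibrate a tiny per-pixel classifier on the first frame of a session.

    calib_rgb  : (H, W, 3) calibration image of the test subject.
    mouth_mask : (H, W) boolean array marking mouth pixels in that image (in a
                 real system this first labelling might come from a crude colour
                 filter or a manual step; here it is simply given).
    Returns the mean colour used for calibration plus the weights and bias of a
    single logistic unit.
    """
    X = calib_rgb.reshape(-1, 3).astype(float) / 255.0
    mean = X.mean(axis=0)
    X = X - mean                                  # calibrate to this subject/lighting
    y = mouth_mask.reshape(-1).astype(float)
    w, b = np.zeros(3), 0.0
    for _ in range(epochs):                       # plain batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(mouth | colour)
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return mean, w, b

def mouth_probability(rgb, mean, w, b):
    X = rgb.reshape(-1, 3).astype(float) / 255.0 - mean
    return (1.0 / (1.0 + np.exp(-(X @ w + b)))).reshape(rgb.shape[:2])

# Toy calibration frame: reddish "lips" on a skin-coloured background
frame = np.full((60, 80, 3), (180, 140, 120), dtype=np.uint8)
frame[25:35, 25:55] = (190, 60, 60)
mask = np.zeros((60, 80), dtype=bool)
mask[25:35, 25:55] = True

mean, w, b = train_pixel_classifier(frame, mask)
p = mouth_probability(frame, mean, w, b)
print(round(p[30, 40], 2), round(p[5, 5], 2))     # clearly higher on the lips
```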

The main challenge is obtaining person-independent data.

Wojdel: ‘Mouths vary enormously in shape. Caucasians tend to have thinner lips than people of African descent, and when one of my students, who is of Asian descent, closes his mouth, it looks almost the same as when I say «uh»: a narrow mouth with thick, pursed lips. The ANN-based recognition system has to be able to recognise this. The user has to speak for some 30 seconds before the software can tell what that person’s average mouth shape is. Rather than looking for a person’s exact mouth shape, we’re trying to capture the movements of the mouth, independent of the subject.’

Training neural networks

Wojdel compared different types of artificial neural networks. They were scored according to their ability to recognise the mouth as a facial feature and their ability to classify images from the video stream into three categories: vowel, consonant, or silence.

A geometrical template is often used to determine the position of the mouth. An algorithm looks for matching image transitions to detect the difference between the lips and the rest of the face. Based on these data, a template is matched to the transitions (i.e. the mouth) in a number of iterative steps. The method is not very robust, however, and often yields incorrect results when the lighting of the face changes, for example. If the template location is incorrect, any subsequent steps are also likely to be erroneous.

Using the outline of the mouth as a parameter for lip-reading. Finding the outline was easy enough, but the data resulting from the test proved to be too dependent on the test subject’s characteristics. The outline of the mouth on the left appears to indicate silence, while the outline of the mouth on the right appears to indicate an «a» sound, whereas in fact both mouths are silent.



The neural network primarily assesses the dynamics of changes. If the images change rapidly in succession, the system deduces that the test subject is speaking. If the person’s mouth is closed, the speaker must be silent, and if the mouth closes for only a very short moment, it must be a consonant such as M or P. A neural network can be trained to spot such transitions of changing dynamics, so the system gradually learns to recognise the pattern associated with the letter M or P. The ANN systems were trained to recognise speech boundaries and to distinguish between vowels and consonants.

Wojdel managed to find the right ANN architectures for the purpose. ‘The black-box approach is used to run a series of experiments, and eventually you end up with the network that offers the best performance.’

Neural networks come in different types: feed-forward, time-delayed, and recurrent. Feed-forward networks proved to be unsuitable for the task, performing poorly at recognition. The time-delayed neural network performed a little better. A recurrent network feeds part of its own output back into its input, which makes it difficult to keep stable. The Elman hierarchical neural network (a partially recurrent network) proved to be the most efficient one for distinguishing between vowels and consonants, so that was the one Wojdel picked for further experiments. He trained four Elman neural networks of different sizes, using patterns of 0 to 3 images. Zero images meant silence, with a single sound always taking up more than one image. In the word «mama», for instance, the M takes up three images and the A takes up no less than five. The system recognises these letters partly through the context of the adjacent letters; for some consonants, the context is the only way they can be recognised. Partially recurrent neural networks produced the best results. It will take some time before Wojdel’s system can recognise every sound without errors, but it is already capable of distinguishing between vowels and consonants.
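For readers curious what ‘partially recurrent’ means in practice, the sketch below shows the defining feature of an Elman-style network: the hidden layer sees each new frame together with a copy of its own previous activation, so the classification of a frame depends on its context. The layer sizes are invented and the weights are left random; in the actual research they were of course trained, for instance with the Stuttgart Neural Network Simulator mentioned below.

```python
import numpy as np

class ElmanClassifier:
    """Minimal Elman-style (partially recurrent) network: the hidden layer
    receives the input frame plus a copy of its own previous activation
    (the 'context' units)."""

    CLASSES = ["silence", "vowel", "consonant"]

    def __init__(self, n_in=16, n_hidden=12, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.5, (n_hidden, n_in))        # input -> hidden
        self.W_ctx = rng.normal(0, 0.5, (n_hidden, n_hidden))   # context -> hidden
        self.W_out = rng.normal(0, 0.5, (len(self.CLASSES), n_hidden))
        self.context = np.zeros(n_hidden)

    def step(self, frame_features):
        """Process one video frame's feature vector, keeping recurrent state."""
        h = np.tanh(self.W_in @ frame_features + self.W_ctx @ self.context)
        self.context = h                      # copy the hidden layer back to context
        scores = self.W_out @ h
        probs = np.exp(scores) / np.exp(scores).sum()
        return self.CLASSES[int(np.argmax(probs))], probs

net = ElmanClassifier()
for t, frame in enumerate(np.random.randn(5, 16)):   # five fake frame feature vectors
    label, _ = net.step(frame)
    print(t, label)
```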

Five sentences

The first half of the research was carried out using a quantity of audiovisual data selected from the available scientific literature. Wojdel chose five sentences containing all the Dutch phonemes; a phoneme is the smallest unit of sound that distinguishes one word from another. Three test subjects were asked to read out the five sentences at normal speed, then three sentences were repeated slowly, followed by two sentences at a whisper. The result was a data set of 30 sentences recorded on video.

Next, the 5576 separate video images were analysed. The maximum elongation of the mouth occurred with the vowels. Some consonants, such as P or L, were also easy to recognise from the configuration of the mouth: the P was identified with the last video image before the mouth opens again, the L with the video image that shows the underside of the tongue between the teeth, and the T with the video image halfway through an audibly pronounced T.

Schematic diagram of the lip-reading method developed by Wojdel. It starts with a video image that is passed through two different colour filters. The diagram shows two of the five filter methods used by Wojdel. Ideally, the filtered images should show only the mouth, but in practice the images will always contain sections of other elements. Therefore Wojdel used statistical methods to estimate the geometry of the mouth.

Although the geometry of the mouth is essential for lip-reading, it does not contain all the necessary information. Wojdel discovered that data on the visibility of teeth and tongue provided useful information. His algorithm was able to estimate both parameters based on the size and location of light and dark areas in the image.


For other consonants, such as H and K, it was difficult to find an unambiguous image, so the image between the two adjacent images was selected. Initially, Wojdel had started to work with ten people, both native and non-native speakers, instead of three. However, processing the video recordings turned out to be so time-consuming that he had to limit himself to the data sets of three subjects, with the first person speaking at normal speed, the second speaking slowly, and the third whispering. These three sets were used as a test set, and the remaining 27 sets were used to train the neural network, using the Stuttgart Neural Network Simulator (SNNS). The test results were good: Wojdel’s model proved to be readily capable of distinguishing between a test subject’s speech and silence.

Promise

Next, Wojdel continued with a new, wider data set that included sound recordings, so that sound and image integration experiments could be conducted. For these experiments, some three hours of speech recordings, five test subjects were asked to read out a series of 10 digits, pausing between the digits and speaking at varying speeds. The system managed to name the digits with an average accuracy of 70 percent. If the system was given the number of digits pronounced by the test subject, it got about 80 percent right, and if no information was given about the number of digits, the recognition rate fell to 60 percent. A deaf test subject, who had learned lip-reading at the Effata Institute for the deaf in Zoetermeer, recognised 65 percent of the digits. A major source of errors was the fact that the system often recognised silences either too late or too early.

‘The system is still far from perfect, but the method does show lots of promise,’ Jacek Wojdel concludes. When he is awarded his doctorate in the spring of 2003, he will continue working at TU Delft for at least another two years on research into the integration of speech recognition and lip-reading.

‘Just as well,’ Rothkrantz says. ‘The great thing about this research is that it combines so many different disciplines: linguistics, image processing, and audio research, which is the way we prefer it at TU Delft. These are all very different fields, and you will never find them together at a scientific conference. In the lip-reading research they all come together. TU Delft did not have much of a reputation in the field of speech recognition, even though it is a hot topic. We hope we have managed to put ourselves firmly on the map in one fell swoop.’

For more information please contact

Jacek Wojdel M.Sc., phone +31 15 278 8543, e-mail j.c.wojdel@cs.tudelft.nl, or

Dr. Leon Rothkrantz, phone +31 15 278 7504, e-mail l.j.m.rothkrantz@cs.tudelft.nl

If the speech data analysed using Wojdel’s method are visualised, we can see that similar-sounding phonemes form clear groups: «a» sounds appear next to other «a» sounds, and the same goes for «o» sounds. Phonemes that cannot be distinguished by sight, such as P and B, also lie very close together. The visualisation shows that Wojdel’s data analysis can be used as a basis for lip-reading.

Schematic diagram showing the architecture of the three neural networks used by Wojdel for his speech data analysis.

Wojdel’s system, using a neural network, can determine whether a person is speaking and whether the sound produced is a vowel or a consonant. The system uses the Elman hierarchical neural network, which is a partially recurrent network.


This interface shows that the automated lip-reader is operational. The lip-reading machine is able to recognise a series of digits, using so-called hidden Markov models for the recognition process. The red rectangles represent the correctly detected digits; the pink rectangles represent probable and less probable alternatives for the detected digits. The bar at the bottom of the window shows the digits actually pronounced. The recognition success rate using this method is about 80 percent.
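As an illustration of how hidden Markov models can be used for this kind of digit recognition, the sketch below scores an observed feature sequence against one (already trained) model per digit with the forward algorithm and picks the best-scoring digit. The toy models and feature values are invented, and Wojdel’s real system recognised connected strings of digits and produced alternative hypotheses, which this isolated-word sketch does not attempt.

```python
import numpy as np

def log_likelihood(obs, means, variances, log_trans, log_start):
    """Forward algorithm (log domain) for one HMM with diagonal Gaussian emissions.

    obs       : (T, D) feature sequence (e.g. combined audio and lip features).
    means     : (S, D) per-state emission means, variances : (S, D) variances.
    log_trans : (S, S) log transition matrix, log_start : (S,) log initial probs.
    """
    def emission_logprob(x):
        return -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)

    alpha = log_start + emission_logprob(obs[0])
    for x in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) \
                + emission_logprob(x)
    return np.logaddexp.reduce(alpha)

def recognise_digit(obs, digit_models):
    """Pick the digit whose (previously trained) HMM explains the sequence best."""
    scores = {d: log_likelihood(obs, **m) for d, m in digit_models.items()}
    return max(scores, key=scores.get), scores

# Two toy 3-state left-to-right models standing in for trained digit HMMs
def toy_model(offset, n_states=3, dim=4):
    trans = np.full((n_states, n_states), 1e-3)
    for s in range(n_states):                       # left-to-right structure
        trans[s, s] = 0.6
        if s + 1 < n_states:
            trans[s, s + 1] = 0.4
    trans /= trans.sum(axis=1, keepdims=True)
    return {"means": offset + np.arange(n_states)[:, None] * np.ones(dim),
            "variances": np.ones((n_states, dim)),
            "log_trans": np.log(trans),
            "log_start": np.log(np.array([1 - 2e-3, 1e-3, 1e-3]))}

models = {"one": toy_model(0.0), "two": toy_model(5.0)}
obs = np.vstack([np.full((4, 4), 5.0), np.full((4, 4), 6.0), np.full((4, 4), 7.0)])
print(recognise_digit(obs, models)[0])   # "two": its states are centred near 5, 6, 7
```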
