
Taking minutes by digital secretary

Australian Defence Science
Winter Issue: Volume 14 Number 1 2006

The automated digital minute-taker is fast approaching for the Australian Defence Force, with DSTO readying new speech recognition applications for use. The result is Multi-Speak.

The significance of the work begins with the widely held view that speech is the most natural form of communication between humans, and that human interactions with computers are also more efficient when carried out using speech rather than keyboard and mouse.

Correspondingly, the command post of the future is envisaged as an environment with embedded communications technologies that listen to the utterances of personnel, responding to voice command requests for information from computer-controlled networks and providing automatic transcription of discussions. Meanwhile, instructions to people are delivered by computer as near-natural synthesised speech. These technologies promise to improve the efficiency of information management, collaborative planning and decision-making processes.

Prototype facilities including the Future Operations Concept Analysis Laboratory (FOCAL), the Intense Collaboration Space (ICS) and the Deployable Joint Force Headquarters (DJFHQ) are being used to evaluate such technologies.

Automated data entry

One area of speech recognition work undertaken by DSTO is the automation of data entry.

According to DSTO researcher Ahmad Hashemi-Sakhtsari, DSTO has integrated the commercially available Dragon NaturallySpeaking speech recognition software, as a speaker-dependent speech recogniser, with Lotus Notes databases and Excel spreadsheets for use in DJFHQ. He says, “Personnel using the system are able to select forms and sections in the forms and enter data in different fields using direct speech input rather than the keyboard and mouse. The range of processes using speech input includes formatted data entry, form-filling, and free dictation.”

The voice of the data entry operator can be recorded by a microphone headset, or by an array microphone mounted on a desk or wall connected to a computer, meaning that the operator need not necessarily be tethered by cable to the system.

Before Dragon NaturallySpeaking can be used, a speech profile has to be generated for the user. This involves training the software to decipher the prosodic features of the individual's speech. As the user reads through passages of text, the software generates models or templates that capture the characteristics of that speaker from units of sound known as phonemes.

To enable the use of Dragon NaturallySpeaking for Defence data entry purposes, DSTO has compiled a specific vocabulary of military acronyms and abbreviations as a web-based acronym manager that contains 54,000 acronyms and associated definitions.

Automatic voice transcription

For transcription of group discussions, the DSTO team has developed an application known as the Automatic Transcriber of Meetings (AuTM).

Each meeting attendee wears a microphone headset, and one or more participants have access to a computer on which all spoken utterances are converted to text and displayed.

Using Dragon NaturallySpeaking software, the AuTM application records the words of each speaker as separate audio files, logs the start and stop times of every utterance, and transcribes these into text for display on the computer screens. Each delivery from the same speaker is colour-coded for easy visual identification of his or her contributions. The start times of agenda items are also recorded.

A moderator can step through agenda items and highlight significant aspects of the discussion such as motions and action items. AuTM can run on a single computer, or on multiple computers as a distributed application, with all of this information displayed at near-real-time rates.

The AuTM program highlights overlap between spoken utterances from the participants.
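
The logging described above (per-speaker utterances with start and stop times) suggests a simple way to interleave a transcript and flag overlapping speech. The sketch below is illustrative only and is not taken from AuTM: the `Utterance` type, the `interleave` and `find_overlaps` functions, and the sample meeting data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float
    text: str


def interleave(utterances):
    """Return utterances ordered by start time (ties broken by speaker name)."""
    return sorted(utterances, key=lambda u: (u.start, u.speaker))


def find_overlaps(utterances):
    """Return pairs of utterances from different speakers that overlap in time."""
    ordered = interleave(utterances)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # later utterances start even later, so none can overlap
            if second.speaker != first.speaker:
                overlaps.append((first, second))
    return overlaps


# Hypothetical meeting log, invented for illustration.
meeting = [
    Utterance("Chair", 0.0, 4.0, "Item one is the exercise schedule."),
    Utterance("Ops", 3.5, 6.0, "We have a draft ready."),
    Utterance("Logistics", 6.5, 9.0, "Transport is confirmed."),
]

for u in interleave(meeting):
    print(f"[{u.start:6.1f}-{u.end:6.1f}] {u.speaker}: {u.text}")

for a, b in find_overlaps(meeting):
    print(f"Overlap: {a.speaker} and {b.speaker}")
```

Because each speaker has a separate channel, overlapping speech need not be discarded; it only needs to be marked so that a reader of the transcript knows two people were talking at once.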

Later correction of speech recognition errors can be carried out off-line by checking the transcribed text against the matching audio file recording. The final transcript, as an HTML or Word document, can be processed by text summarisation and concept mapping tools.

The work on AuTM has now been readied for market in the form of a product called Multi-Speak, which is being released by Voice Perfect Systems under a commercialisation and licensing agreement with DSTO. Multi-Speak produces a fully interleaved and attributed text and audio record of the meeting, even when people speak simultaneously.

Factors that affect performance

The DSTO researchers have considered a number of factors that could influence the usability and effectiveness of transcription services in Defence.

These include background or environmental noises, and room reverberation. As countermeasures, the DSTO team has developed adaptive noise cancellation techniques for incorporation into meeting rooms.
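
The article does not say which adaptive noise cancellation technique DSTO uses. As a general illustration, the sketch below shows the textbook least-mean-squares (LMS) canceller, which assumes a second microphone that picks up only the noise. The function name `lms_cancel`, the filter length, the step size and the synthetic signals are all assumptions made for the example.

```python
import math
import random


def lms_cancel(primary, reference, taps=4, mu=0.01):
    """Subtract an adaptively filtered noise reference from the primary channel."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate  # the error doubles as the cleaned output
        weights = [w + 2 * mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


random.seed(0)
n = 4000
# Stand-in for speech: a slow sinusoid (purely synthetic).
speech = [0.5 * math.sin(2 * math.pi * 0.01 * t) for t in range(n)]
noise = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Assumed noise path: the room delays the noise by one sample and attenuates it.
noise_at_mic = [0.0] + [0.8 * v for v in noise[:-1]]
primary = [s + v for s, v in zip(speech, noise_at_mic)]

cleaned = lms_cancel(primary, noise)


def mean_square_error(signal, target, skip=2000):
    """Average squared difference after the filter has had time to adapt."""
    diffs = [(a - b) ** 2 for a, b in zip(signal[skip:], target[skip:])]
    return sum(diffs) / len(diffs)


print("before:", mean_square_error(primary, speech))
print("after: ", mean_square_error(cleaned, speech))
```

The filter learns the path from the noise source to the speech microphone and subtracts its estimate, so the residual should sit much closer to the speech than the raw microphone signal does. Real rooms need far longer filters and must cope with speech leaking into the reference channel.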

The research is also investigating the effects of ‘disfluencies’ (speaker-generated artefacts) such as ‘ums’, ‘urs’, coughs and tongue clicks. By modelling disfluencies and subtracting these artefacts from the speech to be recognised, the recognition vocabulary can be restricted to legitimate utterances, resulting in a more efficient transcription process.

Consideration is also being given to the effects of telephony bandwidth limitation on the performance of speech recognition systems. This is to better understand the application of the technology in teleconferencing situations.

Speaker-related research

DSTO’s speech processing research to date has looked at situations in which one speaker is close to another, so that each microphone records co-channel interference as well as the primary speech.

Speaker separation processes allow clean speech from each speaker to be separately transcribed. Separating speech from a group of speakers using a single microphone is a more challenging aspect of the work.

To capture and monitor the work activities of several individuals in a room, it is necessary to locate and track speakers as they move around. Speaker localisation and tracking applications used in multimedia teleconferencing can be used to steer microphone arrays and cameras to capture individuals as they work, in order to ascertain, for example, particular patterns of behaviour.
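
One common building block for speaker localisation, though not necessarily the one used in the DSTO work, is to estimate the time difference of arrival between two microphones by cross-correlation and convert it to a bearing. The sketch below assumes a far-field source and a two-microphone array; the function names, the 16 kHz sample rate and the 0.2 m spacing are illustrative assumptions.

```python
import math
import random


def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) by which mic_b trails mic_a, via cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    n = len(mic_a)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_degrees(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Convert a delay into a far-field angle of arrival relative to broadside."""
    ratio = lag / sample_rate * speed_of_sound / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # keep the value inside asin's domain
    return math.degrees(math.asin(ratio))


random.seed(1)
# Synthetic broadband source standing in for a talker.
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
true_lag = 5
mic_a = source
mic_b = [0.0] * true_lag + source[:-true_lag]  # same signal arriving 5 samples later

lag = estimate_delay(mic_a, mic_b, max_lag=20)
print("estimated lag:", lag)
print("bearing:", bearing_degrees(lag, sample_rate=16000, mic_spacing=0.2))
```

Repeating the estimate on successive short frames gives a track of the bearing over time, which is the kind of signal that could be used to steer a microphone array or camera towards a moving speaker.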

DSTO, working with university researchers, is making good progress in this area of research.
