SpeechBuilder
Introduction
The SpeechBuilder utility is intended to allow people unfamiliar with
speech and language processing to create their own speech-based
application. The focus of SpeechBuilder version 1.0 is to allow
developers to specify the knowledge representation and linguistic
constraints necessary to automate the design of speech recognition and
natural language understanding. To do this, SpeechBuilder uses a
simple web-based interface which allows a developer to describe the
important semantic concepts (e.g., objects, attributes) for their
application, and to show, via example sentences, what kinds of actions
can be performed. Once the developer has provided this
information, along with the URL to their CGI-based application, they
can use SpeechBuilder to automatically create their own spoken
dialogue system which they, and others, can talk to in order to access
information.
Background
SpeechBuilder makes use of human language technology (HLT) (e.g.,
speech recognition, language understanding, system architecture, etc)
developed by scientists in the Spoken Language Systems Group at
the MIT Computer Science and Artificial Intelligence Laboratory. Researchers there are trying to develop next-generation human language technologies which will allow users to converse
naturally with computers, anywhere, anytime. In contrast to many
current speech-based applications which constrain what a user can say
during a dialogue, their goal is to provide much more freedom to the
user in the way they talk with computers. In order to demonstrate and
improve this technology, they have created several conversational
systems which have been publicly deployed on toll-free telephone
numbers in North America, including the widely used Jupiter system for weather forecast
information, the Pegasus system for flight
status information, and the more recent Mercury system for flight information and
pricing. If you have not used these systems before, please try them to
see how this technology works! (i.e., donate your voice to
science!) If you are in the Boston area, you can visit the MIT
Museum and try talking to our systems which have a display for
output.
Although these applications have been successful, there are limited
resources at MIT to develop a large number of new domains. In order to
encourage and enable others to build their own domains, the
SpeechBuilder utility was created to make it easier for HLT novices to
create their own application(s), or for researchers learning about
speech and language to create a prototype application which they can
subsequently modify manually. If successful, this utility will
benefit others by allowing them to tailor an application to their
particular interests. In addition, it will facilitate the collection
of a wide variety of conversational speech data which can be used to
further improve the basic human language technologies used by these
applications. SpeechBuilder developers will also stress-test the
ability of HLT to be rapidly ported to a variety of application
domains with different vocabularies, grammars, knowledge
representations, and discourse and dialogue structures.
Architecture
A SpeechBuilder application has two basic parts: first, the human
language technologies which perform speech recognition, language
understanding etc., and second, the application program which
takes a semantic representation produced by the language understanding
component and determines what information to return to the user. The
HLTs are automatically configured
by SpeechBuilder using information provided by the developer, and run
on compute servers residing at MIT. The application consists of a
program (e.g., Perl script) created by the developer using the Common
Gateway Interface (CGI) protocol, running on a CGI-capable web server
anywhere on the Internet. The semantic representation produced by the
HLTs takes the form of conventional CGI parameters which get passed to
the application program via standard HTTP protocols.
There are four CGI parameters which are currently used by
SpeechBuilder: text, action, frame, and
history. As may be surmised, the text parameter contains the
words which were understood from the user, while the action parameter
specifies the kind of action being requested by the user. The frame
parameter lists the semantic concepts which were found in the
utterance. In their simplest form, semantic concepts are essentially
key/value pairs (e.g., color=blue, city=Boston, etc).
More complex semantic concepts have hierarchical structure such
as: time=(hour=11,minute=30,xm=AM), or
item=(object=box,beside=(object=table))
The following examples illustrate possible action and frame values for
different user queries:
turn on the lights in the kitchen
action=set&frame=(object=lights, room=kitchen, value=on)

will it be raining in Boston on Friday
action=verify&frame=(city=Boston,day=Friday,property=rain)

are there any chinese restaurants on Main Street
action=identify&frame=(object=(type=restaurant, cuisine=chinese, on=(street=Main,ext=Street)))

I want to fly from Boston to San Francisco arriving before ten a m
action=list&frame=(src=BOS,dest=SFO, arrival_time=(relative=before,time=(hour=10,xm=AM)))
Since a CGI program does not retain any state information (e.g.,
dialogue), the history parameter enables an application to provide
information back to the HLT servers that can be used to help interpret
subsequent queries. The history parameter usually contains the
contents of the resolved frame parameter of the previous utterance.
For example, in the following exchange the history parameter is used
to keep track of local discourse context:
what is the phone number of John Smith
action=identify&frame=(property=phone,name=John+Smith)

what about his email address
action=identify&frame=(property=email)&history=(property=phone,name=John+Smith)

what about Jane Doe
action=identify&frame=(name=Jane+Doe)&history=(property=email,name=John+Smith)
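The action, frame, and history values above reach the application URL-encoded as ordinary CGI parameters. As a sketch (in Python for illustration; the starter kit itself is in Perl), the standard library can decode them. The query string here is a hypothetical example in the shape shown above:

```python
from urllib.parse import parse_qs

# A hypothetical query string, in the shape SpeechBuilder sends to the
# application (the real request arrives through the standard CGI mechanism).
query = "action=identify&frame=(property=email)&history=(property=phone,name=John+Smith)"

# parse_qs splits on '&' and on the first '=' of each field, and decodes
# '+' back into a space, so "John+Smith" becomes "John Smith".
params = {key: values[0] for key, values in parse_qs(query).items()}
```

Any CGI library (including Perl's CGI.pm) performs the same decoding, so the application never sees the raw encoding.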
The remainder of this document provides information which is intended
to help developers use SpeechBuilder to create their own speech
application. The section on knowledge
representation provides information about the format used by
SpeechBuilder to specify concepts and provide linguistic constraints.
It also describes how a developer can produce hierarchical frames
through the use of bracketing in the example action sentences.
The section on using the web interface
describes the mechanics of how a developer actually uses the
SpeechBuilder utility. Finally, the section on creating a CGI application describes some of the
issues involved in parsing the frame and history parameters, and
provides details on how a developer can download a startup kit
(written in Perl) which provides a useful module for parsing these
parameters, as well as a sample application.
Knowledge Representation
Keys and actions
Semantic concepts and linguistic constraints are currently specified
in SpeechBuilder via keys and actions. Keys usually define classes of
semantically equivalent words or word sequences, so that all the
entries of a key class should play the same role in an utterance.
All concepts which are expected to reside in a frame must be a
member of a key class. The following table contains example
keys.
Key        Examples
color      red, green, blue
day        Monday, Tuesday, Wednesday
room       living room, dining room, kitchen
appliance  television, radio, VCR
Actions define classes of functionally equivalent sentences, so that
all the entries of an action class perform the same operation in the
application. All example sentences will generate the appropriate
action CGI parameter if they are spoken by the user. SpeechBuilder
will generalize all example sentences containing particular key
entries to all the entries of the same key class. SpeechBuilder also
tries to generalize the non-key words in the example sentences so that
it can understand a wider range of user queries than were provided by
the developer. However, if the user does say something that cannot be
understood, the action CGI parameter will have a value of
action=unknown, while the frame parameter will contain all the
keys which were decoded from the speech signal. The following table
contains example actions.
Action     Examples
identify   what is the forecast for Boston
           what will the temperature be on Tuesday
           I would like to know today's weather in Denver
set        turn the radio on in the kitchen please
           can you please turn off the dining room lights
           turn on the tv in the living room
good_bye   good bye
           thank you very much good bye
           see you later
Note that capitalization is unnecessary in the example sentences. In
fact, since SpeechBuilder is case-sensitive, words should be
represented consistently in the example sentences. No form of
punctuation should be used.
JSGF Formatting
In order to allow a developer to efficiently convey minor variations
in structurally similar keys or actions, SpeechBuilder parses a subset
of the Java Speech Grammar Format (JSGF). Developers can use these
diacritics to specify optional words or word sequences, or can
specify several alternate words or phrases as part of an input, by
separating them by the vertical bar, |. The parentheses, (), (e.g.,
(one | two three)), are used to indicate that one of the
elements is to be used. The square brackets, [], (e.g., [please],
[one | two]) are used to indicate an optional item or items.
For example, a developer could enter the sentence, [please] (put |
place) the red box on the table. SpeechBuilder would automatically
treat this as if the developer had entered all four variations of the
sentence:
put the red box on the table
please put the red box on the table
place the red box on the table
please place the red box on the table
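The expansion behavior described here can be sketched as a small recursive parser for this JSGF subset. This is an illustrative reimplementation (Python, not SpeechBuilder's own code); it handles (a | b) alternation and [optional] items, possibly nested:

```python
def expand(template):
    """Expand a JSGF-subset template into all of its word sequences.
    Supports (a | b) alternation and [optional] items, nested arbitrarily."""
    def parse_seq(s, i, stop):
        # Parse a plain sequence until a character in `stop`; return the
        # list of variants (each a list of words) and the next index.
        variants = [[]]
        word = ""
        def flush():
            nonlocal word, variants
            if word:
                variants = [v + [word] for v in variants]
                word = ""
        while i < len(s):
            c = s[i]
            if c in stop:
                break
            if c == "(":                       # required group: one alternative
                flush()
                subs, i = parse_alt(s, i + 1, ")")
                variants = [v + alt for v in variants for alt in subs]
            elif c == "[":                     # optional group: alternatives or nothing
                flush()
                subs, i = parse_alt(s, i + 1, "]")
                variants = [v + alt for v in variants for alt in subs + [[]]]
            elif c.isspace():
                flush()
            else:
                word += c
            i += 1
        flush()
        return variants, i

    def parse_alt(s, i, close):
        # Parse '|'-separated alternatives up to the closing bracket.
        alts = []
        while True:
            seq, i = parse_seq(s, i, "|" + close)
            alts.extend(seq)
            if i < len(s) and s[i] == "|":
                i += 1
            else:
                break
        return alts, i

    seqs, _ = parse_alt(template, 0, "")
    return [" ".join(v) for v in seqs]
```

For the sentence above, expand("[please] (put | place) the red box on the table") yields exactly the four variations listed.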
Regularizing semantically equivalent concepts
In addition to the standard JSGF markup characters, the curly braces,
{}, can be used to regularize the output form of a set of alternative
entries which are semantically equivalent. The default output form
for any key entry is just the value of the entry itself. For example
if a city class contains entries such as Boston and
Philadelphia, the corresponding frame representation would be
city=Boston, or city=Philadelphia, respectively.
In cases where there are alternative ways of saying the same semantic
concept however (e.g., Philly), the curly braces can be used to
produce a consistent output form for all of the alternatives. In this
example, this could be accomplished by having an additional entry of
Philly {Philadelphia} in the city class, or by modifying the
Philadelphia entry to be (Philadelphia | Philly)
{Philadelphia}. Either method would ensure that the use of the
word Philly would produce a frame output of
city=Philadelphia.
The ability to control the output form of a key entry gives the
developer added flexibility, and can make the application program
easier to create. First, the application does not need to know about
all of the possible variations for a given concept. Second, it allows
the developer the ability to perform simple translations. For
instance, when recognizing a phone number, the developer might add
translations like one {1} to ensure that the application
receives an actual numerical phone number. Similarly, a key
containing cities might map names to appropriate codes (e.g.,
Boston [Massachusetts] {BOS}), to reduce processing in the
application.
Since the action CGI parameter is determined entirely by which action
entry matched the utterance, there is no need for the {} markup
characters in the action entries.
Handling ambiguities
In order to deal with ambiguous keys, it is possible to force
SpeechBuilder to treat a set of words in an action as having
originated from a certain key. For instance, if Metallica is
both an artist and an album, and an example reads, tell me what
songs were sung by Metallica, then SpeechBuilder will, by default,
choose either the album key or the artist key arbitrarily (and
possibly incorrectly). In cases like this, the developer can specify
which key class to use by enclosing the ambiguous words with <>
diacritics, and placing the name of the key class in front, followed
by =. In the above example, tell me what songs were sung by
artist=<Metallica> would force SpeechBuilder to treat Metallica
as an artist when trying to generalize that particular sentence. Note
that a simpler way to handle ambiguity is to use unambiguous terms
wherever possible in the example sentences.
Occasionally there will be words (or possibly even phrases) with
multiple realizations in a domain, only a subset of which are
considered to be semantically important by the developer. A typical
scenario involves words which have alternative syntactic uses. For
example, the word may can be used as a noun (I would like a
flight on may third), or as a verb (may I get a flight from
Boston to Dallas). If the developer simply enters the word
may in a month class, the output form month=may will be
produced for both example sentences. This outcome can be avoided by
using different representations for different variations of a word
(e.g., may vs. May). Since SpeechBuilder is case-sensitive, it treats
these variations as completely different entries (although their
pronunciations are the same). Note that the developer must take care
to use a consistent format throughout the example sentences.
Specifying hierarchical concepts
SpeechBuilder allows the developer to build a structured grammar when
this is desired. To do this, the developer needs to bracket
parts of some of the example sentences in the action classes, in order
that the system may learn where the structure lies. If the developer
chooses to bracket a sentence, they must bracket the entire
sentence, since SpeechBuilder will treat it as the complete
description of that example. Also, SpeechBuilder only generalizes
hierarchy within a particular action class. This means that if two
separate actions exhibit similar hierarchical structures, at least one
sentence in each class must be fully bracketed by the developer.
To bracket a sentence, the developer encloses the substructure which
they wish to separate in parentheses, preceded by a name for the
substructure followed by either == or =, depending on whether the
developer desires to use strict or flattened hierarchy.
Bracketing results in SpeechBuilder creating hierarchy in the frame
parameter. Hierarchy can also be more than one level deep, and can
mix both types of hierarchy. Note that bracketing a sentence only
involves pointing out the hierarchy -- the keys are still
automatically discovered by SpeechBuilder.
Strict Hierarchy
The developer can specify strict hierarchy by using the == in
bracketing a subsection of text. When strict hierarchy is used, all of
the keys under the bracketed region are treated as they normally
would, and each key becomes a key=value pair within the subframe, as
described previously. This provides a consistent means to bracket
subsections, yet have each subsection retain the same keys and values
as the developer would expect in a flat grammar. It also provides
increased flexibility, since multiple levels of recursion will
generate a consistent structure which is easy for the application to
deal with.
For instance, if the following sentence was bracketed as Please put
source==(the blue box) destination==(on the table), its frame
would look like:
source=(color=blue, object=box), destination=(object=table)
With an extra level of hierarchy, Please put source==(the blue box)
destination==(on the table in the location==(kitchen)) would become:
source=(color=blue, object=box), destination=(relative=on, object=table, location=(room=kitchen))
Flattened Hierarchy
It is not always desirable to receive all of the key/value pairs
within a hierarchical section of the grammar. For instance, a query in
a flight domain might read, Book me a flight from Boston to
Dallas. In this case, an example using strict hierarchy such as
Book me a flight from source==(Boston) to
destination==(Dallas), would result in a frame like,
source=(city=Boston), destination=(city=Dallas). However, if
the developer knows that sources and destinations are always cities,
it might be simpler just to receive source and destination as keys,
without the nested city labels. Flattened hierarchy allows the
developer to do exactly this.
When the developer specifies flattened hierarchy by using an = in
bracketing a subsection of text, instead of the value of the
subsection name being a set of key/value pairs enclosed in
parentheses, the value will be composed of all of the keys inside the
parentheses, separated by spaces. For the previous example, we could
bracket the sentence as Book me a flight from source=(Boston) to
destination=(Dallas), and the resulting frame would be
source=Boston, destination=Dallas (i.e., without any
city= key/value pairs or parenthesized subsections). Thus, if
the developer knows the type of key that will appear within a
hierarchical subsection, he or she won't have to deal with parsing the
key's name.
If more than one key from the same class appears, SpeechBuilder will
separate those values by underbars. This allows developers to easily build
parameters which are made up of more than one entry from a class, like
a phone number, without having to dig through the key name for each
entry. For instance, if there was a digit class holding the digits
zero through nine, and the developer gave the example, Who has the
phone number number=(two five three seven seven one nine)
SpeechBuilder would return a frame containing number=2537719
(if the digits were keys, and were reduced with {}'s to their numeric
form.)
Although it might be possible to automatically determine which regions
should be flattened by the number of concepts inside, we chose to let
the developer specify the type of hierarchy they wanted so that they
could be assured of exactly how the system would act.
Using the web interface
Registering as a developer
To use the SpeechBuilder
web interface, you must register as a developer to get an account.
Whenever you visit the SpeechBuilder site, you will need to enter your
account name and password to gain access to your applications. If you
click on the cancel button in the pop-up login window the first
time you visit the site, you will automatically be taken to the
registration page to provide basic information such as your name and
email address, and select a developer id (your user name) and a
password. You can subsequently use this user name and password to
login to SpeechBuilder and create your speech applications. You will
be emailed a code number which you will need when you call the
developer telephone number to talk to your application.
Building a speech application
When you login to SpeechBuilder, you will be able to modify or delete
any of the applications which you have previously created, or create
new ones. To create a speech application, a developer needs to
provide to SpeechBuilder 1) a comprehensive set of semantic concepts,
and example queries for their particular domain (specified in terms of
keys and actions), and 2) the URL of a CGI script which will take the CGI parameters produced by for a user query,
and provide the appropriate information. Once this has been done, the
developer 1) presses a button for SpeechBuilder to compile the
information for their application into a form needed by the human
language components, 2) presses another button to start the human
language components for their application running (on our web
server), and 3) calls the SpeechBuilder developer phone number and
starts talking to their system. When the developer calls the
developer phone number, they will first be asked to enter their
developer code number (which is emailed to them when they register).
The SpeechBuilder system will then automatically switch to their
particular running application.
Selecting a Domain
When a developer first logs in to SpeechBuilder, they are presented
with a domain selection menu which
allows them to select and edit one of the existing domains in their
directory. The menu also contains a set of buttons allowing them to
Add, Remove, Copy, and Rename domains.
Editing a Domain
Once a domain has been selected for editing, the developer is
presented with a domain editing
form. The upper right portion of this form contains a
summary of the semantic classes which have been defined for the
domain. There are three types of semantic classes which are listed.
The first two lists contain all of the names of the actions and keys
which have been explicitly defined by the developer. The third
contains a list of all of the hierarchical classes which have been
automatically determined from the developer's bracketing. These are
called H-Keys, for hierarchical keys, since they
represent concepts of hierarchy in the domain. This list can help the
developer spot mistakes, such as making typographical errors, or using
two different names for the same concept.
The domain editing form also contains a detailed listing of all the
classes in the domain. For each class, it tells you whether the class
is a key or an action, and allows you to change its designation. It
also gives Edit and Delete buttons for each individual
class, allowing you to modify them or delete them one at a time.
At the bottom of the list of classes is a text box, with a
corresponding Add button and drop-down box which allow you to
name a new class, and add it as either a key or an action.
Below the class list is a box where the developer can enter a URL for
the domain's CGI-based application. Whatever the developer enters
into this box will be automatically contacted by SpeechBuilder
whenever someone uses the domain. By changing this entry, the
developer can use an application located anywhere on the Internet, and
even switch applications when necessary.
Below the URL box, SpeechBuilder has a set of five buttons. The first
one, Apply Changes, makes any modifications made to the domain
permanent. Other action buttons, such as those that add or edit
classes, also make other changes to the domain permanent.
The Reduce button takes all of the example sentences given by
the developer and simulates running them through the final system,
showing a table containing the utterances and the CGI parameters which
they would generate, as depicted here. This allows the developer to
debug the domain and make sure that all of their examples work as
expected. Note that a developer may click on any of the reduced
sentences to see the parse tree
for that sentence.
The Build button tells SpeechBuilder to build all the necessary
internal files to actually run the domain. The Force Build
button is slightly more aggressive at clearing and rebuilding the
internal files.
The Start and Stop buttons start and stop the actual
human language technology servers for that particular domain. The
domain which is run is configured as it was the last time the
developer clicked the Build button -- any changes since that
point, while being retained by SpeechBuilder, are not used in the
actual running system. In the current SpeechBuilder setup, the most
recent domain to be run is the one the user is connected to when they
phone SpeechBuilder, although in the future, we will have several ways
of allowing multiple simultaneous SpeechBuilder domains to be run.
Adding and Removing Entries
When the developer picks a particular class to edit, another section
of the SpeechBuilder interface appears. The class editor is identical for both keys
and actions, and contains two text editing windows. The first window
lists all of the entries in the current class. The developer can
select one or more existing entries from the list. The second window
is initially empty, and allows the developer to add entries. The
developer can type one entry per line, and when changes are applied,
these entries are made permanent and moved into the existing entry
list.
Next to the existing list are a pair of radio buttons, marked
edit and delete. If the edit button is selected when the
developer applies the changes, all of the selected entries are copied
to the second box and removed from the existing list. This allows the
developer to modify existing entries without having to retype them. If
the delete button is selected, then the chosen entries are simply
removed from the class.
XML representation
The concepts and actions specified by the developer are stored in an
XML representation on our local filesystem. Since the
SpeechBuilder utility is a CGI script, the file is modified every time
changes are made to the domain. If a developer wishes to edit the XML
file themselves, it is possible to download the XML file by selecting
that option at the upper left of the SpeechBuilder utility.
Similarly, it is possible to upload an XML file into the user's
SpeechBuilder directory. Note that an uploaded XML file will
completely replace the contents of any existing XML file in that
domain, so care should be used when exercising this capability.
Also note that although an XML parser is used to check the syntax of
any uploaded XML file, a developer should use caution when editing an
XML file manually.
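Since an uploaded XML file replaces the existing one wholesale, it is worth checking a hand-edited file for well-formedness locally before uploading. A minimal sketch (Python here for illustration, using a stock XML parser; the element names in the test are hypothetical):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML text parses cleanly.  Hand-edited domain
    files are easy to break, so check before uploading to SpeechBuilder."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

This only catches syntax errors; SpeechBuilder's own parser remains the final check on upload.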
Deploying a speech application
As mentioned earlier, once a domain has been built a developer can
press the Start button to deploy an application domain. They
can then talk to their system by calling the SpeechBuilder developer
telephone number and providing their developer code (which is emailed
to the developer when they register).
Once a speech application is sufficiently robust, it may be possible
to deploy it on a wider scale to the general public via a toll-free
number. When a developer feels that their application is at this
stage, they should contact us via email to pursue this
matter further.
Creating a CGI application
In addition to specifying constraints and example sentences for their
domain, the developer needs to create the program which will provide
the actual domain-specific interaction to the user. To do this, the
developer needs to have access to a CGI-capable web server, and place
the script to be used at a URL matching the one specified to
SpeechBuilder. Because of the flexibility of CGI, it doesn't matter
whether the CGI program is actually a Perl script, a C program
pretending to be a web server itself, an Apache module, or any other
particular setup, as long as it adheres to the CGI specification. All
of our testing to date has been done using Perl and CGI.pm. We
provide each developer with a sample application domain when they
register, and provide a useful Perl module for parsing the semantic
arguments for developers creating their CGI script. Note that each
domain is initially set to a SpeechBuilder URL which will echo the
text parameter of the incoming CGI arguments.
Processing Input
The first thing the CGI script needs to do is produce valid HTTP
headers. Most CGI packages should provide the ability to do this
easily. The text, action, frame, and
history parameters are all passed as individual CGI parameters.
The first parameter is action, which simply tells which action
matched the user's utterance. When SpeechBuilder first receives a new
call, it sends the action ###call_answered### so that the
application can welcome the user to the domain.
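A minimal handler might therefore look like the following sketch (Python for illustration; the starter kit is in Perl, and the reply strings here are hypothetical). It emits the HTTP header followed by the line to be spoken, and greets the user on the ###call_answered### action:

```python
def respond(params):
    """Build a minimal CGI response: HTTP headers, then one line of text
    to be spoken.  `params` is the decoded CGI parameter dict containing
    action, frame, text, and history."""
    action = params.get("action")
    if action == "###call_answered###":
        # A new call: welcome the user to the domain.
        body = "Welcome to the demo domain. How can I help you?"
    elif action == "unknown":
        # The utterance could not be mapped to any action class.
        body = "Sorry, I did not understand. Please try again."
    else:
        # Placeholder behavior: echo the recognized words back.
        body = "You said: " + params.get("text", "")
    return "Content-Type: text/plain\r\n\r\n" + body + "\n"
```

A real application would branch further on the action and frame, as described below.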
The frame parameter contains information on which keys were
found in the parsing of the utterance. If the domain is hierarchical,
it may have several levels of hierarchy built in. Unless the domain is
very simple (perhaps containing one key per utterance), this variable
is very difficult to use in its default form. The application will
probably want to parse the frame (which has a fairly regular
structure) into some internal representation. For our domains, we
built a Perl function to parse a frame and return a hash tree
containing the structure of the frame. This means that at any level in
the hash tree, the keys were all of the key types that were found, and
the values were either the value of the key (if it wasn't
hierarchical) or another hash table containing the inner context (if
it was hierarchical). This makes it very easy to check whether
specific keys exist in the frame, or to extract hierarchical
information without trouble.
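The Perl module in the starter pack is the reference implementation for this. Purely as an illustration, a frame parser along the same lines might look like this sketch in Python, returning a nested dict in place of the Perl hash tree:

```python
def parse_frame(frame):
    """Parse a frame string such as "(object=lights, room=kitchen, value=on)"
    or the hierarchical "(src=BOS, arrival_time=(relative=before,
    time=(hour=10, xm=AM)))" into a nested dict."""
    pos = 0

    def parse_group():
        nonlocal pos
        assert frame[pos] == "("
        pos += 1
        result = {}
        while frame[pos] != ")":
            if frame[pos] in ", ":            # skip separators
                pos += 1
                continue
            eq = frame.index("=", pos)        # key runs up to the next '='
            key = frame[pos:eq].strip()
            pos = eq + 1
            if frame[pos] == "(":             # nested subframe: recurse
                result[key] = parse_group()
            else:                             # flat value: runs to ',' or ')'
                end = pos
                while frame[end] not in ",)":
                    end += 1
                result[key] = frame[pos:end].strip()
                pos = end
        pos += 1                              # consume the closing ')'
        return result

    return parse_group()
```

With a structure like this, checking for a key or descending into a subframe is a plain dictionary lookup.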
Generally, the application should first check which action was
given. If the action is unknown, then the script can either
attempt to get some information out of the keys in the frame, or
simply ask the user to try again. After the application has decided
which action it is dealing with, it needs to check the appropriate
keys. In some cases, the same action can use multiple sets of
keys. For instance, in the house domain with the action turn, a
room may or may not be present, depending on whether the user said,
Turn the lights in the kitchen on, or Turn all the lights
on. The script can simply check for the existence of certain keys
to determine which form was used, and take the appropriate action.
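This check-the-action-then-the-keys pattern can be sketched as follows (Python for illustration; the house-domain names mirror the example above, and the reply strings are hypothetical). The frame is assumed to have already been parsed into a dict:

```python
def handle(action, frame):
    """Dispatch on the action, then inspect whichever keys are present.
    `frame` is the frame already parsed into a dict."""
    if action == "unknown":
        return "Sorry, please try again."
    if action == "set":
        target = frame.get("object", "device")
        value = frame.get("value", "on")
        if "room" in frame:
            # "turn the lights in the kitchen on"
            return f"Turning the {target} in the {frame['room']} {value}."
        # "turn all the lights on" -- no room key present
        return f"Turning all the {target} {value}."
    return "I cannot handle that request yet."
```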
Generating Output
To tell SpeechBuilder what to say to the user, the program needs only
to print the English-language sentence, which will in turn be taken by
the CGI mechanism of the web server and sent to the speech synthesis
server running at MIT. This reply can only be one line long, and
must end with a carriage return. However, the line can be
essentially as long as the developer wants, and can contain multiple
sentences. In order to end a call, simply prefix the final response
with ###close_off###, and SpeechBuilder will hang up after
speaking the last sentence.
To use the history mechanism, the script should simply print a line
starting with history=. The line can contain any data the
developer wants to remember, and must also end with a carriage
return. Further, the history line must occur before the line to be
spoken, since everything after that is ignored.
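Putting these output rules together, a small helper might look like this sketch (Python for illustration; the exact spacing after ###close_off### is an assumption). It emits the optional history= line first, then exactly one line to be spoken:

```python
def render_reply(spoken, history=None, hang_up=False):
    """Emit the body of a SpeechBuilder reply: an optional history= line,
    then exactly one line to be spoken.  The history line must come first,
    since everything after the spoken line is ignored."""
    lines = []
    if history is not None:
        lines.append("history=" + history)
    # Prefixing ###close_off### tells SpeechBuilder to hang up after
    # speaking this line (spacing after the token is an assumption here).
    prefix = "###close_off### " if hang_up else ""
    lines.append(prefix + spoken)
    return "\n".join(lines) + "\n"
```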
The way we used the history mechanism in our test scripts was to
parallel the frame structure. In addition to a Perl function to parse
the frame, we created a function which would take a hash tree
structure (like that generated by the frame parser), and produce a
single-string history frame. This allows the application to make
changes to such a structure to keep track of the current focus (such
as who the user asked about last). The script can then encode this
into a history string, and when it is received by the next call of the
script, decode it back into the structure it started with.
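The encoding half of this round trip can be sketched as a function that serializes a nested structure back into frame syntax (Python for illustration; the original helper was a Perl function operating on hash trees):

```python
def build_frame(d):
    """Serialize a nested dict back into SpeechBuilder's frame syntax,
    e.g. {"property": "phone", "name": "John Smith"} becomes
    "(property=phone,name=John Smith)".  The result is suitable for
    printing as the history= line of the next reply."""
    parts = []
    for key, value in d.items():
        if isinstance(value, dict):
            parts.append(key + "=" + build_frame(value))   # nested subframe
        else:
            parts.append(key + "=" + str(value))           # flat key/value
    return "(" + ",".join(parts) + ")"
```

Round-tripping through a parser and this builder lets the script carry the discourse focus from one call to the next.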
By doing all this, a script can have a fairly complex interaction with
a user, understanding what the user requests, responding
appropriately, and keeping track of the course of the conversation as
it goes -- all using some very simple mechanisms to interact with the
main SpeechBuilder system.
Note that we provide each developer with a sample application domain
when they register, and provide a useful module for parsing the
semantic arguments for developers creating their CGI script using
Perl. This module is included in the SpeechBuilder starter
pack. Also included is another sample application (the "flights"
domain, which is a simplified version of the Mercury travel information system). The
starter pack also includes a very basic application CGI script for
this application that illustrates the use of the parsing module.
Future Activities
The current version of SpeechBuilder has focused primarily on robust
understanding. One of the next phases of research will be to
re-design our discourse component so that it may be used by
SpeechBuilder. Although the history parameter provides a simple
mechanism for the developer to process frames in context, it makes for
extra work in the application program. A separate HLT server which
could resolve many local discourse phenomena (as is used for our own
domains) would simplify the application processing. The developer
interface which will be used to configure the discourse server will
revolve around specifying relationships between defined concepts.
Future work will also develop an interface to create mixed-initiative
dialogues which can automatically interface with our dialogue module.
We have done some initial work in this area and believe we can design
a relatively simple interface which will enable more complex
interactions than are possible with directed-dialogue graph-based
approaches. Finally, we would like to develop an interface for our
language generation component, so that we can begin to develop
multilingual conversational interfaces with SpeechBuilder without
having to modify the application program.
Contacting Us
Comments or questions to bug-galaxy@lists.csail.mit.edu