
A DECISION LOGIC FOR SPEECH RECOGNITION

William C. Dersch

IBM Advanced Systems Development Division

INTRODUCTION

The following pages are essentially notes on a demonstration. The logic to be demonstrated represents early results of work now under way in the IBM San Jose Advanced Systems Development Division Laboratory.

In the field of speech recognition, as in others, workers soon gain an appreciation of human abilities that they might earlier have tended to take for granted. Consider, for a moment, that speech varies with the speaker, and with the emotional, grammatical and environmental contexts in which words are spoken. When different speakers pronounce the same word, the ear and mind readily overcome differences in amplitude or loudness, frequency or pitch, tone, intonation and inflection. Nor do we experience difficulty if, within broad limits, words are spoken more or less rapidly. Again within limits, we can single out one voice when several people are talking at once.

These problems people cope with tolerably well because they have an enormously efficient self-organizing memory, the human brain. We have no such brain at our disposal for mechanical speech recognition, although we do have computers, and much interesting and valuable work has been done to adapt computers to this problem. In this field some of the most notable work has been performed by Bell Telephone Laboratories and G. L. Shultz of IBM Research.

But suppose that, for simplicity, despite the already formidable difficulties, we set ourselves the task of recognizing words without the aid of a large-scale computer. What alternatives are open to us? What departures must be made--from sophisticated statistical decision procedures, for example--to compensate us for our constraints? What are our limitations under these constraints? How far are we likely to get?

These, in the broadest sense, are the problems, natural and self-imposed, that those of us engaged in the speech recognition work in San Jose have faced, and the questions we have sought to answer.

ELEMENTARY SYSTEM LOGIC AND LIMITATIONS

Figure 1 is a simple block diagram of a speech recognition device. The figure is of sufficient generality to describe not only our work, but several ingenious speech recognition schemes developed during the past decade. From these systems we have borrowed the overall logical flow of our own system.

[Figure 1: Simple block diagram of a speech recognition device]

We say in Fig. 1 that we want to measure certain properties of an electrical analog of the speech pressure wave, store these measured properties in an ordered sequence, analyze the stored information and decide which word was spoken.
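As a rough illustration of that flow, the following sketch runs the four stages of Fig. 1 in order: measure, store, analyze, decide. The stage names, the choice of short-time frame energy as the measured property, and the distance-based comparison are all assumptions of this illustration; they are not a description of the circuitry discussed in these notes.

```python
# A minimal sketch of the four stages implied by Fig. 1.
# All specifics here (frame energy, squared-distance comparison) are
# assumptions made for illustration only.

from typing import Dict, List, Sequence


def measure(waveform: Sequence[float], frame: int = 100) -> List[float]:
    """Measure a property of the electrical analog of the speech wave:
    here, simply the energy of fixed-length frames."""
    return [sum(x * x for x in waveform[i:i + frame])
            for i in range(0, len(waveform), frame)]


def store(measurements: List[float]) -> List[float]:
    """Store the measured properties in an ordered (time) sequence."""
    return list(measurements)


def analyze(stored: List[float],
            vocabulary: Dict[str, List[float]]) -> Dict[str, float]:
    """Compare the stored sequence against a reference pattern per word."""
    def distance(a: List[float], b: List[float]) -> float:
        n = min(len(a), len(b))
        return sum((a[i] - b[i]) ** 2 for i in range(n))
    return {word: distance(stored, pattern)
            for word, pattern in vocabulary.items()}


def decide(scores: Dict[str, float]) -> str:
    """Decide which vocabulary word was spoken: the closest pattern wins."""
    return min(scores, key=scores.get)
```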

Because we felt that full recognition of continuous speech could at best be attained only by degrees, we decided to attack first a simplified version of the problem. Thus, the following limitations would be permitted in the operation of the logic:

1. To supply word boundaries, spoken words would be artificially separated in time.

2. Words would be spoken "naturally"; i.e., not shouted, sung, whispered or mumbled.

3. Adjustment would be permitted in order to match the logic to the voice characteristics of the speaker.

4. Some training of the speaker (about an hour or so) would be allowed to accommodate the machine.

5. The vocabulary would be limited.

Despite these limitations, we wanted a realistic exposure to the problem. Hence, it was decided that our words would be common rather than highly special or coded words, and that there be no a priori word-ordering logic; the circuit would be able to recognize any of the words in the vocabulary in any sequence. Moreover, our device would operate in "real time," identifying words as they are spoken rather than ten minutes or an hour later. Finally, we wished the equipment to be relatively insensitive to room noise.

SOME MACHINE PROBLEMS AND TYPICAL SOLUTIONS

Assuming the elementary logic of Fig. 1, what are some of the problems and how have they typically been approached?

Because speech is a time-dependent phenomenon, our measurements must be located with respect to a time base. In order to lay off measurements along a time base, we should suppose that it would be helpful to know where the word begins. But this turns out to be a highly ambiguous event. Does the word begin mechanically, when the mouth is opened, when the air stream starts, or when some significant measurement is recorded? There is no reason to believe that the mechanical and conceptual beginnings are always closely related, since the latter can occur even after the word is spoken.
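To make the ambiguity concrete, consider a hypothetical energy-threshold detector (an assumption of this sketch, not the registration scheme used in the device): the "beginning" it reports shifts simply with the threshold chosen.

```python
# Illustration of why the "beginning" of a word is ambiguous: a simple
# energy-threshold rule reports different start times for different
# thresholds. The numbers below are invented for the illustration.

def word_start(frame_energies, threshold):
    """Return the index of the first frame whose energy reaches the
    threshold, or None if no frame does."""
    for i, e in enumerate(frame_energies):
        if e >= threshold:
            return i
    return None


energies = [0.01, 0.02, 0.05, 0.40, 0.90, 0.85, 0.30, 0.04]
print(word_start(energies, 0.03))   # 2 -- a breathy onset already counts
print(word_start(energies, 0.50))   # 4 -- only the loud vowel counts
```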

Related to, but distinct from, the problem of registration is that of segmentation. What sub-units of the word, if any, shall we try to identify? Should the sub-units be syllables as we were taught to recognize them in school--syllables such as "-ing," or "-tion," for example? Should they be phonemic rather than morphemic entities--individual units of sound that we recognize, or linguists tell us we recognize, as irreducible? Should we try to recognize whole words? How shall we determine the beginnings and endings of our units of recognition? In addition, of course, there are the problems of variations of speech rate and loudness alluded to in the Introduction.

In general--and, I hasten to add, only in general--the attempts at solutions to these problems have been in the following directions:

1. Emphasis on the identification of phonemes or phoneme-like speech events.

2. Reliance on frequency information. Much excellent work has been done on frequency-band ratio analysis, on the assumption that the amounts of energy in several bands are more nearly constant relative to one another for different utterances of the same sound than the energy in a single band. (A brief sketch of this ratio idea follows the list.)

3. Tracking of the "formants" or relatively slowly changing energy concentrations in the frequency spectrum of the spoken word.

4. Normalization of the input information, by a variety of procedures, to overcome the problems indicated above; i.e., variations in loudness, length, etc.

5. Use of statistical analysis to arrive at decisions that are probabilistic in character. With such techniques, the value of the large computer is apparent, and when such techniques are employed the input will usually be digitalized or coded for processing.
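The sketch below illustrates the ratio idea of item 2 only in principle: energy in several bands, expressed as a fraction of the total, so that an overall change in loudness cancels out. The band edges, frame length, and use of an FFT are assumptions of this illustration, not a description of any of the systems alluded to above.

```python
# Band-energy ratios: an overall scaling of the input (louder speech)
# leaves the ratios unchanged. Band edges and frame length are arbitrary
# choices for this sketch.

import numpy as np


def band_energy_ratios(frame, sample_rate, band_edges_hz):
    """Return the fraction of the frame's spectral energy in each band."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band_energy = [spectrum[(freqs >= lo) & (freqs < hi)].sum()
                   for lo, hi in band_edges_hz]
    total = sum(band_energy) or 1.0
    return [e / total for e in band_energy]


# A synthetic "vowel" spoken twice as loudly gives the same ratios.
rate = 8000
t = np.arange(0, 0.05, 1.0 / rate)
vowel = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
bands = [(0, 1000), (1000, 3000), (3000, 4000)]
print(band_energy_ratios(vowel, rate, bands))
print(band_energy_ratios(2.0 * vowel, rate, bands))  # identical ratios
```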

THE IBM SAN JOSE WORK IN SPEECH RECOGNITION

Because the goals of the work in the Advanced Systems Development Division Laboratory in San Jose were, in the respects noted above, limited ones, it was appropriate that our means should be relatively simple. We were not in search of a general theory of machine speech recognition--nor do we have any such theory to enunciate.

Our self-imposed boundary limitations have forced departures from the directions indicated as "typical" in the preceding section. These departures have had very interesting consequences, and what degree of success we have had has been a consequence of them. We have borrowed where we could, but in addition have been impelled to interpret our experimental results sensitively and correctly, not from a human user's standpoint, but from a device standpoint.

Though we have made use of computers in preliminary and supporting analyses, we have forgone the aid of statistical analysis and probabilistic decision criteria in the device itself. Further, if our circuits were sensitive to the unique essence of the measured sounds, loudness, pitch and speech-rate difficulties would be minimized, and normalization would be unnecessary.

Thus we have been forced to come at the problem in a relatively new way. Instead of asking, "What measurements would we like to perform?" or "What sub-units of the word would we like to recognize?", we have had to ask ourselves, "What characteristics of the speech wave can be
