Re: Speech Recognition Toolkits - Requirements



Ivan,

You've mentioned speech RECOGNITION but nothing about SYNTHESIS, or
perhaps more importantly, integrating them together so they don't screw
each other up when conducting a conversation.

Despite the variety of ASR and SAPI systems around, there is a good
reason that there are NO conversational programs: no one is integrating
these systems behind an API that is capable of carrying on a
conversation.

I am now working on just such an interface, but there are a LOT of
previously unknown problems that block the way, a couple of which include:

1. The ASR engine must often continue to function AFTER a pause or other
signal that the speaker is done speaking - just to finish recognizing
what they said. However, this tail-end recognition sucks up so much CPU
that it messes up the operation of speech synthesis, so the ASR probably
needs its CPU priority dropped momentarily to avoid making the synthesis
engine stutter.

2. During the above-mentioned period after an utterance, it may be
necessary to synthesize something WITHOUT the entire text of the
utterance and/or without enough CPU time to analyze whatever was
said so far. People do this all the time, so why shouldn't computers?!
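
To make the two problems above concrete, here is a minimal sketch in
Python. It is only a simulation under stated assumptions, not how any
real ASR or synthesis engine is driven: the names TurnState, asr_budget,
run_turn and early_reply are hypothetical, "priority" is modelled as a
per-tick work budget rather than a real OS priority, and the canned
replies are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TurnState:
    """Hypothetical bookkeeping for one conversational turn."""
    asr_frames_pending: int     # audio frames the recognizer still has to decode
    synth_chunks_pending: int   # audio chunks the synthesizer still has to play


def asr_budget(synth_busy: bool, normal: int = 4, throttled: int = 1) -> int:
    """Frames the recognizer may decode this tick; fewer while synthesis plays.

    This stands in for "momentarily dropping the ASR's CPU priority".
    """
    return throttled if synth_busy else normal


def run_turn(state: TurnState) -> list[tuple[int, int]]:
    """Step both engines until idle; return (asr_decoded, synth_played) per tick."""
    log = []
    while state.asr_frames_pending or state.synth_chunks_pending:
        synth_busy = state.synth_chunks_pending > 0
        decoded = min(asr_budget(synth_busy), state.asr_frames_pending)
        played = 1 if synth_busy else 0   # synthesis always gets its chunk out
        state.asr_frames_pending -= decoded
        state.synth_chunks_pending -= played
        log.append((decoded, played))
    return log


def early_reply(partial_hypothesis: list[str], final: bool) -> str:
    """Pick something to say, possibly before recognition has finished."""
    if final:
        return "OK: " + " ".join(partial_hypothesis)
    # Recognition is still running: acknowledge without committing to content.
    return "Mm-hm." if partial_hypothesis else "Sorry?"


if __name__ == "__main__":
    # The recognizer is throttled for the two ticks in which synthesis plays,
    # then finishes the remaining frames at its normal rate.
    print(run_turn(TurnState(asr_frames_pending=6, synth_chunks_pending=2)))
    print(early_reply(["turn", "on"], final=False))
```

In a real system the budget would presumably be replaced by whatever
priority or scheduling control the platform offers, and early_reply
would need far more than two canned responses; the sketch only shows
where those two decisions sit in the turn.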

As you can see, integration of recognition and synthesis, while being
simpler than either recognition or synthesis, is nonetheless still a
non-trivial thing. Clearly, the toolkit you are working on will fall
into one of the following categories:

1. Supports conversational operation,

2. Provides the APIs for someone else to write an interface that
supports conversational operation, or

3. Is useless for conversational operation.

Of course the direction you take is for you to choose, but realize that
conversational operation is "just around the corner" so if you don't
support it, your efforts will probably become obsolete even before they
are completed.

Steve Richfield
=========================
Dear All

I am gathering requirements for a family of speech recognition toolkits
(which for the moment I'm calling Quivis Oratio).  The toolkits should
cover everything from gathering raw data (speech and text) to building
a 'Speech Engine' (i.e., an expanded recogniser, compatible with speech
APIs like Sun's JSAPI and/or Microsoft's MSAPI).

Initially, I'm considering three toolkits:

  Field Assistant:
    These tools facilitate collecting data, and processing the data
into acoustic models and language models (and any other models we might
need).

  Engine Builder:
    These tools help the developer build a language-specific ASR Engine
around a set of mathematical models (e.g., as generated by Field
Assistant), providing required APIs.

  ASR Engine for Welsh:
    This is an example output from the Engine Builder, given Welsh
models (e.g., output from the Field Assistant), ready for use in an
application.

I plan to release the requirements and design documentation, and
ultimately the source code, for the first two of these, under
appropriate open licences (i.e., GPL, BSD and/or Creative Commons
licences).  Licensing restrictions on the Welsh ASR Engine will depend
on intellectual property negotiations with data providers.

If you are interested in either developing for or using this kind of
software, please visit the project website at
http://www.iau.ukfsn.org/srtk/.  The requirements page
(http://www.iau.ukfsn.org/srtk/requirements/index.html) has links to a
user questionnaire and a wiki for discussion.  All comments are
welcome.

This project is being part-funded by a SMARTCymru Technical Feasibility
Phase grant from the Welsh Development Agency
(http://www.wda.co.uk/smartcymru/).

Best wishes

Ivan Uemlianin


Dear All

I am setting about gathering requirements for a set of speech
recognition toolkits (which I am calling Quivis Oratio for the moment).
The toolkits should cover everything from gathering raw data (speech
and text) to building a 'Speech Engine' (i.e., an expanded recogniser,
compatible with speech APIs like Sun's JSAPI and/or Microsoft's
MSAPI).

Initially, I am considering three toolkits:

  Field Assistant:
    These tools facilitate the process of collecting data, and of
processing the data into acoustic models and language models (and any
other models we might need).

  Engine Builder:
    These tools help the developer build an ASR Engine based on a set
of mathematical models for a specific language (e.g., as generated by
the Field Assistant), providing the required APIs.

  ASR Engine for Welsh:
    This is an example output from the Engine Builder given specific
Welsh models (e.g., output from the Field Assistant), ready for use in
an application.

My intention is to release the documents describing the requirements
and design of the first two, and ultimately the source code, under
suitable open licences (i.e., GPL, BSD and/or Creative Commons
licences).  The restrictions on the licence for the Welsh ASR Engine
will depend on the process of negotiating intellectual property with
those who provide the data.

If you are interested in either developing for this kind of software
or using it, go to the project website at
http://www.iau.ukfsn.org/srtk/cy/.  The requirements page
(http://www.iau.ukfsn.org/srtk/cy/requirements/index.html) has links to
a questionnaire for users and a wiki for discussion.  You are welcome
to offer any comments.

This project is being part-funded by a SMARTCymru Technical
Feasibility phase grant from the Welsh Development Agency
(http://www.wda.co.uk/smartcymru/).

Best wishes

Ivan Uemlianin

