Speech processing and translation with the Speech service

Study notes

Speech service

  1. Speech-to-Text
    Core API
    API that enables speech recognition in which your application can accept spoken input.
  2. Text-to-Speech
    Core API
    API that enables speech synthesis in which your application can provide spoken output.
  3. Speech Translation
  4. Speaker Recognition
  5. Intent Recognition
  • Create a resource (a dedicated Speech service resource or a multi-service Cognitive Services resource)
  • Get the resource location and one key (Resource Keys/Endpoint)

1. Speech-to-Text
Audio can be processed interactively (in real time) or in batch.
In practice, most interactive speech-enabled applications use the Speech service through a (programming) language-specific SDK
Speech service supports speech recognition via:
  • Speech-to-text API, which is the primary way to perform speech recognition.
  • Speech-to-text Short Audio API, which is optimized for short streams of audio (up to 60 seconds).
Main parameters to configure:
  • SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
  • AudioConfig (optional) to define the input source for the audio to be transcribed (microphone or audio file)
On success:
  • Reason property has the enumerated value RecognizedSpeech (otherwise NoMatch or Canceled)
  • Text property contains the transcript
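The flow above can be sketched with the Azure Speech SDK for Python (the `azure-cognitiveservices-speech` package); the key, region, and file name are placeholders for your own resource's values:

```python
def transcribe_once(key, region, audio_file=None):
    """Recognize a single utterance and return the transcript, or None.

    Sketch based on the Azure Speech SDK for Python; key/region stand in
    for your resource's key and location.
    """
    import azure.cognitiveservices.speech as speechsdk

    # SpeechConfig encapsulates the connection information (key + location).
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)

    # AudioConfig (optional) selects the input source: an audio file,
    # or the default microphone when no file is given.
    if audio_file:
        audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    else:
        audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    result = recognizer.recognize_once()

    # Reason is RecognizedSpeech on success; NoMatch or Canceled otherwise.
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text  # the transcript
    return None
```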
2. Text-to-Speech
Speech service offers two APIs for speech synthesis (spoken output from text):
  • Text-to-speech API, which is the primary way to perform speech synthesis.
  • Text-to-speech Long Audio API, which is designed to support batch operations that convert large volumes of text to audio.
Main parameters to configure:
  • SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
  • AudioConfig (optional) to define the output device for the speech to be synthesized (default system speaker, a null value, or an audio stream object that is returned directly)
On success:
  • Reason property is set to the SynthesizingAudioCompleted enumeration
  • AudioData property contains the audio stream
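A matching synthesis sketch with the Azure Speech SDK for Python; key, region, and file name are again placeholder assumptions:

```python
def speak_text(key, region, text, output_file=None):
    """Synthesize text to speech and return the audio bytes on success.

    Sketch based on the Azure Speech SDK for Python; key/region stand in
    for your resource's key and location.
    """
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)

    # AudioConfig (optional): write to a file, or pass None so the audio
    # stream is returned directly on the result.
    if output_file:
        audio_config = speechsdk.audio.AudioOutputConfig(filename=output_file)
    else:
        audio_config = None  # audio comes back in result.audio_data

    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)
    result = synthesizer.speak_text_async(text).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        return result.audio_data  # the synthesized audio stream
    return None
```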

Audio formats and voices - SpeechConfig
Speech service supports multiple audio output formats for the audio stream that is generated by speech synthesis.
Depending on your specific needs, you can choose a format based on the required:
  • Audio file type
  • Sample-rate
  • Bit-depth
SetSpeechSynthesisOutputFormat method (SpeechConfig object) - specify the required output format.
Speech service provides multiple voices that you can use to personalize your speech-enabled applications:
  • Standard voices - synthetic voices created from audio samples.
  • Neural voices - more natural sounding voices created using deep neural networks.
SpeechSynthesisVoiceName - specify a voice for speech synthesis.
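Both settings live on the SpeechConfig object; a sketch with the Python SDK, where the chosen format and voice name are example assumptions, not the only options:

```python
def configure_synthesis(speech_config):
    """Apply an output format and a neural voice to a SpeechConfig.

    Sketch based on the Azure Speech SDK for Python; the format and
    voice below are placeholder choices.
    """
    import azure.cognitiveservices.speech as speechsdk

    # File type, sample rate and bit depth are all encoded in the enum
    # name: RIFF (WAV) container, 24 kHz sample rate, 16-bit mono PCM.
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)

    # Select a neural voice by name (the Python property equivalent of
    # SpeechSynthesisVoiceName).
    speech_config.speech_synthesis_voice_name = "en-US-AriaNeural"
    return speech_config
```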

Speech Synthesis Markup Language
The Speech service supports two ways to submit speech for synthesis:
  • Speech SDK enables you to submit plain text to be synthesized into speech (via SpeakTextAsync() method)
  • Speech Synthesis Markup Language (SSML) - XML-based syntax for describing characteristics of the speech you want to generate.
    • Specify a speaking style (excited, cheerful...)
    • Insert pauses or silence.
    • Specify phonemes (phonetic pronunciations)
    • Adjust the prosody of the voice (affecting the pitch, timbre, and speaking rate).
    • Use common "say-as" rules (phone numbers, dates...)
    • Insert recorded speech or audio (include a standard recorded message)
  • SpeakSsmlAsync() - submit the SSML description to the Speech service.
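A minimal SSML document illustrating a voice, a speaking style, and a pause can be built as a plain string and passed to SpeakSsmlAsync(); the voice name and style here are example assumptions:

```python
def build_ssml(text, voice="en-US-AriaNeural", style="cheerful",
               pause_ms=500):
    """Return an SSML document that wraps `text` with a voice, a
    speaking style, and a trailing pause."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
        f'<break time="{pause_ms}ms"/>'
        '</voice></speak>'
    )

# The resulting string is what you submit via SpeakSsmlAsync().
ssml = build_ssml("I say tomato.")
```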

Translate speech
Built on speech recognition:
  • Recognize and transcribe spoken input in a specified language
  • Return translations of the transcription in one or more other languages
  • A Speech or Cognitive Services resource must already be created.
  • Have the location and one key of that resource (as above).
Main parameters to configure:
SpeechConfig object - information required to connect to your Speech resource (location, key)
SpeechTranslationConfig object (input language, target languages)
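A sketch of the configuration step with the Python SDK; key, region, and the language codes are placeholder assumptions:

```python
def make_translation_config(key, region, source="en-US",
                            targets=("fr", "de")):
    """Build a SpeechTranslationConfig: connection information plus the
    input language and one or more target languages.

    Sketch based on the Azure Speech SDK for Python.
    """
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    config.speech_recognition_language = source  # language of spoken input
    for lang in targets:                         # two-character ISO codes
        config.add_target_language(lang)
    return config
```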

Returned on success:
  • Reason property has the enumerated value RecognizedSpeech
  • Text property contains the transcription in the original language
  • Translations property contains a dictionary of the translations (using the two-character ISO language code, such as "en" for English, as a key)
The TranslationRecognizer returns translated transcriptions of spoken input - essentially translating audible speech to text; for speech-to-speech translation, the translated text must then be synthesized and spoken out.
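The shape of a successful result can be illustrated with a plain dictionary standing in for the Translations property (the values are invented examples, not real service output):

```python
# Stand-in for result.translations after translating "Hello" from English;
# keys are two-character ISO language codes, values the translated text.
translations = {"fr": "Bonjour", "de": "Hallo"}

# Look up one target language's translation.
french = translations.get("fr")

# Or walk every requested target language.
lines = [f"{code}: {text}" for code, text in sorted(translations.items())]
```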

Event based synthesis
For 1:1 translation (a single target language), you can use event-based synthesis to capture the translation as an audio stream:
  • Specify the desired voice for the translated speech in the TranslationConfig.
  • Create an event handler for the TranslationRecognizer object's Synthesizing event.
  • In the event handler, use the GetAudio() method of the Result parameter to retrieve the audio stream.
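The three steps above can be sketched in Python; `sink` is an assumed callable that receives each audio chunk, and in the Python SDK the audio bytes are exposed as a property rather than the C# GetAudio() method:

```python
def attach_synthesis_handler(recognizer, translation_config, voice, sink):
    """Wire up event-based synthesis for a single target language.

    Sketch based on the Azure Speech SDK for Python; `recognizer` is a
    TranslationRecognizer, `translation_config` a SpeechTranslationConfig,
    and `sink` a placeholder callable for consuming audio chunks.
    """
    # 1. The desired voice for the translated speech goes on the config.
    translation_config.voice_name = voice

    # 2. Event handler for the recognizer's synthesizing event; the audio
    #    bytes are on evt.result.audio (GetAudio() in the C# SDK).
    def on_synthesizing(evt):
        sink(evt.result.audio)  # 3. retrieve the audio in the handler

    recognizer.synthesizing.connect(on_synthesizing)
```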
Manual synthesis
Doesn't require you to implement an event handler. You can use manual synthesis to generate audio translations for one or more target languages.
  • Use a TranslationRecognizer to translate spoken input into text transcriptions in one or more target languages.
  • Iterate through the Translations dictionary in the result of the translation operation, using a SpeechSynthesizer to synthesize an audio stream for each language.
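The loop described above can be sketched as follows; the key, region, and per-language voice mapping are placeholder assumptions:

```python
def synthesize_translations(key, region, translations):
    """Manually synthesize each translated transcription to audio.

    Sketch based on the Azure Speech SDK for Python; `translations` is
    the result's Translations dictionary (ISO code -> text), and the
    voice mapping below is an example choice.
    """
    import azure.cognitiveservices.speech as speechsdk

    voices = {"fr": "fr-FR-HenriNeural", "de": "de-DE-KatjaNeural"}

    audio = {}
    for code, text in translations.items():
        speech_config = speechsdk.SpeechConfig(subscription=key,
                                               region=region)
        speech_config.speech_synthesis_voice_name = voices.get(code, "")
        # audio_config=None returns the stream directly on the result.
        synthesizer = speechsdk.SpeechSynthesizer(
            speech_config=speech_config, audio_config=None)
        result = synthesizer.speak_text_async(text).get()
        audio[code] = result.audio_data
    return audio
```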
