• 18-19 College Green, Dublin 2
  • 01 685 9088
  • info@cunninghamwebsolutions.com
  • cunninghamwebsolutions
    Cunningham Web Solutions
    • Home
    • About Us
    • Our Services
      • Web Design
      • Digital Marketing
      • SEO Services
      • E-commerce Websites
      • Website Redevelopment
      • Social Media Services
    • Digital Marketing
      • Adwords
      • Social Media Services
      • Email Marketing
      • Display Advertising
      • Remarketing
    • Portfolio
    • FAQ’s
    • Blog
    • Contact Us
    MENU CLOSE back  

    Text To Speech With AWS

    You are here:
    1. Home
    2. Web Design
    3. Text To Speech With AWS
    Amazon Polly Console

    Text To Speech With AWS

    Text To Speech With AWS

    Philip Kiely

    2019-08-01T14:00:00+02:00
    2019-08-01T12:35:46+00:00

    This two-part series presents three projects that teach you how to use AWS (Amazon Web Services) to transform text between its written and spoken states. The first project will use text to speech to turn a blog post or other written content into a spoken .mp3 file to give more options to blind and dyslexic users of your site.

    In the next article, we will embark on the return journey, from speech to text, and consider the accuracy of these transcriptions by sending various samples through a round-trip translation. To follow these tutorials, you will need an AWS account with billing enabled, though the tutorials will stay well within the constraints of free-tier resources. Examples will focus on using the AWS console, but I will also demonstrate the AWS CLI (Command Line Interface), which requires basic command line knowledge.

    Introduction And Motivation

    Most of the internet is text-based. Text is lightweight (1 byte per letter), widely supported, easy to interpret, and has a precedent as old as the internet as the default medium of online communication. Sending written text predates the internet: telegraphs carried text over wires hundreds of years ago and physical mail has transmitted writing for centuries. Voice transmission over radio and telephone also predates the internet, but did not translate to the same foundational medium that text did online. This is in almost all cases a good thing, again, text is lightweight and easy to interpret compared to audio. However, transforming between voice and text can add powerful functionality to and improve the accessibility of a wide variety of applications.

    It has always been possible to transform between audio and text, you can read a written speech or transcribe an oral sermon. Indeed, if we think back to the telegram, trained operators transcoded Morse Code messages to words. In each example, it has always been very labor intensive to move from speech to writing or back, even with specialized training and equipment. With a variety of cloud services, we can automate these processes to allow transitioning between mediums in seconds without any human effort, which expands the possible use cases.

    The most obvious benefit of implementing appropriate text to speech and speech to text options is accessibility. A visually impaired or dyslexic user would benefit from a narrated version of an article, while a deaf person could become a member of your podcasting audience by reading a transcript of the show.

    Text to Speech Project

    Say you wanted to add narrated versions of every post to your blog. You could purchase a microphone and invest hours into recording and editing spoken renditions of each post. This would result in a superior listener experience, but if you want most of the benefit for only a couple of minutes and a few pennies per post, consider using AWS instead. If you are the sort of person who regularly updates and revises older or evergreen content, this method also helps you keep the spoken version up to date with minimal effort.

    We will begin with text to speech using Amazon Polly. For simple exploration, AWS provides a graphical user interface through its online console. After logging in to your AWS account, use the “Services” menu to find “Amazon Polly” or go to https://us-east-1.console.aws.amazon.com/polly/home/SynthesizeSpeech.

    Using the Polly Console

    Amazon Polly Console

    Amazon Polly provides a console to perform text-to-speech operations. (Large preview)

    You can use the Amazon Polly console to read 3,000 characters (about 500 words) and get an audio stream or immediate download. If you need up to 100,000 characters (about 16,600 words) read, your only option is to have AWS store the result in S3 after it has finished processing, which can take a couple of minutes. At the time of writing, Amazon Polly does not support inputs of over 100,000 billable characters, if you want to convert a longer text like a book you will most likely have to do so in chunks and concatenate the audio files yourself.

    A “billable character” is one that the service actually pronounces. Specifically, that means that SSML tags are not billable characters, which we will cover later. For your first year of using Amazon Polly, you get 5 million billable characters per month for free, which is more than enough to run the examples from this article and do your own experimentation. Beyond that, Amazon Polly costs four dollars per million billable characters at the time of writing, meaning that converting a standard-length novel would cost about two dollars.

    The console also allows you to change the language, region, and voice of the reader. Though this article only covers English, at the time of writing AWS supports 21 languages and 29 distinct language-region pairs. While most regions only have one or two voices, popular ones like United States English have several options to chose between.

    Amazon Polly narrated text is very obviously read by a robot, but the resulting audio is quite listenable.

    “

    I often prefer to use the UK English voice “Brian.” To my American ears, the British accent covers some of the inflections in robotic speech and makes for a smoother listening experience. To be clear, Amazon Polly narrated text is very obviously read by a robot, but the resulting audio is quite listenable.

    It is significantly better than the built-in reader that the MacOS say terminal command uses, and is comparable to the speech quality of voice assistants like Siri and Alexa.

    Writing SSML

    If you want full control over the resultant speech, you can take the time to tag your input with SSML. SSML (Speech Synthesis Markup Language) is a standardized language for representing verbal cues in text. Like HTML, XML, and other markup languages, it uses opening and closing tags. Amazon Polly supports SSML input, and tags do not count as “billable characters.” Alexa skills also use SSML for pre-programmed responses, so it is a worthwhile language to know.

    The foundational tag, , wraps everything that you want read. Like HTML, use

    to divide paragraphs, which results in a significant pause in the narration. Smaller pauses come from punctuation, and you always have the option to insert pauses of up to ten seconds with .

    SSML provides , a very flexible tag that supports everything from pronouncing phone numbers to censoring expletives using the interpret-as argument. Consider the options from this tag with the following sample.

    <speak>
    Call 5551230987 by 11'00" PM to get tips on writing clean JavaScript.<break time="1s"/>
    Call <say-as interpret-as="telephone">5551230987</say-as> by 11'00" PM to get tips on writing clean <say-as interpret-as="expletive">JavaScript</say-as>
    </speak>

    Further flexibility comes from the tag, which provides you with control over the rate, pitch, and volume of speech. Unfortunately, at the time of writing Polly does not support the tag, which Alexa skills can use to speak in multiple standard voices, but does support the tag that allows voices in one language to correctly pronounce words from other languages. In this example, corrects the pronunciation of “tag” from American to German.

    <speak>
        Guten tag, where is the airport?<break time="1s"/>
        <lang xml:lang="de-DE">Guten tag</lang>, where is the airport>
    </speak>

    Finally, if you want to customize pronunciation within a language, Amazon Polly supports the tag.

    My last name, Kiely, is spelled differently than it is pronounced. Using the x-sampa alphabet, I am able to specify the correct pronunciation.

    “

    <speak>
        Philip Kiely<break time="1s"/>
        Philip <phoneme alphabet="x-sampa" ph="ˈkaI.li">Kiely</phoneme>
    </speak>

    This is not an exhaustive list of the customization options available with SSML. For a complete reference, visit the documentation.

    Writing Lexicons

    If you want to specify a consistent custom pronunciation or expand an abbreviation without tagging each instance with a phoneme tag, or you are using plain text instead of SSML, Amazon Polly supports lexicons of custom pronunciations. You can apply up to five lexicons of up to 4,000 characters each per language to a narration, though larger lexicons increase the processing time.

    As with before, I want to make sure that Amazon Polly says my name correctly, but this time I want to do so without using SSML. I wrote the following lexicon:

    <?xml version="1.0" encoding="UTF-8"?>
    <lexicon version="1.0"
          xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
            http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
          alphabet="x-sampa"
          xml:lang="en-US">
      <lexeme><grapheme>Kiely</grapheme><alias>ˈkaIli</alias></lexeme>
    </lexicon>

    The header and tag will stay mostly constant between lexicons, though the tag supports two important arguments. The first, alphabet, lets you choose between x-sampa and ipa, two standard pronunciation alphabets. I prefer x-sampa because it uses standard ASCII characters, so I am unlikely to encounter encoding issues. The xml:lang argument lets you specify language and region. A lexicon is only usable by a voice from that language and region.

    The lexicon itself is a sequence of tags. Each one contains a tag, which contains the original text, and the tag, which describes what you want said instead. Aliases go beyond pronunciation, you can use them for expanding abbreviations (“Jr” becomes “Junior”) or replacing words (“Bruce Wayne” becomes “Batman”). A lexicon can have as many lexeme tags as it can fit in the 4,000 character limit.

    Amazon Polly Console with lexicon loaded

    The included lexicon will modify the pronunciation of the input text. (Large preview)

    The screenshot shows the plain text that would be mispronounced and the applied lexicon. Use the “Customize Pronunciation” menu to select up to five uploaded lexicons, uploaded from the left navbar tab “Lexicons.” Listening to the speech verifies that my name is said correctly.

    Now that we have full control over the resultant speech, let’s consider how to save the output for use in our application.

    Saving and loading from S3

    If you want to re-use spoken text in your application, you’ll want to choose the “Synthesize to S3” option in the Amazon Polly console. In this example, I am using the voice “Brian” to perform a surprisingly capable reading of Shakespeare’s sonnet XXIX. We begin by copying in the poem as plain text and selecting “Synthesize to S3,” which launches the following modal.

    S3 Synthesize Modal

    The ‘Synthesize to S3’ button gives you options for where to save the resultant file. (Large preview)

    S3 buckets have globally unique names, and you can enter any S3 bucket that you own or have the appropriate permissions to. Make sure the bucket allows for making its contents public, as that will be required in a future step. You should also set a “S3 key prefix,” which is a string that will help you identify the output in the bucket. After clicking Synthesize and giving it a moment to process, we navigate to the S3 bucket that we synthesized the speech into.

    S3 Bucket main page

    A S3 bucket stores your project’s files. (Large preview)

    The arrow points to the entry in the bucket that we just created. Selecting that item will bring us to the following page.

    S3 Bucket file view

    For each file, you can make it public using this button. (Large preview)

    Follow the arrow to select the “Make Public” option, which will make the file accessible to anyone with a link. Scroll down and copy the link and use it in your application. For example, you can download the poem here. For many applications, you may wish to pass the url to an html tag to allow for web playback.

    We have covered every necessary component for transforming text to speech on AWS. Next, we turn our attention to a more advanced interface that can provide automation potential and save time.

    Using the AWS CLI

    Back to our hypothetical blog post. The simplest workflow would be to take the final written version of each article, copy it into the console, click the “Synthesize to S3 button,” and embed a download link to the resultant .mp3 file in the blog. Honestly, this is a pretty decent workflow; it is exactly what I do for my personal website. However, AWS offers another option: the AWS CLI.

    Make sure that you have installed and configured the AWS CLI appropriately. Begin by entering aws polly help to make sure that Polly is available and to read a list of supported commands. For troubleshooting, see the documentation.

    To perform a conversion from the command line, I first copied the poem from earlier into a .txt file. I then ran the following command in terminal (MacOS/Linux):

    aws polly synthesize-speech 
        --output-format mp3 
        --voice-id Joanna 
        --text "`cat sonnetxxix.txt`" 
        poem.mp3
    

    In a few seconds, the resulting .mp3 file was downloaded to my machine, ready for inclusion in my CMS or other application. Note the special characters around the --text argument, this passes the contents of the file rather than just the file name.

    Finally, for more advanced applications, Amazon Polly has an SDK for 9 languages/platforms. The SDK would be overkill for these examples, but is exactly what you want for automating Amazon Polly calls, especially in response to user actions.

    Conclusion

    Text to speech can help you create more versatile, accessible content. Beginning in the Amazon Polly console, we can transform up to 100,000 billable characters in plain text or SSML, make the resulting .mp3 file public, and use that file in an application. We can use the AWS CLI for automation and more convenient access.

    Stay tuned for the second installment of the series, we will convert media in the other direction, from speech to text, and consider the benefits and challenges of doing so. Part two will build on the technologies that we have used so far and introduce Amazon Transcribe.

    Further Reference

    • AWS Polly
    • AWS S3
    • SSML Reference
    • Managing Lexicons

    (yk,ra)

    From our sponsors: Text To Speech With AWS

    Posted on 1st August 2019Web Design
    FacebookshareTwittertweetGoogle+share

    Related posts

    Archived
    22nd March 2023
    Archived
    18th March 2023
    Archived
    20th January 2023
    Thumbnail for 25788
    Handling Continuous Integration And Delivery With GitHub Actions
    19th October 2020
    Thumbnail for 25781
    Supercharge Testing React Applications With Wallaby.js
    19th October 2020
    Thumbnail for 25778
    A Monthly Update With New Guides And Community Resources
    19th October 2020
    Latest News
    • Archived
      22nd March 2023
    • Archived
      18th March 2023
    • Archived
      20th January 2023
    • 20201019 ML Brief
      19th October 2020
    • Thumbnail for 25788
      Handling Continuous Integration And Delivery With GitHub Actions
      19th October 2020
    • Thumbnail for 25786
      The Future of CX with Larry Ellison
      19th October 2020
    News Categories
    • Digital Marketing
    • Web Design

    Our services

    Website Design
    Website Design

    A website is an important part of any business. Professional website development is an essential element of a successful online business.

    We provide website design services for every type of website imaginable. We supply brochure websites, E-commerce websites, bespoke website design, custom website development and a range of website applications. We love developing websites, come and talk to us about your project and we will tailor make a solution to match your requirements.

    You can contact us by phone, email or send us a request through our online form and we can give you a call back.

    More Information

    Digital Marketing
    Digital Marketing

    Our digital marketeers have years of experience in developing and excuting digital marketing strategies. We can help you promote your business online with the most effective methods to achieve the greatest return for your marketing budget. We offer a full service with includes the following:

    1. Social Media Marketing

    2. Email & Newsletter Advertising

    3. PPC - Pay Per Click

    4. A range of other methods are available

    More Information

    SEO
    SEO Services

    SEO is an essential part of owning an online property. The higher up the search engines that your website appears, the more visitors you will have and therefore the greater the potential for more business and increased profits.

    We offer a range of SEO services and packages. Our packages are very popular due to the expanse of on-page and off-page SEO services that they cover. Contact us to discuss your website and the SEO services that would best suit to increase your websites ranking.

    More Information

    E-commerce
    E-commerce Websites

    E-commerce is a rapidly growing area with sales online increasing year on year. A professional E-commerce store online is essential to increase sales and is a reflection of your business to potential customers. We provide professional E-commerce websites custom built to meet our clients requirements.

    Starting to sell online can be a daunting task and we are here to make that journey as smooth as possible. When you work with Cunningham Web Solutions on your E-commerce website, you will benefit from the experience of our team and every detail from the website design to stock management is carefully planned and designed with you in mind.

    More Information

    Social Media Services
    Social Media Services

    Social Media is becoming an increasingly effective method of marketing online. The opportunities that social media marketing can offer are endless and when managed correctly can bring great benefits to every business.

    Social Media Marketing is a low cost form of advertising that continues to bring a very good ROI for our clients. In conjuction with excellent website development and SEO, social media marketing should be an essential part of every digital marketing strategy.

    We offer Social Media Management packages and we also offer Social Media Training to individuals and to companies. Contact us to find out more.

    More Information

    Cunningham Web Solutions
    © Copyright 2025 | Cunningham Web Solutions
    • Home
    • Our Services
    • FAQ's
    • Account Services
    • Privacy Policy
    • Contact Us