The truth about AI voices for audio guides - how good are they really?

Nubart's blog - the truth about AI voices in audio guides

AI voices replacing human voices offer new opportunities for the production of audio guides. Easy and fast creation of voices narrating the script in different languages with the help of AI can save a lot of money and time. However, there are some limitations to these voices as machines are not as easily adaptable as humans (yet). This article points out a few of the issues you may face when implementing them in an audio guide "for real".

How does it work?

AI has made many aspects of our lives easier and more enjoyable. With the advent of AI-generated voices, the audio guide industry has also been affected. Using technologies such as text-to-speech (TTS) and natural language processing (NLP), any text can be converted into an artificial voice that can guide your visitors. Choosing AI voices for your audio guide has the following advantages:

Main advantages of AI voices for an audio guide

1 - Affordable

If you are on a tight budget but still want to produce an audio guide for your museum, AI voices might be a good solution for you. Compared to a production with human voices, AI voices are a lot cheaper, as it's mainly a machine doing the work.

2 - Quick production... up to a point

Using AI voices for your audio guide guarantees fast production, as different languages and accents can be produced quickly. However, it's not as simple as pressing a button and having a magical human voice.

If you are serious about the quality of your audio guide, you will need to listen carefully to each track and, in most cases, edit them manually. This editing process can take a long time as AI is not perfect yet and adjustments need to be made by humans.

First you have to allow time for the rendering. Once you have uploaded your script and chosen the voice you like best, you have to wait for your chosen platform to work its magic and produce the track. A short text of 65 words (= four lines) will take about 7 seconds to render. 7 seconds may not seem too bad... but it becomes very annoying as soon as you realize that you have to wait 7 seconds again and again for every edit you make.
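To get a feeling for how those waits add up, here is a rough back-of-the-envelope sketch. Only the 7-second figure for a short text comes from our experience described above; the number of tracks and the number of edits per track are hypothetical values chosen for illustration, and we assume every track is about as short as the 65-word example.

```python
# Only the 7-second render time comes from the article; everything else is
# an assumed, illustrative figure.
SECONDS_PER_RENDER = 7


def total_wait_seconds(tracks: int, edits_per_track: int,
                       seconds_per_render: float = SECONDS_PER_RENDER) -> float:
    """Waiting time: one initial render plus one re-render per edit, for every track."""
    return tracks * (1 + edits_per_track) * seconds_per_render


# Hypothetical guide: 40 short tracks, 10 edits each.
wait = total_wait_seconds(tracks=40, edits_per_track=10)
print(f"{wait / 60:.1f} minutes of pure waiting")  # prints "51.3 minutes of pure waiting"
```

Longer tracks will take longer to render, so for a real guide the total is likely to be higher.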

Only if you have a large amount of text to convert to voice, and don't care much about the quality of the result, will AI voices be not only affordable, but also fast.

3 - Faithful recording of the script

Anyone who has worked with human narrators will know that misread or transposed words and omissions are almost inevitable. Hence, a tedious but unavoidable part of an audio guide production is listening to all the audio tracks while reading the script for comparison to find those mistakes.

At least in this sense, AI voices are perfect: what you have in the script is what you get. But don't rejoice too soon: you're going to have to listen to them all anyway, albeit for other reasons (see below).

Those were the advantages. Now let's look at the cons:

Main disadvantages of AI voices for audio guides

1 - Lack of emotion

AI voices are getting better at sounding human. But no matter how good they get, machines are still machines and, at least in the near future, will not have the full human ability to express emotion. Voice actors are trained to play roles in a way that feels very real to the listener. If it's important to you that your voices are full of character, captivating, emotive and leave a lasting impression, human voices may be a better choice. Especially if you have a very specific idea of what you want your voices to sound like, an AI voice may not live up to your expectations.

Most current AI voice platforms allow you to set a tone for your voice, such as 'inspirational', 'promotional', 'sad', 'calm' or 'conversational'. It's a nice approach, but it takes a while to experiment with the different options and see how each of these tones sounds for each of the suggested AI voices, especially as you have to wait for the rendering to happen each time.

For example, if you are looking for an AI voice that expresses the very subtle sadness that the sentence "Unfortunately, the rest of the building was completely destroyed during the war" would require, you are likely to have difficulties, as the sad tone offered by AI voice generators tends to sound dramatic rather than subtle.

Subtlety is not (yet?) a skill that AI voices have mastered!

2 - Negative subconscious perception

A study conducted by CloudArmy tested participants' implicit responses to ads voiced by AI. It found that while people are usually unable to consciously tell the difference between human and AI voices, their implicit responses to artificial voices were less positive and less trusting than their responses to ads voiced by humans.

Even if the audio guide provides a different context than an ad does, these insights should be taken into consideration. In order to successfully integrate AI voices it’s important to understand how visitors will perceive them.

AI voices are like a piano sonata created by a synthesiser, sounding more perfect than a human could ever play, but unable to subconsciously resonate with us. They may lack the beauty of human imperfection.

3 - Mistakes difficult to correct

AI technology is still developing, and mistakes are not uncommon. For example, the machine might not recognize the punctuation conventions of a certain language and therefore struggle to produce the right intonation. Or numbers may not be pronounced correctly. These errors have to be fixed by humans, which can be very time-consuming and is not always possible.

This also means that someone on the team must speak the language in question in order to spot and correct mistakes. Otherwise you will easily get a soundtrack that sounds French... to you but not to a French native speaker!

Moreover, communicating instructions to an AI voice generator is difficult. To a human speaker you can say "please pause after every sentence" and they will understand immediately. But with an artificial voice, you have to manually point out all those pauses throughout the script and decide for each how long it should be.
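Some platforms accept pause markup in the script. As a minimal sketch, assuming your platform understands SSML (the W3C Speech Synthesis Markup Language) and its break element, a small script could insert the same pause between all sentences for you. The function name and the 500 ms default are our own hypothetical choices, and the sentence splitting is deliberately naive.

```python
import re
from xml.sax.saxutils import escape


def add_sentence_pauses(text: str, pause_ms: int = 500) -> str:
    """Wrap the text in <speak> and put an SSML <break> between consecutive sentences."""
    # Split after '.', '!' or '?' followed by whitespace. This is naive:
    # an abbreviation such as "Dr. Smith" would also be split.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() protects characters such as '&' that are special in XML.
    body = f" {pause} ".join(escape(s) for s in sentences)
    return f"<speak>{body}</speak>"


print(add_sentence_pauses("Welcome to the museum. Please follow me!"))
```

This only automates the uniform case. Pauses of different lengths still have to be decided one by one, and you will still need to listen to the result.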

Abbreviations are a major challenge. Most AI voices will probably pronounce the most well-known abbreviations, such as "NATO" or "NASA", correctly, because they have been instructed by the programmers to do so. But if you have "RIP" in your script, for example, the artificial voice will probably say "ripe", as if it were a fruit. You'll have to search for all the abbreviations in your script and replace them one by one with what you want to hear: 'Rest in peace' or, weirdly enough, 'are eye pee'.
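The search-and-replace step can be partly automated. The following is a minimal sketch, not a feature of any particular platform: the replacement table and the list of acronyms assumed to be read correctly are hypothetical and would have to be filled in per script and per language.

```python
import re

# Hypothetical replacement table; the "RIP" entry mirrors the example above.
EXPANSIONS = {"RIP": "rest in peace"}
# Acronyms we assume the voice already reads correctly.
KNOWN_OK = {"NATO", "NASA"}

# Any run of two or more capital letters standing alone as a word.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def expand_abbreviations(script, expansions=EXPANSIONS, known_ok=KNOWN_OK):
    """Return the rewritten script and a sorted list of abbreviations still to review."""
    unresolved = set()

    def replace(match):
        token = match.group(0)
        if token in expansions:
            return expansions[token]
        if token not in known_ok:
            unresolved.add(token)  # flag for a human decision
        return token

    return ABBREVIATION.sub(replace, script), sorted(unresolved)


text, todo = expand_abbreviations("The RIP stone stands near the NATO and UNESCO offices.")
print(text)
print(todo)
```

Note that the pattern also flags Roman numerals such as "XIV" and words written in capitals, so the list of unresolved items still needs a human eye.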

4 - The nightmare of foreign words

Another major challenge of AI voices is the pronunciation of foreign words, especially in the museum world, a sector that is particularly sensitive to cross-cultural communication.

For example, imagine having a fashion museum as a client with Ermenegildo Zegna dresses on display. Despite several attempts, we have not been able to get any of the AI voices to pronounce the name of this brand correctly, in one breath and with an accent on the syllable "gil". The English or French AI voices we have tried make pauses between each syllable, which is unbearable to an Italian ear. Especially if your client is based in Italy, you have a problem!

Most AI voice generators allow you to apply the International Phonetic Alphabet (IPA) for certain words. However, so far we have not seen very satisfactory results. Moreover, this approach requires a lot of patience and highly specialized knowledge, especially because it is not always possible to find the IPA transcription online.
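For platforms that accept SSML, IPA is typically supplied through the phoneme element. The sketch below assumes such support, which you should check in your platform's documentation. The helper name is hypothetical, and the transcription in the example is our own best attempt that a native speaker should verify before use.

```python
from xml.sax.saxutils import escape, quoteattr


def with_ipa(script: str, word: str, ipa: str) -> str:
    """Wrap every occurrence of `word` in an SSML <phoneme> tag carrying `ipa`."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    tag = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
    # Escape the script first so that the inserted tag itself is left untouched.
    return escape(script).replace(escape(word), tag)


# Assumed transcription, to be checked by an Italian speaker.
print(with_ipa("Designs by Ermenegildo Zegna.", "Ermenegildo Zegna",
               "ermeneˈdʒildo ˈdzeɲɲa"))
```

Even with correct markup, the result depends on how well the chosen voice renders sounds that do not exist in its own language, so listening to the track remains essential.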

How AI voices are applied in an audio guide

If you decide to use AI voices, you will need to register with an AI voice generator, such as Murf AI, ElevenLabs, PlayHT, LOVO AI, Narakeet, Resemble AI or Typecast, among many others. You can choose from a variety of voices representing different ages, genders, languages and moods. Once you have chosen a voice and uploaded your script, you can render your audio guide. From this point on there are two possible ways to proceed with the production of your AI audio guide:

The first option is to use the voices exactly as the AI voice generator creates them, without any further editing. You accept that there will be many errors, such as missing pauses and mispronunciations, which won't be removed and which visitors are likely to notice.
The second option is for a member of the team to manually edit the AI audio tracks by adding or removing pauses, adjusting pitch, speed and tone, and correcting mispronunciations where possible. This is how Nubart works with AI voices in our lowest service level, called Copper. It's much more time-consuming than the first option, but we take all the time we need to meet our high standards and achieve a final result that satisfies our customers, within the AI constraints described here.

What voice do you want?

If you take a closer look at AI voices compared to human voices, you will quickly see that AI voices are not perfect (yet). Human voices are still winning the "battle", also when it comes to the time and effort invested by the production and sound editing team. Human voices are able to convey real emotion, leave a lasting impression and tell the story you really want to tell. However, if your budget doesn't currently allow for human voices, AI voices can be a good alternative if expectations are kept realistic. These voices may not be able to live up to high standards, but they can still get the job done and provide valuable information to your visitors.