Ministry of Magic: Bringing life into fiction books with engaging audio

Wed Jul 03 2024 • Andrey Paznyak

Imagine diving into your favorite book, not just reading but truly experiencing the magic, emotions, and atmosphere as if you were there. This transformation in the way we experience text is at the heart of what we do at Peech. What started as a simple app to listen to web articles has evolved into a sophisticated platform for AI audio generation. With over a million downloads, our journey has brought us to the forefront of audiobook creation for both individuals and publishers. In this article, we will explore our approach to making audiobooks for publishers and how we got here.

Discovering the Magic of Fiction Lovers

At Peech, we cater to a variety of user categories. Our users include students who listen to research papers, professionals who keep up with science, finance, and sports articles from well-known sources, and individuals with dyslexia or ADHD who benefit from our text-to-speech technology. Last year, we discovered another enthusiastic group: fiction lovers who consume vast amounts of web novels, detective and romance stories, and other genres. Some of these users listen to up to 50 books per month! Intrigued by their dedication, we engaged in extensive conversations with them. While they were quite happy with our service, they also highlighted areas for improvement. They wanted more immersive and less monotonous narration, especially considering how long these books are. We quickly set up a feedback process, privately sending them new versions of audiobooks for assessment. This was our catalyst to innovate and bring stories to life in a whole new way.

From Text to Engaging Audio: A Harry Potter Example

To demonstrate how our technology evolved, how it works, and, more importantly, how it sounds, let’s use a scene from the first book of one of the most beloved fiction series, “Harry Potter and the Philosopher’s Stone.” This familiar text lets us showcase the transformation from static text to an immersive auditory experience.

The Original Text

Professor McGonagall now stepped forward holding a long roll of parchment. ‘When I call your name, you will put on the hat and sit on the stool to be sorted,’ she said. ‘Abbott, Hannah!’ A pink-faced girl with blonde pigtails stumbled out of line, put on the hat, which fell right down over her eyes, and sat down. A moment’s pause ‘HUFFLEPUFF!’ shouted the hat. The table on the right cheered and clapped as Hannah went to sit down at the Hufflepuff table. Harry saw the ghost of the Fat Friar waving merrily at her. ‘Potter, Harry!’ As Harry stepped forward, whispers suddenly broke out like little hissing fires all over the hall. ‘Potter, did she say?’ ‘The Harry Potter?’ The last thing Harry saw before the hat dropped over his eyes was the Hall full of people craning to get a good look at him. Next second he was looking at the black inside of the hat. He waited. ‘Hmmmm,’ said a small voice in his ear. ‘Difficult. Very difficult. Plenty of courage, I see. Not a bad mind, either. There’s talent, Oh my goodness! Yes! And a nice thirst to prove yourself! Now that’s interesting … So, where shall I put you?’ Harry gripped the edges of the stool and thought, ‘Not Slytherin, not Slytherin.’ ‘Not Slytherin, eh?’ said the small voice. ‘Are you sure? You could be great, you know, it’s all here in your head, and Slytherin will help you on the way to greatness, no doubt about that — no? Well, if you’re sure — better be GRYFFINDOR!’

Our Starting Point

[Audio sample: Single Voice Scene]

Initially, our narration was straightforward, with a single voice reading through the text. While functional, it lacked the emotional depth and variety that listeners crave.

Breathing Life into Characters

We started with a simple idea: to captivate our listeners, we needed to assign distinct voices to different characters and infuse the narration with emotion. This required sophisticated language models and natural language processing (NLP) techniques. There are two common ways books are written:

First-Person View: “I came into the room and saw Ron standing in front of me.”
Third-Person View: “Harry came into the room and saw Ron standing in front of him.”

In the first-person view, a character narrates the story, so we keep their voice consistent in dialogs. In the third-person view, we use a separate voice for the narrator (the author) and different voices for each character in the dialogs. All right, here’s a breakdown of the task:

Check if the text is narrated from a 1st or 3rd person POV.
Extract all characters involved in dialogs.
Define the characters’ personalities, attributes (age, gender), and relationships.
Assign a dedicated voice to each character based on their personalities.
Track in the text to whom each line belongs.
Identify the sentiment of each line (angry, shouting, whispering, friendly, hopeful, etc.)
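
To make the first and fifth steps more concrete, here is a minimal Python sketch, not the production pipeline. It assumes dialog is wrapped in curly single quotes, as in the excerpt above, and uses a tiny list of speech verbs. The names detect_pov, attribute_quotes, and the 2% threshold are hypothetical and chosen only for illustration.

```python
import re

# A quote opens with ‘ and closes with ’ followed by whitespace,
# punctuation, or end of text, so apostrophes inside words are skipped.
QUOTE = re.compile(r"‘(.+?)’(?=[\s.,;:!?]|$)", re.S)
FIRST_PERSON = re.compile(r"\b(?:I|me|my|mine|myself|we|us|our)\b", re.I)

# Hypothetical, deliberately short list of speech verbs.
VERBS = r"(?:said|shouted|asked|whispered)"
TAG = re.compile(
    rf"\s*(?:{VERBS}\s+(?P<after>(?:the\s+|a\s+)?\w+(?:\s\w+)?)"
    rf"|(?P<before>\w+)\s+{VERBS}\b)"
)


def detect_pov(text, threshold=0.02):
    """Guess the point of view from pronouns found outside of dialog."""
    narration = QUOTE.sub(" ", text)
    words = re.findall(r"[A-Za-z]+", narration)
    if not words:
        return "third"
    first_person = len(FIRST_PERSON.findall(narration))
    return "first" if first_person / len(words) > threshold else "third"


def attribute_quotes(text):
    """Pair each quoted line with the speech tag that directly follows it."""
    results = []
    for match in QUOTE.finditer(text):
        tag = TAG.match(text, match.end())
        speaker = None
        if tag:
            speaker = tag.group("after") or tag.group("before")
        results.append((speaker, match.group(1)))
    return results


sample = (
    "‘When I call your name, you will put on the hat,’ she said. "
    "‘HUFFLEPUFF!’ shouted the hat. "
    "He waited. ‘Not Slytherin, eh?’ said the small voice."
)

print(detect_pov(sample))
for speaker, line in attribute_quotes(sample):
    print(speaker, "->", line)
```

On this sample the sketch reports a third-person narration and the raw speakers "she", "the hat", and "the small voice". Resolving "she" to Professor McGonagall, or handling quotes with no speech tag at all, is exactly the kind of work a regular expression cannot do and where language models come in.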

This is what we have as a result:

Professor McGonagall: A stern yet kind middle-aged female voice.
The Sorting Hat: An ancient, deep, and expressive male voice.
Harry Potter: A young, hopeful teenage male voice.
Narrator POV: Third person, so a dedicated voice for the narrator is needed.
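
One way to picture the voice assignment is as a matching problem between character profiles and a voice catalog. The Python sketch below is an illustration under assumptions: the voice identifiers, the scoring weights, and the narrator's attributes are all hypothetical, and a real system may work quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str          # character name, or a hypothetical voice identifier
    gender: str
    age: str
    traits: frozenset


def score(character, voice):
    """Higher is better: gender and age weigh more than shared traits."""
    points = 2 if character.gender == voice.gender else 0
    points += 2 if character.age == voice.age else 0
    points += len(character.traits & voice.traits)
    return points


def assign_voices(characters, catalog):
    """Greedily give every character the best voice that is still free."""
    available = list(catalog)
    assignment = {}
    for character in characters:
        if not available:
            raise ValueError("not enough voices in the catalog")
        best = max(available, key=lambda voice: score(character, voice))
        assignment[character.name] = best.name
        available.remove(best)  # keep voices distinct across characters
    return assignment


# Hypothetical catalog; these are not real TTS voice names.
catalog = [
    Profile("f_mid_stern", "female", "middle-aged", frozenset({"stern", "kind"})),
    Profile("m_old_deep", "male", "old", frozenset({"deep", "expressive"})),
    Profile("m_teen_bright", "male", "teen", frozenset({"hopeful"})),
    Profile("n_neutral", "neutral", "adult", frozenset({"calm"})),
]

characters = [
    Profile("Professor McGonagall", "female", "middle-aged", frozenset({"stern", "kind"})),
    Profile("The Sorting Hat", "male", "old", frozenset({"deep", "expressive"})),
    Profile("Harry Potter", "male", "teen", frozenset({"hopeful"})),
    # The narrator's attributes are an assumption made for this sketch.
    Profile("Narrator", "neutral", "adult", frozenset({"calm"})),
]

voices = assign_voices(characters, catalog)
for name, voice_id in voices.items():
    print(name, "->", voice_id)
```

Removing each chosen voice from the pool is the important design choice here: it guarantees that two characters never share a voice within the same book.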

Our technology automates the assignment of these voices, ensuring each line is delivered with the appropriate emotion, whether it’s excitement, nervousness, or curiosity. The result is a far richer and more engaging narrative. Check it out:

[Audio sample: multi-voice narration]

Adding Magic with Sound Effects

We didn’t stop at character voices. To further immerse our listeners, we added short sound effects and ambient sounds.

Short Sound Effects

These have short and consistent durations, adding dynamic elements to the narration. We categorize them into various groups:

Animal Sounds: Roar, bark
Battle Sounds: Gunshot, sword clang
People Sounds: Breath, heartbeat
Object Sounds: Opening and closing doors
Steps: Indoor/outdoor running, walking
Device Sounds: Mobile notification, phone ring
… and much more

To incorporate short sound effects, we only need to know where the sound starts in the text, and we play it once. Finding the right effect and creating a library of those was a fun and challenging task. We started with sounds like an old phone ringing, but quickly realized that modern stories required the sounds of new mobile phones. This constant need to extend our library or generate proper sounds on the fly has kept our work dynamic and ever-evolving.
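
Because a short effect only needs a starting position, a cue can be as simple as a character offset plus an effect name. The Python sketch below illustrates that idea with keyword triggers. The trigger words, the effect names, and the min_gap parameter are hypothetical, and keyword matching is used here only to keep the example small.

```python
import re
from typing import NamedTuple


class Cue(NamedTuple):
    offset: int  # character index in the text where the effect starts
    effect: str  # key into a (hypothetical) sound library


# Hypothetical triggers chosen for this example only.
TRIGGERS = {
    "applause": re.compile(r"\b(?:cheered|clapped|applauded)\b", re.I),
    "footsteps": re.compile(r"\bstepped\s+forward\b", re.I),
    "heartbeat": re.compile(r"\bgripped\b", re.I),
}


def find_cues(text, min_gap=80):
    """Return cues sorted by position, one per cluster of nearby triggers."""
    cues = []
    for effect, pattern in TRIGGERS.items():
        last = None
        for match in pattern.finditer(text):
            # Skip repeats of the same effect that are too close together,
            # e.g. "cheered and clapped" should play applause only once.
            if last is not None and match.start() - last < min_gap:
                continue
            cues.append(Cue(match.start(), effect))
            last = match.start()
    return sorted(cues)


scene = (
    "The table on the right cheered and clapped as Hannah went to sit down. "
    "As Harry stepped forward, whispers suddenly broke out. "
    "Harry gripped the edges of the stool."
)

for cue in find_cues(scene):
    print(cue.offset, cue.effect)
```

Each cue is played once at its offset, so the only remaining work is converting the character offset into a timestamp in the generated narration.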

Here is the next version of our audio, enriched with footsteps, applause, and even the sound of a fast-beating heart to reflect Harry’s anxiety:

[Audio sample: narration with short sound effects]

Full scene:

[Audio sample: full scene]

Ambient Sounds

For longer scenes, we incorporated ambient sounds like weather effects and background music to set the mood and create a sense of place. Apart from matching the context, the technical challenge lies in determining the duration, starting and ending points, and mixing these sounds with the short sound effects and main narration seamlessly.
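
The mixing problem can be sketched in a few lines of Python. This is a simplified illustration, not our actual audio engine: tracks are plain lists of mono samples between -1.0 and 1.0, and the function name mix, the gain, and the fade length are assumptions made for the example.

```python
def mix(narration, ambient, start, end, ambient_gain=0.2, fade=4, effects=()):
    """Mix narration, a looped ambient bed, and one-shot effects.

    narration, ambient: lists of float samples in [-1.0, 1.0]
    start, end: sample indices where the ambient scene begins and ends
    effects: iterable of (offset, samples) pairs, each played once
    """
    out = list(narration)
    end = min(end, len(out))
    span = max(0, end - start)
    if ambient:
        for i in range(span):
            # Loop the ambient clip so it covers the whole scene.
            sample = ambient[i % len(ambient)]
            # Linear fade-in and fade-out at the edges of the scene.
            edge = min(i + 1, span - i)
            envelope = min(1.0, edge / fade) if fade else 1.0
            out[start + i] += sample * ambient_gain * envelope
    for offset, clip in effects:
        for j, sample in enumerate(clip):
            if 0 <= offset + j < len(out):
                out[offset + j] += sample
    # Hard-clip the sum so it stays in the valid sample range.
    return [max(-1.0, min(1.0, s)) for s in out]


mixed = mix(
    narration=[0.0] * 10,
    ambient=[1.0, 1.0],
    start=2,
    end=8,
    ambient_gain=0.5,
    fade=2,
    effects=[(4, [0.8])],
)
print(mixed)
```

Keeping the ambient gain well below the narration level and fading at scene boundaries are the two choices that matter most in this sketch. A production mixer would also need proper limiting instead of hard clipping.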

Here is the result, with ambient sound that matches the theme and sentiment:

[Audio sample: narration with ambient sound]

The Power of Advanced Technology

Under the hood, there’s a lot of heavy tech stuff happening: advanced Large Language Models (LLMs), Natural Language Processing (NLP), and, in some cases, even old-school regular expressions to manage the tasks outlined above. Our own AI synthesis plays a significant role in creating high-quality audio. However, this year some companies introduced completely new models for AI voice generation whose quality and intonation are extremely hard to catch up with. In some cases, these advancements can enhance the end result even further.

So here is the same audio but narrated with ElevenLabs voices:

[Audio sample: narration with ElevenLabs voices]

The Shift to B2B

[Image: B2B, startup, B2C]

Our journey into B2B was driven by a simple realization: we had developed a quite unique solution for making audiobooks. However, the processing required for these advanced features is too intensive for real-time use in our iOS app. Thus, we turned to the B2B market, where we could deliver high-quality, pre-processed audiobooks directly to the copyright holders. This decision quickly bore fruit. We secured contracts with companies providing web novel and fiction reading services, offering them a cost-effective way to transform their content into professional-grade audiobooks. These partnerships allowed even amateur authors to bring their stories to life in an engaging format, reaching a wider audience and enhancing their readers’ experience. By focusing on B2B, we could ensure that our technology had the time and resources needed to perform at its best, without the constraints of real-time processing. This move has opened up new opportunities and marked a significant milestone in our company’s journey. In fact, we have already produced over 30,000 hours of audiobooks, at a cost and speed that were previously unimaginable. Complex audiobooks that once took weeks to produce can now be generated in just a few hours.

About the Authors

Peech was co-founded in 2020 by Aliaksei Horbel and Andrey Paznyak, who saw the potential for turning text into audio experiences for individuals. Both have technical backgrounds. Andrey, the CEO, oversees the company’s operations, while Aliaksei leads the B2B direction and is the main architect behind the innovative solutions described here.

Conclusion

Our foray into B2B has not only allowed us to showcase the true potential of our technology but also to bring stories to life in ways that were previously unimaginable. As we continue to innovate and refine our offerings, we look forward to sharing more about our journey and the magic we’re creating in the world of audiobooks. There are quite a lot of technical challenges in every step of the process described above, so we’ll keep the details for future articles. Cheers!
