
Voice Content and Usability – A List Apart

Conversations have been taking place since long before recorded history. People have conveyed information, conducted transactions, and simply kept tabs on one another through a blend of spoken dialogue and nonverbal cues. Only in the past few millennia have we begun committing oral exchanges to writing, and only in the past few decades have we delegated that task to computers – machines that show a marked preference for formal, written communication over the colloquialisms inherent in spoken language.

Computers struggle because of the disconnect between spoken and written language: speech is the more primal of the two. To hold genuinely productive conversations with us, machines must navigate the complexities of human communication, embracing the nuances of spoken language, from disfluencies and pauses to nonverbal cues and regional dialects. In face-to-face interactions, spoken language enjoys a unique advantage: direct nonverbal communication, which lets us effortlessly read and respond to subtle social cues.

Written language solidifies into a tangible form once committed to the page, retaining usages long after they have expired in spoken communication – the lingering “To whom it may concern,” for example – and leaving behind a fossil record of obsolete terms and expressions. But because written text tends to be deliberate, concise, and polished, it is often easier for machines to process and comprehend.

Spoken language lacks such luxuries. Beyond the nonverbal cues that imbue conversations with emphasis and emotional depth, verbal cues and vocal behaviors shape dialogue through subtle yet significant modulations: what is said can be conveyed in many ways. Through tone, pace, and inflection – rapid-fire or measured, low-pitched or high-decibel, with or without a hint of sarcasm, stiltedness, or resignation – our spoken language conveys nuances that written words often struggle to capture. As designers and content strategists, we face exciting hurdles when crafting voice interfaces that facilitate seamless spoken conversations with machines.

We interact with voice interfaces across a broad spectrum of applications, but according to Michael McTear, Zoraida Callejas, and David Griol, our motivations largely mirror the reasons we initiate conversations with other people. Typically, we start a conversation because:

  • We want something to happen: a single transaction with a concrete outcome.
  • We want to know something: specific information of some kind.
  • We’re social creatures, and we crave interactions and dialogue that foster connection and understanding.

Each of these motivations also shapes how a voice interaction begins and ends: a single interaction, from start to finish, that achieves a specific outcome for the user, commencing with the voice interface’s initial greeting and concluding when the user exits. Note one key distinction, though: a conversation in the human sense – a discussion between people that yields some outcome and lasts an arbitrary length of time – can comprise multiple transactions, information exchanges, and prosocial exchanges in sequence. In other words, a conversation can take many forms of interplay, and it’s misleading to equate every conversation with a single voice interaction.

Pure, unadulterated small talk tends to come across as insincere in most voice interfaces, because machines can’t genuinely grasp our emotional state or engage in the social niceties humans take for granted. Whether users even want natural human dialogue that opens with a prosocial tone and transitions seamlessly between forms remains a topic of debate. Michael Cohen, James Giangola, and Jennifer Balogh recommend meeting customer expectations by mirroring how users already interact with other voice interfaces, rather than trying to be overly human and risking alienating them in the process.

Two primary types of conversations exist between humans and voice interfaces, just as they do between humans: one yields a tangible outcome (“Order iced tea”), while the other educates or informs on a specific topic (“Tell me about jazz”).

Transactional voice interactions

Unless you’re placing an order through a meal delivery app by clicking buttons, most food purchases involve a conversation – and, increasingly, a voice interaction – such as ordering a Hawaiian pizza with extra pineapple. As we approach the counter, the conversation shifts quickly from casual pleasantries to the primary objective: a pizza generously topped with pineapple, precisely as requested. Here’s how such an exchange might unfold between a customer, Alison, and a counter worker, Burhan.

Alison: Hey, how’s it going?

Burhan: Hi, welcome to Crust Deluxe! It’s chilly out there. How can I help you?

Alison: May I order a Hawaiian pizza with an extra helping of pineapple, please?

Burhan: Sure, what size?

Alison: Large.

Burhan: Anything else?

Alison: No thanks, that’s it.

Burhan: Something to drink?

Alison: I’ll take a cold bottle of Coke.

Burhan: You got it. That’ll be thirteen fifty-five and about fifteen minutes of your time.

With each progressive disclosure, the outcome of the transaction becomes clearer, culminating in the definitive result: a service provided or a product received. Transactional conversations have distinct characteristics: they’re straightforward, focused, and concise, and they dispense quickly with pleasantries.
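To make the pattern concrete, here’s a minimal sketch of a transactional exchange modeled as slot filling, where the interface prompts only for the details it still needs. Everything here – the slot names, prompts, and replies – is hypothetical and simplified; a real voice interface would layer speech recognition and natural-language understanding on top.

```typescript
// A hypothetical pizza-ordering transaction modeled as slot filling.
type Order = { size?: string; extras?: string[]; drink?: string };

const slots: { prompt: string; fill: (order: Order, reply: string) => void }[] = [
  { prompt: 'Sure, what size?', fill: (o, r) => { o.size = r; } },
  { prompt: 'Anything else?', fill: (o, r) => { o.extras = r === 'no' ? [] : [r]; } },
  { prompt: 'Something to drink?', fill: (o, r) => { if (r !== 'no') o.drink = r; } },
];

// Walks the slots in order, pairing each prompt with the user's reply,
// then closes the transaction by confirming the outcome.
function runTransaction(replies: string[]): string {
  const order: Order = {};
  slots.forEach((slot, i) => {
    console.log(`Interface: ${slot.prompt}`);
    console.log(`User: ${replies[i]}`);
    slot.fill(order, replies[i]);
  });
  return `You got it: one ${order.size} pizza${order.drink ? ` and a ${order.drink}` : ''}.`;
}

console.log(runTransaction(['large', 'no', 'Coke']));
// -> You got it: one large pizza and a Coke.
```

Notice how the transaction ends the moment every slot is filled: nothing about the exchange invites lingering, which is exactly what makes it transactional.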

Informational voice interactions

Other conversations, meanwhile, are mostly about gathering information. Alison might visit Crust Deluxe without any intention of leaving with a pizza at all. As she surveys the menu, her main concern may be whether the restaurant offers halal or kosher options, gluten-free choices, and so on. The exchange still opens with a brief courtesy, but its ultimate goal is something larger: information.

Alison: Hey, how’s it going?

Burhan: Hi, welcome to Crust Deluxe! It’s chilly out there. How can I help you?

Alison: Can I ask you a few questions?

Burhan: Of course! Go right ahead.

Alison: Do you have any halal options on the menu?

Burhan: Absolutely! We can make any pie halal on request, and we have plenty of vegetarian, ovo-lacto, and vegan options too. Are there any other dietary restrictions you’re thinking about?

Alison: What about gluten-free pizzas?

Burhan: Certainly! We can do a gluten-free crust on both our deep-dish and thin-crust pizzas. Anything else you’d like to know?

Alison: Good to know. That’s it for now. Thanks!

Burhan: Anytime, come back soon!

This conversation has a different shape. Its goal is to gather a precise and comprehensive set of information: these exchanges are inquiry-driven, collecting insights, details, and specifics in pursuit of deeper understanding. Informational voice interactions tend to unfold at a more leisurely pace than transactional conversations, and users appreciate fuller explanations that give a comprehensive picture of the key points and outcomes.
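By contrast, an informational exchange can be sketched as a lookup from recognized questions to answers. This is a deliberately naive illustration, with keyword matching standing in for real natural-language understanding; the keywords and copy are hypothetical.

```typescript
// A hypothetical FAQ-style informational exchange; keyword matching
// stands in for real natural-language understanding.
const answers: Record<string, string> = {
  halal: 'Absolutely! We can make any pie halal on request.',
  gluten: 'We can do a gluten-free crust on deep-dish and thin-crust pizzas.',
};

// Returns the first matching answer, or a graceful fallback that keeps
// the conversation open for more questions.
function answer(question: string): string {
  const q = question.toLowerCase();
  for (const keyword of Object.keys(answers)) {
    if (q.includes(keyword)) return answers[keyword];
  }
  return "Sorry, I don't know that one. Anything else you'd like to ask?";
}

console.log(answer('Do you have any halal options?'));
// -> Absolutely! We can make any pie halal on request.
```

Unlike the transactional sketch, there is no fixed endpoint here: the user decides when the inquiry is complete.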

Ultimately, conversation is a means to an end: helping users achieve their goals. And just because an interface has a voice component doesn’t mean every user interaction with it is driven by voice. Multimodal voice interfaces can lean on visual aids like screens for support; voice-only interfaces, the focus of this book, have no visual elements to fall back on and therefore demand more cognitive effort from users.

While voice interfaces have long fascinated science-fiction enthusiasts, only recently have those futuristic concepts become a tangible reality, thanks to the emergence of several kinds of voice interfaces.

Interactive voice response (IVR) systems

Although conversational interfaces have been fixtures of computing for decades, voice interfaces originated in the early 1990s with text-to-speech (TTS) dictation software that read written text aloud and speech-enabled in-car systems that gave drivers turn-by-turn directions. With the advent of IVR technologies came the first true voice interfaces capable of authentic conversation, offering a viable alternative to overburdened customer-support representatives.

IVR systems initially enabled organizations to reduce their dependence on traditional call centers, but they soon gained notoriety for their awkwardness. Conceived in the corporate realm as virtual switchboards guiding customers to live representatives (“Please say ‘book a flight’ or ‘explore our itineraries’”), they’re what you encounter when calling an airline or hotel chain’s customer service line. Despite their limitations and customers’ frustration at being unable to speak directly with a human, IVR systems saw widespread adoption across industries in the early 1990s.

While IVR systems excel at handling highly formulaic, routine interactions that rarely deviate from a predictable script, their conversational tone falls flat compared to the nuanced exchanges we have in everyday life or imagine in science fiction.

Screen readers

As IVR technologies advanced, another innovation emerged in parallel: the screen reader, which converts visual content into synthesized speech. Screen readers are the primary way blind and visually impaired users engage with text-based, multimedia, and interactive elements on the web, and they represent the closest modern-day equivalent to an out-of-the-box delivery of content via voice.

Among the first screen readers was the Screen Reader for the BBC Micro and NEC Portable, developed by the Research Centre for the Education of the Visually Handicapped at the University of Birmingham in 1986. That same year, IBM’s Jim Thatcher created the first screen reader software for text-based computers, later adapted for GUI-based systems.

As the web grew rapidly in the 1990s, demand surged for accessible tools to build and browse websites. Thanks to the growing adoption of semantic HTML and, since 2008, ARIA roles, screen readers let users navigate websites in an auditory and sequential manner rather than a visual and physical one, making the web genuinely usable for people with disabilities. Screen readers must translate visual design cues such as proximity and proportion into useful auditory information, a task made far easier when markup is crafted with intention.

While screen readers offer voice interface designers crucial insights, their usability can be a significant concern because of how complex and wordy their readouts become. The visual structure of websites and online navigation often translates poorly into speech, producing cumbersome descriptions that enumerate every interactive HTML element and announce every styling change. For many screen reader users, navigating web-based interfaces can be cognitively taxing.

In a thought-provoking piece, accessibility advocate and voice engineer Chris Maury questions whether the screen reader experience truly serves users who rely solely on voice:

From the beginning, I hated the way that screen readers work. Why are they designed the way they are? It makes no sense to present information visually and only then translate that into audio. All of the time and effort that goes into creating the perfect user experience for an app is wasted, or even worse, adversely impacts the experience for blind users.

Often, expertly crafted voice interfaces can guide users to their destination faster than long-winded screen reader readouts. After all, sighted users can navigate an interface freely, searching for relevant information and skipping past non-essential content; blind users, meanwhile, must listen to every phrase synthesized into speech, which makes brevity paramount. People with disabilities who have long had to rely on cumbersome screen readers may find a more seamless experience in purpose-built voice interfaces and voice assistants.

Voice assistants

Given the proliferation of voice interfaces in homes, smart devices, and offices, it’s natural to conjure images of HAL from 2001: A Space Odyssey or Majel Barrett’s iconic voice as the omnipresent computer in Star Trek. Voice assistants act as personal attendants that answer queries, manage schedules, perform searches, and execute a range of daily tasks. They are also rapidly garnering attention from accessibility advocates for their considerable potential to facilitate equal access.

In 1987, Apple released an iconic demonstration video showcasing its concept for the Knowledge Navigator, a pioneering voice assistant capable of transcribing spoken words and understanding human language with remarkable precision. In 2001, Tim Berners-Lee and others envisioned a Semantic Web “agent” that could perform everyday tasks such as checking calendars, scheduling appointments, and finding locations. It wasn’t until 2011, with the release of Apple’s Siri, that voice assistants became a tangible reality for consumers.

Despite the abundance of voice assistants available today, they differ significantly in how programmable and customizable they are. At one extreme, everything outside vendor-provided features is locked down: at launch, for example, Apple’s Siri and Microsoft’s Cortana couldn’t be extended beyond their built-in capabilities. Even today, it remains difficult to program Siri to execute arbitrary tasks, because developers have no direct access to the underlying platform apart from predefined functionality for sending messages, hailing rideshares, booking restaurant reservations, and similar tasks.

At the other end of the spectrum, voice assistants such as Amazon Alexa and Google Home offer a fertile foundation for developers to build custom voice interfaces. As a result, these customizable, extensible assistants have gained immense popularity among developers seeking alternatives to Siri’s and Cortana’s limitations. Amazon offers the Alexa Skills Kit, a development platform for building custom voice experiences for Amazon Alexa, while Google Home supports custom actions for Google Assistant. Today, users can choose from thousands of custom-built skills within both the Amazon Alexa and Google Assistant ecosystems.
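As a concrete illustration, here’s a minimal sketch of a custom Alexa skill handler using the Alexa Skills Kit SDK for Node.js. The intent name and response copy are hypothetical, and a real skill would also need an interaction model configured in the Alexa developer console.

```typescript
import * as Alexa from 'ask-sdk-core';

// Handles a hypothetical "MenuQuestionIntent" defined in the skill's
// interaction model; the response copy is illustrative only.
const MenuQuestionIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'MenuQuestionIntent';
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('Absolutely! We can make any pie halal on request.')
      .reprompt('Anything else you would like to know?')
      .getResponse();
  },
};

// Wires the handler into a Lambda entry point for the skill.
export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(MenuQuestionIntentHandler)
  .lambda();
```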

Because voice assistants like Amazon Alexa and Google Assistant are far more programmable, they are generally more versatile options than Apple’s Siri.

Even as tech giants Amazon, Apple, Microsoft, and Google dominate the market, they continue to release and open-source tools and frameworks that allow designers and developers to create voice interfaces without extensive coding expertise.

Typically, voice assistants like Amazon Alexa are tightly coupled to a device and can’t be accessed independently on a PC or smartphone. As demand for seamless omnichannel experiences grows, development platforms such as Google’s Dialogflow let teams build a single conversational interface that deploys as a voice interface, a text-based chatbot, and an interactive voice response (IVR) system all at once. While this book doesn’t dictate specific implementation methods, Chapter 4 explores how these factors can shape the creation of your design artifacts.
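To illustrate the channel-agnostic approach, here’s a minimal sketch of sending one turn of user input to a Dialogflow ES agent through its detectIntent API from Node.js. The project ID and session handling are placeholders; under this setup, the same agent could back a voice interface, a chatbot, or an IVR deployment.

```typescript
import { SessionsClient } from '@google-cloud/dialogflow';

const client = new SessionsClient();

// Sends one turn of user text to a Dialogflow ES agent and returns the
// agent's reply; 'my-project' is a placeholder project ID.
async function detect(sessionId: string, text: string): Promise<string> {
  const session = client.projectAgentSessionPath('my-project', sessionId);
  const [response] = await client.detectIntent({
    session,
    queryInput: { text: { text, languageCode: 'en-US' } },
  });
  return response.queryResult?.fulfillmentText ?? '';
}

// The same call serves any channel: a voice front end passes transcribed
// speech, while a chatbot passes typed text.
detect('session-123', 'Do you have any halal options?').then(console.log);
```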

Simply put, voice content is content delivered through voice. To preserve what makes human dialogue so compelling, voice content must flow effortlessly and naturally, stand alone without surrounding context, and avoid unnecessary elaboration – everything written content often isn’t.

Voice content is everywhere, from text-to-speech tools that read website content aloud to voice assistants reciting weather forecasts and automated phone systems guided by interactive voice response technologies. Throughout this book, our primary concern is content delivered through voice alone – by necessity, rather than as an option.

For many of us, our first foray into informational voice interfaces will be delivering content to users. There’s just one problem: the content we already have isn’t remotely prepared for its new habitat. How do we make the content stuck on our websites more conversational? And how do we craft new content that naturally lends itself to conversational interfaces?

Lately, the way we fragment and reassemble our content has reached unprecedented levels of complexity. Websites are vast repositories of long-form content that can stretch to a seemingly infinite scroll within a browser window, like digital microfilm readers paging through archival databases. In 2002, well before the advent of voice assistants, Anil Dash wrote about microcontent: permalinkable content that remains readable in a variety of settings, like email or text messages.

Examples of microcontent include a daily weather forecast, time-sensitive details such as arrival and departure times for an airplane flight, concise summaries drawn from lengthy publications, or a brief message communicated instantaneously.

I’d update Dash’s definition of microcontent to encompass the entire spectrum of brief, easily digestible content formats, far beyond just written messages. After all, today’s interfaces are full of isolated snippets of copy, like chatbots confirming restaurant reservations.

Thinking in terms of microcontent pushes the boundaries of your content’s potential, preparing it for both traditional and emerging delivery channels.
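One way to picture channel-agnostic microcontent is as small, structured records that any channel can render. The shape below is purely illustrative, not a standard:

```typescript
// A purely illustrative model of microcontent: small, self-contained
// records that carry no assumptions about where they'll be delivered.
interface Microcontent {
  id: string;          // stable, permalinkable identifier
  text: string;        // brief, self-contained copy
  expiresAt?: Date;    // optional, for time-sensitive items like forecasts
}

// The same record can feed a web page, a chatbot, or a voice readout.
const forecast: Microcontent = {
  id: 'weather-2024-12-14',
  text: 'Cloudy with a high of 41 and a chance of snow this evening.',
  expiresAt: new Date('2024-12-15'),
};

// A voice channel might simply speak the text field verbatim.
console.log(`Voice readout: ${forecast.text}`);
```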

As microcontent, voice content is unique because it is experienced in time rather than in space. We can glance at an underground digital sign to check when the next train arrives, but voice interfaces hold our attention captive for prolonged periods, leaving us unable to glance away or skip ahead the way readers of screen-based content can.

Because microcontent is essentially composed of disconnected fragments with no inherent connection to the channels where they’ll ultimately reside, we must ensure that our microcontent truly excels as voice content – and that means focusing on the two most critical attributes of effective voice content: legibility and discoverability.

Both the legibility and discoverability of our voice content have everything to do with the fact that voice content is experienced in time rather than in space.
