Texthelp SpeechStream Overview

Rev #8 – November, 2011 


Table of Contents

1. Introduction
2. SpeechStream Server
    2.1 Caching
    2.2 Speech Server Configuration Options
        2.2.1 Explanation of Terms
        2.2.2 Texthelp-Hosted Speech Server
        2.2.3 Texthelp-hosted Speech Server With External Cache
        2.2.4 Customer-Hosted Speech Server
    2.3 SpeechStream Server Specification and Performance
        2.3.1 System Requirements
        2.3.2 Text To Speech Performance
    2.4 Cache Server Specification and Performance
        2.4.1 Scalability
3. End-user software
    3.1 SpeechStream Toolbar (HTML)
        3.1.1 Web Browser Compatibility
    3.2 Flash
    3.3 Custom access

© Copyright Texthelp Systems Ltd. 2011 

TextHELP Systems, Inc. 

Tel: (888) 248-0652 • Fax: (866) 248-0652 • u.s.info@texthelp.com • www.texthelp.com

1. Introduction 

The Texthelp SpeechStream Server delivers high-quality computer-generated speech for web-based applications, complete with synchronized, dual-color, word-by-word highlighting.

It does not require the installation of any speech software on end-user computers.

[Figure: Example of dual-colored highlighting]

The solution is scalable, can be used in a variety of application platforms, and is simple for the customer to implement.

The SpeechStream Server solution consists of the following major components:

• The SpeechStream Server itself (which actually generates the audio)
• An optional speech cache device (to improve performance for repeat requests)
• End-user software (to communicate with the server and deliver the audio in the customer application)

Supported application environments include:

• HTML – a fully featured speech toolbar (the SpeechStream Toolbar) can be easily integrated into existing customer web pages.
• Flash – the toolbar can be embedded in the web page and driven by function calls made from inside the Flash application. Additional direct server calls can also be made.



2. SpeechStream Server 

The SpeechStream Server is a dedicated computer that carries out the following functions:

• Accept speech requests from the client application
• Use a text-to-speech engine to generate audio for the supplied text
    o This can apply pronunciation rules to correct how words are spoken
• Supply an audio file (MP3) and timing information (XML) so the web application can stream the audio and highlight text as it is spoken.
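
As a rough illustration of how timing information can drive word-by-word highlighting, the following Python sketch looks up which word should be highlighted at a given playback position. The actual SpeechStream timing format is not described in this document, so the XML layout (the `timings` and `word` elements and the `start_ms` attribute) and the function names are invented purely for illustration.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical timing data: one <word> element per spoken word, with the
# offset (in milliseconds) at which that word starts in the audio.
SAMPLE_TIMING_XML = """
<timings>
  <word start_ms="0"/>
  <word start_ms="320"/>
  <word start_ms="610"/>
</timings>
"""


def load_word_starts(xml_text):
    """Return the sorted list of word start times found in the timing XML."""
    root = ET.fromstring(xml_text.strip())
    return sorted(int(word.get("start_ms")) for word in root.iter("word"))


def word_at(starts, position_ms):
    """Return the index of the word being spoken at position_ms, or None."""
    index = bisect.bisect_right(starts, position_ms) - 1
    return index if index >= 0 else None


starts = load_word_starts(SAMPLE_TIMING_XML)
current_word = word_at(starts, 400)  # index of the word to highlight at 400 ms
```

A player would call `word_at` repeatedly with the current audio position and move the highlight whenever the returned index changes.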

2.1 Caching 

Speech generation and conversion of the output audio to MP3 files are computationally expensive. There are two potential “bottlenecks” in a speech server system:

• High load – when the number of users accessing the speech server is very high.
• Text-To-Speech performance – slower Text-To-Speech voices may not support large numbers of simultaneous users.

By using a cache, repeat requests for the same text can bypass the speech generation process entirely. In most speech-enabled applications the content is largely static, and a speech cache is highly beneficial.

If a particular speech engine has a lower level of performance, the audio content can be generated in advance by reading through all the content to ensure it is 100% cached. The speech engine on the server is then only used when new content is generated or existing content is updated.

• The speech server itself has a built-in cache that it uses to speed up repeat requests for a previously generated text string.
• The speech server can also be configured to use an external cache, entirely separate from the speech server, for even faster performance in high load scenarios.
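
One common way to recognize a repeat request is to derive a cache file name from the text together with the voice settings. The Python sketch below shows that idea under stated assumptions: the hashing scheme, the whitespace normalization and the `cache_key` function are illustrative and are not taken from the SpeechStream implementation.

```python
import hashlib


def cache_key(text, voice, speed):
    """Build a file name that is identical for identical speech requests.

    Assumption: the same text spoken with a different voice or speed is
    different audio, so both settings are part of the key.
    """
    normalized = " ".join(text.split())  # collapse runs of whitespace
    payload = f"{voice}|{speed}|{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest() + ".mp3"


key_a = cache_key("The quick brown fox.", "Jill", 100)
key_b = cache_key("The  quick brown   fox.", "Jill", 100)  # extra spaces only
key_c = cache_key("The quick brown fox.", "Jill", 80)      # slower speed
```

Here `key_a` and `key_b` name the same cached file, while `key_c` names a different one because the playback speed differs.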




2.2 Speech Server Configuration Options 

Several server configurations are possible, depending on customer requirements such as expected user load, dynamic versus static content and size of content.

Texthelp can advise customers which configuration best suits their needs. Texthelp can also offer consultation and advice on suitable customized solutions if the application does not fit exactly into the three configurations described here.

2.2.1 Explanation of Terms 

Dynamic content is content that will change from one user session to the next. User-typed text, such as that typed in a form field on a webpage, is considered dynamic. Pages created from a content management system (such as a commercial website, or even blog-type material) are also dynamic.

Static content is content that remains the same, for all users, apart from occasional updates (such as corrections or new material).

An article is a notional quantity of content, equivalent to an A4 page.

Content size is a reference to the amount of textual content in a speech-enabled system. This is not a precise measure. Examples are as follows:

• A web application with hundreds of individual articles would be considered small.
• A web application with thousands or tens of thousands of individual articles would be considered medium.
• A web application with hundreds of thousands of individual articles or more would be considered large.

A cache server is a simple web server which acts as a file store for audio files. It does not require any special software or any royalty-bearing software – it could be a Linux server if required.




2.2.2 Texthelp-Hosted Speech Server 

This configuration is ideal for low usage scenarios, where a customer wants to add speech to a relatively lightly-used system, with a small amount of content (either static or dynamic).

It is also useful for a prototype implementation of the speech server, before stepping up to a more scalable final implementation.

• All speech server resources are provided by Texthelp.
• End-user software (such as the SpeechStream Toolbar) is included in the customer web pages.
• The speech server has an integrated cache to improve performance.
• There is no additional cache for audio files.

1. User accesses customer website.
2. Webpage is rendered by server and displayed to user in web browser.
3. User invokes speech via UI on website.
4. Texthelp software on webpage communicates with remote SpeechStream server.
5. Texthelp software on webpage highlights text and plays the audio to user.

[Figure: Workflow for Texthelp-hosted speech server. The Customer Web Server is at the customer site; the SpeechStream Server is at Texthelp.]

Advantages:

• Simple integration for customer
• Ideal for lower volumes of usage
• No requirement for customer to host servers on-site
• No specialist technical resources are required to manage the servers

2.2.3 Texthelp-hosted Speech Server With External Cache 

This configuration is intended for medium to high usage scenarios, with a medium volume of content that is mainly static.

• A speech server is provided by Texthelp.
• A cache server is provided by the customer.
• Texthelp end-user software (such as the SpeechStream Toolbar) is included in the customer web pages.
    o This software checks the cache server for each audio request. If the required audio is not in the cache, the software will communicate with the remote speech server.
    o The Texthelp speech server will then stream the audio to the end user. It will also transfer the audio files to the customer cache server for subsequent speech requests.


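1. User accesses customer website.
2. Webpage is rendered by server and displayed to user in web browser.
3. User invokes speech via UI on website.
4. Texthelp software on webpage looks for audio on cache server.
5. If audio is in cache, access it directly and play back to user with color highlighting.
6. If the audio is not cached, Texthelp software requests audio from remote SpeechStream server.
7. SpeechStream server streams the audio to the end user.
8. After the audio is generated for live playback, it will be transmitted to the cache server for repeat requests.

[Figure: Workflow for Texthelp-hosted speech server with external cache. The cache server is at the customer site; the SpeechStream Server is at Texthelp.]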
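
The cache-first lookup described in this section can be sketched in a few lines of Python. This is a simplified model, not the Texthelp software: the dictionary standing in for the cache server, the `fake_synthesize` stand-in for the remote speech server, and the `get_audio` function are all hypothetical, and in the real system the speech server (not the client) transfers new audio to the cache.

```python
def get_audio(text, cache, synthesize):
    """Return (audio, source), trying the cache before the speech server."""
    audio = cache.get(text)
    if audio is not None:
        return audio, "cache"   # cache hit: no speech generation needed
    audio = synthesize(text)    # cache miss: ask the speech server
    cache[text] = audio         # keep the result for repeat requests
    return audio, "server"


def fake_synthesize(text):
    """Hypothetical stand-in for the remote speech server."""
    return b"MP3:" + text.encode("utf-8")


cache_store = {}  # stands in for the customer's cache server
first = get_audio("Hello world.", cache_store, fake_synthesize)
second = get_audio("Hello world.", cache_store, fake_synthesize)
```

The first request is served by the speech server and populates the cache; the second request for the same text is served from the cache.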



Advantages:

• Customer site only requires a simple web server to act as a cache.
• This gives the advantage of fast access to pre-cached content for the majority of speech requests, without the need to manage a more complex speech server and pay royalties for Windows-based software.

2.2.4 Customer-Hosted Speech Server 

This configuration is intended for high usage scenarios, with a high volume of content. Overall implementation is similar to the Texthelp-hosted speech server with external cache described previously, except that all the software and hardware is managed by the customer (with assistance from Texthelp).

• SpeechStream Server software provided by Texthelp is installed on a customer server.
• Optionally, a cache server is provided by the customer.
• End-user software (such as the SpeechStream Toolbar) is included in the customer web pages.
    o This can access a cache server if required.
    o The speech server is located at the customer site.
    o A cache server can be updated across the network immediately rather than using FTP from a remote Texthelp server.


1. User accesses customer website.
2. Webpage is rendered by server and displayed to user in web browser.
3. User invokes speech via UI on website.
4. Texthelp software on webpage looks for audio on cache server (optional).
5. If audio is in cache, access it directly and play back to user with color highlighting.
6. If the audio is not cached, Texthelp software requests audio from customer’s SpeechStream server.
7. SpeechStream server streams the audio to the end user.
8. After the audio is generated for live playback, it will be transmitted to the cache server for repeat requests.

[Figure: Workflow for Customer-hosted speech server with optional cache server. The SpeechStream Server (hosted by Customer) and the optional cache server are both at the customer site.]

Advantages:

• Maximum performance for customer – dedicated speech server
• Optional cache server can be used to maximize performance in large deployments.

2.3 SpeechStream Server Specification and Performance 

SpeechStream server performance depends on two main variables:

• Physical specification of the server that the Texthelp Speech Server is installed on
• Performance of the specific text-to-speech engine being used

2.3.1 System Requirements

The speech server must be installed on a 32-bit Windows server. Texthelp currently recommends Windows Server 2003. Both dedicated servers and cloud-based servers are supported.

2.3.2 Text To Speech Performance 

Performance characteristics of Text To Speech can differ between vendors and even between different voices from a single vendor. Support for multi-threading and multi-core processors can vary. Texthelp can recommend the best voice for your implementation.

Using a standard Nuance voice, Scansoft Jill (American English Female), a server as described above will handle up to two million speech requests for average-length sentences in a 24-hour period.
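
To put the daily figure in perspective, it can be converted to an average rate per second. The short Python calculation below assumes the load is spread evenly across the day, which real traffic rarely is, so it gives an average rather than a peak capacity; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_requests_per_second(requests_per_day):
    """Average sustained rate implied by a daily request total."""
    return requests_per_day / SECONDS_PER_DAY


rate = average_requests_per_second(2_000_000)  # roughly 23 requests per second
```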

Some speech engines may not equal this level of performance. Normally, this can be mitigated through use of one of the caching solutions outlined previously, where end users will only access the pre-cached audio rather than requiring live speech generation.

2.4 Cache Server Specification and Performance 

For scenarios where a cache system is being configured, a second server is necessary. 

A live speech server is responsible for the generation of audio data and conversion to MP3 format for playback by the end-user software. Speech generation and MP3 conversion are both very expensive in terms of computer resources; in contrast, the cache is just a file store, and does not require the same level of heavyweight processing power as the live speech server.

System Requirements

• A server running a web server (Apache is recommended; any operating system can be used)
• FTP access
• Disk space, with the amount depending on the website content

Typical figures for disk space requirements suggest:

• A typical sentence of text returns 30KB of data (this is one speech request).
• A typical page of content contains around 100 sentences, requiring around 3MB.
• This can then be multiplied by the number of pages of content that are speech-enabled.
• The resulting value indicates the current minimum disk space required. Room for growth should be considered, as should any space required for the operating system and web server.

This does not consider the requirements of additional playback speeds or additional voices, each of which adds its own audio files to the cache. Storage scales linearly: if one sentence of text requires 30KB, then two will require 60KB, three will require 90KB, etc.

Actual values also depend on the specific voice being used and the complexity of the text content.
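
The disk space estimate above can be written as a small calculation. The Python sketch below uses the typical figures quoted in this section (30KB per sentence, 100 sentences per page); the function name is illustrative, and the multiplication by voices and playback speeds is an assumption that each combination is cached as a separate set of audio files.

```python
def cache_disk_mb(pages, voices=1, speeds=1,
                  kb_per_sentence=30, sentences_per_page=100):
    """Estimate the minimum cache disk space in MB (1MB taken as 1000KB).

    Excludes room for growth and space for the operating system and
    web server, which should be added on top.
    """
    total_kb = pages * sentences_per_page * kb_per_sentence * voices * speeds
    return total_kb / 1000


one_page = cache_disk_mb(1)                     # 3.0 MB, matching the figure above
medium_site = cache_disk_mb(10_000)             # 30,000 MB, about 30GB
two_voices = cache_disk_mb(10_000, voices=2)    # doubles the estimate
```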

2.4.1 Scalability 

When the cache server capacity is reached, further capacity can be added using load balancing. There are two ways to implement this:

• Via a hardware load balancer, with cache data synchronized between the cache servers.
• The end-user application can direct different groups of users to alternative cache servers.
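
For the second approach, one simple way for an application to direct user groups to cache servers is to hash a group identifier and use the result to pick a server. The Python sketch below illustrates the idea only: the host names, the group identifiers and the `cache_server_for` function are hypothetical and not part of any Texthelp product.

```python
import zlib

# Hypothetical cache server host names.
CACHE_SERVERS = [
    "cache1.example.com",
    "cache2.example.com",
    "cache3.example.com",
]


def cache_server_for(group_id, servers=CACHE_SERVERS):
    """Map a user group to one cache server, always the same one."""
    index = zlib.crc32(group_id.encode("utf-8")) % len(servers)
    return servers[index]


server_a = cache_server_for("school-district-17")
server_b = cache_server_for("school-district-17")  # same group, same server
```

Because the mapping depends only on the group identifier, each group keeps hitting the same cache server, so that server's cache fills with the content that group actually uses. Adding or removing a server changes the mapping for many groups, which is acceptable for a cache since missing audio is simply regenerated.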




3. End-user software 

In addition to the speech server itself, Texthelp also provides software to enable customer applications to offer speech easily.

3.1 SpeechStream Toolbar (HTML) 

For HTML-based applications, the SpeechStream Toolbar offers a simple method to add speech support to your application. This toolbar is provided as JavaScript that is easily added to any webpage.

The implementation offers:

• Speech support toolbar, consisting of:
    o Speak text that the user clicks with the mouse
    o Speak text selections
    o English to Spanish single word translation (other languages available on request)
    o Fact Finder (look up selected text on a specific search engine)
    o Dictionary to provide definitions for English words from a 100,000 word dictionary (customizable on request)
    o Four color highlight options to annotate text
    o Clear highlights/collect highlights option
• Buttons can be hidden if required
• Color highlights can be persisted on a server
• Voice speed can be adjusted by the user

The toolbar is highly customizable. You can:

• Hide or show buttons using JavaScript
• Hide the toolbar completely and call the functionality from JavaScript (useful if you want to design a custom UI for speech, or create a UI that closely matches your own)
• Dock the toolbar at a static location on the page
• Customize the toolbar appearance (colors and graphics) if required
• Use a speech bubble mode for minimal user interface implementations




Please note: The SpeechStream Toolbar can only read HTML text content and alt text on images. It cannot read embedded Flash objects, PDF documents, ActiveX objects, Java objects or any non-text content.

Other features of the SpeechStream Toolbar and Server combination are:

• Your application can permit the user to change the voice if required. Otherwise, the application can use a pre-determined voice configuration:
    o Voice gender can be changed (a variety of male and female voices are available)
    o Voice speed can be changed (some readers prefer a slower speed to aid comprehension)
    o The language can be changed (Spanish, French and other non-English languages are available)
• Pronunciation can be fine-tuned in cases where uncommon words are incorrectly pronounced by the text-to-speech engine.
    o Examples of this include scientific terms, names or abbreviations.

3.1.1 Web Browser Compatibility 

The SpeechStream Toolbar will work on the following operating system and browser combinations. Adobe Flash 8, 9 or 10 is required in all cases.

• Windows:
    o Internet Explorer
    o Firefox
    o Google Chrome
• Apple Macintosh:
    o Firefox
    o Safari
    o Google Chrome

Support for newer versions of these major browsers will be added as soon as possible.

Please contact your Texthelp representative if you require further clarification of the browser support policy.




3.2 Flash 

SpeechStream speech servers can also be accessed from Flash applications.

Due to the nature of Flash applications, it is not possible to provide a generic solution for speech with dual-colored word highlighting. Unlike HTML, Flash applications do not have a standard DOM (Document Object Model) that can be used in a generic speech solution.

Implementation of the user interface, text display and interaction with the user is therefore the responsibility of the Customer’s software developers.

Texthelp can provide support for speech-enabling text boxes in both ActionScript 2 (AS2) and ActionScript 3 (AS3). Direct access to the speech server is also possible, enabling the Customer to provide as much or as little speech as required.

Contact your Texthelp representative for further details.

3.3 Custom access 

Some applications do not suit either the HTML-based SpeechStream Toolbar or the Flash approach. An example of this would be an application developed in Java.

For these applications, Texthelp can supply direct access to SpeechStream servers to obtain speech directly. Playback of the audio and user control is entirely the responsibility of the Customer’s application.


