The CAAPR Pronunciation Reference

Alan Beale
15 August 2019

This page describes CAAPR, the Combined Anglo-American Pronunciation Reference. CAAPR is a pronouncing dictionary for both British English (RP) and American English (GA). It is written in a compact and easy-to-read notation which systematizes the differences between these two varieties of English, while still handling exceptions gracefully and accurately. The CAAPR notation is itself called CAAPR, this time standing for Combined Anglo-American Pronunciation Representation. It is in many ways a generalization of my FLOSS notation. To some extent, the CAAPR list resembles the FEWL list, but it avoids some of the complexities of that list by concerning itself almost exclusively with phonological rather than morphological information.

One may reasonably ask the question: What is CAAPR good for? It is possible that, like FEWL, CAAPR may in time come to be useful for computer generation of dictionaries for reformed English orthographies. At this time, however, this seems premature, especially since most reformers are unwilling to take on the labor of trying to please both sides of the Atlantic at once. I see CAAPR mostly as a useful tool for self-education. In compiling it, I have learned a lot about the systematic differences between the two English varieties, and I recommend it to anyone else who feels the need for greater insight in this area.

CAAPR is based on two primary sources, the FEWL list for American English and the online EPD dictionary for British English. This latter dictionary is of very high quality, and I wish I knew who collected it so I could offer them my extravagant thanks. During the development of CAAPR, I transformed, rearranged and occasionally corrected the EPD, but I find it remarkable how few corrections were needed. Whatever merits CAAPR may possess are in large measure derived from the labor of the unknown contributors to the EPD.

CAAPR actually consists of three word lists: CAAPR-A, CAAPR-B and CAAPR-C. CAAPR-A(merican) is a list of words with their GA pronunciations, derived via reformatting from the FEWL list. CAAPR-B(ritish) is a somewhat different list of words with their RP pronunciations, derived via subsetting and reformatting from the online EPD dictionary (plus some additional common words unaccountably omitted from that document). CAAPR-C(ombined) is a combined list, comprising the words which are in both CAAPR-A and CAAPR-B, showing both pronunciations together in a single notation. As will be described here, the CAAPR notation is slightly different for each document, in ways that are unlikely to give the user any difficulty. The CAAPR-A and CAAPR-B lists each contain approximately 30,000 words. The CAAPR-C list contains the approximately 28,000 words common to both the A and B lists. (All of these lists now include a relatively small number of additional words from other sources.)

The CAAPR lists show a single (RP or GA) pronunciation for each word on the list. Of course, the world is much more complicated than this, because many words have multiple acceptable pronunciations, with conflicting data about which is preferable or most prevalent. Both lists use the technique of lexicographic consensus to resolve such questions. Note that differences between the GA and RP pronunciations listed in CAAPR may not reflect actual Anglo-American differences. For instance, both forms may represent pronunciations which are common in both varieties, distinguished more by happenstance than by geography.

The pronunciations in CAAPR-A were determined by consensus of the following dictionaries: The Longman Pronunciation Dictionary, the Merriam-Webster Collegiate Dictionary CD-ROM, and the American Heritage Dictionary CD-ROM. In difficult cases, the Random House Unabridged Dictionary CD-ROM was also consulted. Similarly, the pronunciations of CAAPR-B were determined by consensus of the online EPD, the Longman Pronunciation Dictionary and the Shorter OED CD-ROM. In difficult cases, the Cambridge Pronunciation Dictionary was also consulted. The Longman Pronunciation Dictionary is the most precise of all these sources, and was often used to resolve issues which could not be easily settled using the other, less technical sources.

Each list has a similar format. A typical entry from the CAAPR-A and CAAPR-B lists looks like this:

wordplay : w&'dpLE·

The entry is divided by the delimiting string " : " into the traditional spelling and the CAAPR representation of a word. In some cases, like the following:

ill-use (n) : i'LyU's
ill-use (v) : i'LyU'z

different forms of the same word may be distinguished by a qualifying word or phrase in parentheses.

The format of the CAAPR-C list is similar. Here are a few example lines:

combat (n) : kombat
combat (v) : k[ø|o]mbat *

Some entries may be followed by an asterisk, which indicates that the American and British stress patterns for the word are different. The programming which determines this is still under development, and the absence of the asterisk cannot be trusted to mean that the two pronunciations are in fact similar in their stress. (Note that the CAAPR combined notation does not directly indicate stress, due to technical difficulties.)
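
The entry layout described above can be read mechanically. The following is a minimal sketch in Python, assuming the lists use exactly the layout shown on this page (the delimiter " : ", an optional qualifier in parentheses after the spelling, and an optional asterisk separated from the transcription by a space). The names Entry and parse_entry are illustrative only and are not part of CAAPR; the downloadable files may be laid out slightly differently.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    spelling: str             # traditional spelling, e.g. "ill-use"
    qualifier: Optional[str]  # text in parentheses, e.g. "n", or None
    caapr: str                # the CAAPR representation
    stress_differs: bool      # True if the entry ends in " *" (CAAPR-C only)


# Spelling, optionally followed by whitespace and "(qualifier)".
_HEAD = re.compile(r"^(?P<spelling>.*?)(?:\s+\((?P<qualifier>[^()]*)\))?$")


def parse_entry(line: str) -> Entry:
    """Split one list line into its parts (hypothetical helper)."""
    head, sep, tail = line.strip().partition(" : ")
    if not sep:
        raise ValueError("missing ' : ' delimiter: %r" % line)
    tail = tail.strip()
    # '*' is also a CAAPR symbol, so only a space-separated trailing
    # asterisk is treated as the stress-difference marker.
    stress_differs = tail.endswith(" *")
    if stress_differs:
        tail = tail[:-2].rstrip()
    match = _HEAD.match(head.strip())
    return Entry(match.group("spelling"), match.group("qualifier"),
                 tail, stress_differs)


print(parse_entry("combat (v) : k[ø|o]mbat *"))
```

Requiring a space before the trailing asterisk is a deliberate choice in this sketch, since '*' also appears inside transcriptions as an optional-sound symbol.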

The three CAAPR lists can be downloaded using the following links:

CAAPR-A - the American (GA) list
CAAPR-B - the British (RP) list
CAAPR-C - the combined list
CAAPR-all - all three lists in a single .zip file

All the above files include a copy of this page for easy reference.

An Example of CAAPR

CAAPR is not intended to be used for transcription of continuous text. Nevertheless, it is useful at this point to give you an idea of the overall appearance of CAAPR, and I know no better way to do this than to transcribe a short bit of prose. Here is the first paragraph of H.G. Wells' "The Star" (see here for a plain English version, among others) written in CAAPR-C. It may seem cryptic at first glance, but as one uses CAAPR, one rapidly becomes accustomed to its conventions, and soon such passages present little mystery.

it w[u/o]z on Dø f&ßst dE øv Dø n!U yïr Dat Dý ønWnsm°nt w[u/o]z mEd, ØLmOst s[Y/i]m°LtEnÿøsLý fr[u<o]m TrI øbz&ßvøtòrý$, Dat Dø mOX°n øv Dø pLanît nept!Un, Dý WtRmOst øv ØL Dø pLanît$ Dat µIL øbWt Dø sun, had bikum verý iratik. ø ritAßdEX°n in its v3Losêtý had b[i\I]n søspektîd in disembR. Den, ø fEnt, rimOt spek øv LYt w[u/o]z diskuvRþ in Dø rIj°n øv Dø pRt&ßbþ pLanît. at f&ßst Dis did not kØz ený verý grEt iksYtm°nt. sYøntifik pIp°L, hWevR, fWnd Dý inteLîj°ns rimAßkøb°L inuf, Iv°n bifØß it bikEm nOn Dat Dø n!U bodý w[u/o]z rapîdLý grOiG LAßjR and brYtR, and Dat its mOX°n w[u/o]z kwYt dif°r°nt fr[u<o]m Dý ØßdRLý pr[o|O]gr[ø<e]s øv Dø pLanît$.

Notations on this page

The rest of this page uses certain notations to add precision to the discussion. English words, used as examples, are enclosed in angle brackets, like <this>. CAAPR representations of words are enclosed in double brackets, like «Di's». Individual CAAPR symbols or symbol sequences are enclosed in apostrophes, like 'sO'. Sampa phonemic transcriptions are enclosed in slashes, like /soU/. Individual letters or short sequences from traditional spelling are generally written without any punctuation, as in "the letter t" or "the sequence ng".

The CAAPR Notation

As used in the CAAPR-A and CAAPR-B lists, the CAAPR notation is mostly phonemic, with certain non-phonemic notations added. Because the sound repertoires for GA and RP are different, the same symbol will sometimes have a distinct (but related) meaning for the two varieties. For both varieties of English, not all speakers have exactly the same phonemes. CAAPR-A and CAAPR-B target idealized speakers of GA and RP respectively. The GA pronunciations are based on an idealized American who distinguishes <which> and <witch>, <marry> and <merry>, and <cot> and <caught>, and for whom the two vowels of <above> are distinct, as are the two vowels of <murder>. Similarly, the RP pronunciations are based on an idealized Briton who distinguishes <candid> and <candied>, and for whom the two vowels of <murder> are distinct. Speakers with fewer phonemes than the ideal can merge symbols as necessary to represent their own speech.

Note that CAAPR is not suitable for use as a spelling system. Quite apart from its complexity, it often requires distinct spellings for a single sound, which cannot be resolved by referring to the speech of any particular speaker. From the perspective of a learner rather than of a linguist, the distinctions would seem quite arbitrary.

CAAPR uses a large repertoire of symbols, including upper and lower case alphabetic characters, punctuation, and letters with diacritics. The symbols are organized into groups so that the members of each group are somewhat similar, making it easier to master the entire system. As with any complex system, there are occasional exceptions to this organization, as described below.

The symbol groups and their significance are as follows:

The lower-case alphabetic letters. Each symbol in this group is assigned its natural English phonemic meaning. All the vowels are short. (Note that some letters, notably c, q and x, are omitted.)
Upper-case alphabetic letters, plus a few special symbols and punctuation characters. The symbols in this group are assigned meanings that are usually related in some fashion to the corresponding letter (or, in the case of symbols and punctuation, a letter they resemble in shape). Most of the English long vowels, diphthongs, and less common short vowels fall into this group.
Letters with a dieresis (such as ë and Ü). These generally indicate vowels or diphthongs that occur primarily before the letter r. There is usually a resemblance in sound to the unaccented letter.
Letters with a circumflex (such as ê and û). These generally indicate sounds which are different between American and British English, except for ê and î, which indicate indistinct sounds within both American and British English. There is usually a resemblance in sound to the unaccented letter.
Letters with a grave accent (such as è and ò). These indicate sounds which not only differ between American and British English but are also differently stressed. When stressed, there is generally a resemblance in sound to the unaccented letter.
The letters ý and Ý. These ought to be written as ŷ and Ŷ, but as these are not in the standard Latin-1 character set, the acute accented y is used instead.
The special characters ', ·, $ and þ. The first two characters are stress marks, and the latter two serve the purpose of identifying plural and past tense inflections.
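
As a rough illustration of the grouping just listed, the sketch below assigns a single CAAPR symbol to one of the groups by inspecting its Unicode decomposition. The function name symbol_group and the group labels are my own inventions for this example. It applies the grouping literally, so the exceptions to the organization are not modeled (for instance, 'ÿ' is simply reported as a dieresis letter), and symbols with other diacritics, such as 'ã', fall into a catch-all label.

```python
import unicodedata

# Combining marks that define three of the groups.
DIACRITIC_GROUPS = {
    "\u0308": "dieresis",
    "\u0302": "circumflex",
    "\u0300": "grave",
}


def symbol_group(symbol: str) -> str:
    """Return a label for the group a single CAAPR symbol belongs to."""
    if len(symbol) != 1:
        raise ValueError("expected exactly one character")
    if symbol in "'·$þ":
        return "stress or inflection mark"
    if symbol in "ýÝ":
        return "y-acute"
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", symbol)
    for mark in decomposed[1:]:
        if mark in DIACRITIC_GROUPS:
            return DIACRITIC_GROUPS[mark]
    if len(decomposed) > 1:
        return "other diacritic"
    if symbol.isascii() and symbol.islower():
        # Note: c, q and x would land here too, although CAAPR omits them.
        return "lower-case letter"
    return "upper-case letter or special symbol"


for ch in "aEëêòý'Ø":
    print(ch, "->", symbol_group(ch))
```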

CAAPR Phonemic Symbols

The following table shows the phonemic symbols of the CAAPR notation. In some cases, as noted, a symbol is phonemic only for one of the English varieties. In such cases, the symbol may be used for the other variety with a similar but non-phonemic meaning.

Symbol	Sampa	Example	Applies to	Notes
a	{	ka't (cat)	Both
ã	A~	elã' (elan)	Both	(1)
A	A:	fA'Døß (father)	Both	(2)
b	b	bE'bý (baby)	Both
C	tS	Ce'LO (cello)	Both	(3)
d	d	de'd (dead)	Both	(4)
D	D	Da't (that)	Both
e	E, e	e'g (egg)	Both
ë	e@	bë'ß (bear)	Brit	(5)
E	eI	ka'nøpE (canape)	Both	(6)
f	f	fY'f (fife)	Both
g	g	ga'g (gag)	Both
G	N	si'GiG (singing)	Both	(7)
h	h	hO'm (home)	Both
H	~	u'HuH (uh-uh)	Both	(8)
i	I	bi'g (big)	Both	(9)
ï	I@	pï'ßs (pierce)	Brit	(10)
I	i:	møXI'n (machine)	Both	(11)
j	dZ	ju'j (judge)	Both
J	Z	vi'J°n (vision)	Both	(12)
k	k	ki'k (kick)	Both
K	x	Lo'K (loch)	Both	(1)
L	l	Li'Lý (lily)	Both	(13)
m	m	me'mbøß (member)	Both
n	n	nu'n (none)	Both
o	Q	to'p (top)	Brit	(14)
õ	o~	kõ'nsýëßJ (concierge)	Both	(1)
ø	@	sO'fø (sofa)	Both	(15)
O	oU, @U	rO'd (road)	Both
Ø	O:	pØ'z (pause)	Both
p	p	po'p (pop)	Both
Q	OI	kQ'n (coin)	Both	(16)
r	r	rØ'riG (roaring)	Both	(17)
&	3`, 3	rif&'r°l (referral)	Both	(18)
s	s	sØ's (sauce)	Both	(19)
t	t	ti'Lt (tilt)	Both	(4)
T	T	Ti'k (thick)	Both
u	V	fu'z (fuzz)	Both	(20)
U	u:, u	sU'p (soup)	Both	(21)
Ü	U@	øbskyÜ'ß (obscure)	Brit
v	v	va'lv (valve)	Both
V	U, u	gV'd (good)	Both	(20), (21)
w	w	wE'wøßd (wayward)	Both	(22)
µ	hw, W	µi'C (which)	Amer	(23)
W	aU	frW'n (frown)	Both
X	S	no'kXøs (noxious)	Both	(24)
y	j	yu'mý (yummy)	Both
Y	aI	Y's (ice)	Both
z	z	zi'gzag (zigzag)	Both	(19)

CAAPR Ortho-phonemic symbols

Both of the English varieties targeted by CAAPR have some unique sounds whose use is generally predictable from the spelling of the words which contain them. For instance, the sound designated by CAAPR 'µ' does not occur in British English, but one can predict, in almost all cases, that a word pronounced with a /w/ in British English, but spelled with wh, will be pronounced as 'µ' in American English (at least by those Americans who use that sound). I call the symbols with this property ortho-phonemic, as they have phonemic significance in one variety, but orthographic significance in the other variety.

The use of the ortho-phonemic symbols in CAAPR brings the American and British spellings closer to one another, in a way that makes sense even for speakers not familiar with the other variety.

CAAPR Special symbols

CAAPR uses a number of additional non-phonemic symbols for various purposes. These symbols are listed in the table below, and explained in the following notes.

CAAPR-C - Putting it all together

CAAPR-C is the combined CAAPR notation, which attempts to merge the American and British spelling for each word, producing a reasonable composite. The process works as follows.

First, stress marks, which are not used in CAAPR-C, are dropped. Next, if the CAAPR-A transcription uses the '&' symbol, it is replaced by '&r'. Then, if the remaining transcriptions are identical (as for the word <soggy> - «sogý»), this is the CAAPR-C representation. If the revised transcriptions are not identical, then corresponding characters which are different are collected into a bracketed pair, first the American version, and then the British one. For instance, consider the word <forecast>. The American «fØrkast» and the British «fØßkAst» are combined into «fØ[r,ß]k[a,A]st». This may possibly be the end of it, but usually it is not. In many cases, this combined transcription will contain pairs which are common enough that there are rules for replacing them with a single letter. For <forecast>, we have two pairs, [r,ß] and [a,A]. Almost always, a British 'ß' will be paired with an American 'r', and the symbol 'ß' will be used as the combined representation. This reduces <forecast> to the string «fØßk[a,A]st». But the combination [a,A] is also very frequent, occurring in words like <bath>, <class>, <shaft>, etc. For this reason, the combination is given the representation 'â' in CAAPR-C. So the final CAAPR-C version of <forecast> is «fØßkâst».
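
The steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program actually used to build CAAPR-C: it assumes the two transcriptions have the same length once prepared, it pairs differing characters one position at a time, and its SYNTHETIC table contains only a handful of the context-free pairs from the synthetic symbol table. All function names are hypothetical.

```python
STRESS_MARKS = "'·"

# A few (American, British) pairs and their synthetic symbols.
# Context-dependent rules (such as '!' or 'V' before a vowel) are omitted.
SYNTHETIC = {
    ("r", "ß"): "ß",
    ("a", "A"): "â",
    ("Ø", "o"): "ô",
    ("&", "u"): "ü",
    ("ø", "V"): "û",
}


def strip_stress(transcription: str) -> str:
    """Drop the primary and secondary stress marks."""
    return "".join(ch for ch in transcription if ch not in STRESS_MARKS)


def prepare_american(transcription: str) -> str:
    """Drop stress marks, then replace '&' by '&r' as the text describes."""
    return strip_stress(transcription).replace("&", "&r")


def combine(american: str, british: str, reduce: bool = True) -> str:
    """Merge a CAAPR-A and a CAAPR-B transcription into one string."""
    am = prepare_american(american)
    br = strip_stress(british)
    if len(am) != len(br):
        raise ValueError("this sketch only handles equal-length transcriptions")
    out = []
    for a, b in zip(am, br):
        if a == b:
            out.append(a)
        elif reduce and (a, b) in SYNTHETIC:
            out.append(SYNTHETIC[(a, b)])
        else:
            out.append("[%s,%s]" % (a, b))  # American first, then British
    return "".join(out)


print(combine("fØrkast", "fØßkAst", reduce=False))  # bracketed pairs only
print(combine("fØrkast", "fØßkAst"))                # pairs reduced
```

With reduce=False the sketch stops at the intermediate bracketed form; with the default it applies whatever reductions its small table knows about and leaves any other pair in brackets.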

If there are any bracketed pairs that cannot be reduced to a single symbol in this fashion, CAAPR-C allows the comma between the symbols of the pair to be replaced by a character indicating whether one of the pronunciations may be more generally recognizable than the other. This process and the additional symbols it uses are described in a later section.

Stress information is dropped from CAAPR-C because of the incompatible systems used in CAAPR-A and CAAPR-B. However, the process of determining the composite CAAPR-C representation will usually notice if the stress has changed in a significant way; these words are marked in the list with an asterisk. About 1 in every 40 words is marked like this.

This process introduces a new class of CAAPR symbols, which I call "synthetic" symbols, as they represent a synthesis of an American and a British pronunciation. Some of the symbols (such as 'ß') are extended in meaning in a natural way, while others, like 'â', are new symbols introduced explicitly to represent a common pair.

CAAPR Synthetic Symbols

The following table defines the CAAPR-C synthetic symbols in terms of the corresponding pairs of symbols they replace:

About 1 in 14 words in the CAAPR-C list contain symbol pairs that cannot be reduced to synthetic symbols. Without the use of the synthetic symbols, the percentage of differences would be very much higher.

CAAPR-C Embellished Pair Notation (Dominance and Equivalence)

Most English words have a CAAPR-C representation without any symbol pairs, meaning that their British and American pronunciations differ only in the typical ways cataloged by the synthetic symbols above. A small number of words, however, have differences whose low frequency makes it impractical to define single symbols for them. If one is seeking the holy grail of an orthography that will have a single workable spelling for all of English, then one naturally asks of such words whether there is additional information that would allow one to choose between the two incompatible pronunciations. The answer, as it happens, is "Maybe".

It may happen that, in one of these words, one or both of the pronunciations has some international recognition. Here are some simple words illustrating the possibilities:

The CAAPR-C list embellishes the representations of words like these by using another symbol in place of the comma within pairs to indicate equivalence or dominance of the two pronunciations, as follows:

Of course, these embellishments can and should be ignored by those uninterested in this additional distributional information, and in general when I cite CAAPR spellings, I use comma separators except in cases where the embellishments are of interest.
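
For readers who want to ignore the embellishments programmatically, here is a small Python sketch that splits a CAAPR-C string into its American and British readings. It relies on the rule that the American symbol comes first in each pair. The set of separator characters is an assumption: it contains the comma plus the characters that appear inside pairs in the examples on this page ('/', '<', '|' and '\'). The function names are hypothetical, and synthetic symbols such as 'ß' or 'â' are left untouched, so the results are not pure CAAPR-A or CAAPR-B transcriptions.

```python
import re

# Assumed separators: comma plus those seen in examples on this page.
SEPARATORS = ",/<|\\"

_PAIR = re.compile(r"\[([^\[\]]*)\]")


def split_pair(body: str) -> tuple:
    """Split the inside of one bracketed pair at its first separator."""
    for i, ch in enumerate(body):
        if ch in SEPARATORS:
            return body[:i].strip(), body[i + 1:].strip()
    raise ValueError("no separator found in pair: %r" % body)


def expand(caapr_c: str) -> tuple:
    """Return (American reading, British reading) of a CAAPR-C string."""
    american = _PAIR.sub(lambda m: split_pair(m.group(1))[0], caapr_c)
    british = _PAIR.sub(lambda m: split_pair(m.group(1))[1], caapr_c)
    return american, british


print(expand("k[ø|o]mbat"))
print(expand("w[u/o]z"))
```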

Changes from version 1

Version 2 of CAAPR differs from version 1 in two important regards. The more important of the two is that the CAAPR-B list has been revised to mark use of the indistinct i and to use the plural and past tense symbols '$' and 'þ'. This makes the A list and the B list equivalent in terms of the amount and style of information presented.

The other change consists of enhancements to the notation itself. Most of the enhancements relate to finer classification of indistinct, unstressed sounds (the symbols '°', '*', '¹', '³', 'ÿ', '3', 'î' and 'R'). Also, the embellished symbol pair notation was introduced, and the two symbols 'à' and 'Ÿ' were dropped, the former because there was no meaningful distinction from 'è', and the latter as a side-effect of the introduction of 'R' («LYR» is a better representation of <liar> than «LŸß»).

It may be questioned whether the increased precision in the marking of indistinct sounds is really a good thing. Does the difference between «E'prøn» and «E'pr°n» really matter? One answer is that some folks think it does, and will argue with you for as long as you want about whether «je'nørøL» or «je'nrøL» is more correct. But I think the real benefit of this degree of precision is an ironic one: it emphasizes how much uncertainty and variance there is in the pronunciation of unstressed sounds. I developed CAAPR in the hopes it would be useful to spelling reformers. One of the problems with many spelling reforms is that they end up reflecting the minutiae of their inventor's dialect, setting forth as certain that one of »rabit« or »rabut« is correct, and the other a blatant error. CAAPR's rather thorough demonstration of the uncertainty of English pronunciation serves as a persuasive argument for abandoning the pure phonetic principle for the spelling of unstressed syllables. I think this lesson is an important one, and that any practical reformed spelling system for world English must take it seriously.

One other enhancement in version 2 concerns spelling coverage. Despite the internationality of the CAAPR-C notation, the version 1 CAAPR-C list included only traditional American spellings. That is, it had an entry for <color>, but not for <colour>. This flaw has been remedied in version 2.

Note that as of March, 2007, I have changed the format of the lists slightly, to make it easier to transfer their content into Microsoft Excel.

Final comments

Version 1 of the CAAPR list included a number of "signature words", whose CAAPR representation deviated from the rules, generally in order to give greater consistency to the spelling of related words. Except as the result of errors, this version contains no such words. CAAPR is not intended as a spelling system, and the inconsistencies of the language itself as well as of the sources of CAAPR should not be obscured. There are good reasons to spell <princess> and <duchess> with the same ending in a practical orthography, but it seems best to leave the data alone, and represent them as «pri'nses» and «du'Cis» in CAAPR-B, which after all is a reference notation and not a spelling system.

Because this version of the CAAPR notation is more complicated than the previous version, I will continue to make the version 1 lists downloadable here. I note that, in addition to its advantage of simplicity, the version 1 CAAPR-B list takes no account of the "indistinct i", which may be of use to those who doubt its existence.

Note: The dictionaries were updated in 2019 by adding a significant number of words, most of them frequently used capitalized words, and by correcting a few errors. I am not calling this update "version 3", as the CAAPR notation itself was not changed.

Though this version of CAAPR has been thoroughly proofread, it is still likely to contain errors and other faults. Thus, you should inform me when you encounter errors, whether isolated or systematic. If you discover ways in which CAAPR could be changed to improve its usefulness, I'd also like to hear of them. Sometimes I suspect that my work on this site is no more than talking to myself in public. If this is not so, and there are ways I can make my forays into dictionary building more generally useful, it would be a shame if no one bothered to tell me.

Table for the section "CAAPR Special symbols":
Symbol	Sampa	Type	Example	Notes
ê	@, I, 1	Indef. sound	ma'gnêt (magnet)	(1)
ý	I, i, i:	Indef. sound	ha'pý (happy)	(2)
ÿ	I, i, j	Indef. sound	prI'vÿøs (previous)	(3)
°	(@)	Optional sound	ma'jik°Lý (magically)	(4)
*	(@)	Optional sound	tY'*L (tile)	(5)
¹	(@), (I), (1)	Optional sound	kri'm¹n°L (criminal)	(6)
þ	d, t	Morpheme	dra'gþ (dragged)	(7)
$	s, z	Morpheme	dru'g$ (drugs)	(8)
'	" (or ')	Stress	øLY'v (alive)	(9)
·	% (or ,)	Stress	do·mênE'X°n (domination)	(9)

Table for the section "CAAPR Synthetic Symbols":
Symbol	Replaces	Example	Notes
â	[a,A]	kLâs (class)
3, ê, î	A mixture of i, ê and ø, unstressed	paL3t (palate), sIkrêt (secret), bLaGkît (blanket)	(1), (2)
è	[e or ë, ø or ° or {no sound}]	sekrêtèrý (secretary)	(3)
¹ or ³	A mixture of i, ê, ¹, ø, ° or {no sound}, unstressed	fert¹LYz (fertilize), kuz³n (cousin)	(1)
!	[,y] before U or V or Ü or ø	d!Utý (duty)	(4)
ô	[Ø,o]	krôs (cross)
ò	[Ø, ø or ° or {no sound}]	mandøtòrý (mandatory)	(3)
° or *	A mixture of ø, ° (or *), or {no sound}	Epr°n (apron)	(1)
R	[°r, øß]	piCR (pitcher)	(5)
ß	[r,ß]	mAßk (mark)	(1)
ü	[&,u]	würý (worry)
û	[ø,V]	regyûLR (regular)
Ü	[V,Ü]	pyÜß (pure)	(1)
V	[U,V] before vowel	v&ßCVøs (virtuous)	(6)
ÿ	A mixture of y, ÿ or ý	yUnÿøn (union)	(1)
Ý	[ê or ø, Y or Y*]	ØßgønÝzEX°n (organization)
