English Sound-Symbol Correspondences

Traditional English spelling is complicated.  Many defenders of traditional spelling seem unaware of the full extent of its complexity.  With the help of the FEWL dictionary and a good bit of programming, I've put together some tables showing just how complicated the relationship between spelling and pronunciation is.  These tables are based on the 30,000-word FEWL dictionary; using a larger word list would of course change the results (and not by making them simpler).

There are four tables, described here.  I recommend you read the detailed explanations below before following the links.

  1. A table of phonograms and their pronunciations, sorted by spelling.

  2. A table of English sounds and their spellings, sorted by pronunciation.

  3. A table of phonograms and their pronunciations, sorted by unweighted frequency, that is, the frequency of their occurrence in the set of words in the FEWL word list.

  4. A table of phonograms and their pronunciations, sorted by weighted frequency, that is, the estimated frequency of their occurrence in English text.

Because a number of compromises and shortcuts have been made in building these tables, they should be regarded as quite approximate.

The pronunciations represented in the table are based on the consensus of three major dictionaries, as discussed on the FEWL page.  (The alternate pronunciations for the FEWL signature words (listed here) were not used.)  The word frequencies were taken from a word frequency list based on the British National Corpus, as documented on this site.  This list has some flaws, such as not including frequencies for contractions, but I haven't yet found anything better.  Especially annoying is the fact that the pronunciations are American, while the word frequencies are based on British sources.  Even with these caveats, I believe these tables to be accurate enough in their broad outlines.

Each table has five columns, arranged differently from one table to the next.  The column headed "Trad Spelling" contains a phonogram from traditional English spelling, or the symbol ~ for the occasional cases where a pronounced sound is omitted from a word's spelling.  Some phonograms end with the notation _e, which indicates that the phonogram is followed by a single consonant (except when _e follows the letter r, in which case the r is the consonant) and a silent e.  The column headed "FLOSS (Phonemic)" contains a representation of a sound in the FLOSS spelling system.  There are a few extensions to FLOSS for this purpose, described in the next paragraph.  This column also uses the symbol ~, to indicate a phonogram with no corresponding sound, that is, one or more silent letters.  The column headed "Unweighted frequency" contains the number of times the phonogram/sound pair occurs in the FEWL word list.  The column headed "Weighted frequency" contains a scaled indication of the frequency of the pair in the BNC frequency list.  The figure listed is the number of occurrences divided by 16 (which makes the weighted and unweighted frequencies somewhat comparable).  Pairs whose frequencies display as - do not occur in the BNC list.  The column headed "Example" contains one or more sample words for the phonogram and sound, selected at random by the programs which generated the tables.  Because the words were selected randomly, some of them may be unfamiliar or afflicted with multiple pronunciations.  This is regrettable, but a more hands-on selection of example words didn't seem practical.

The FLOSS spelling in the tables is augmented in the following ways.

  1. FLOSS uses the symbol $ to indicate the plural ending, regardless of its pronunciation.  The tables use the spelling $ when pronounced /z/, and the spelling ß when pronounced /s/.

  2. FLOSS uses the symbol þ to indicate the past tense ending, regardless of its pronunciation.  The tables use the spelling þ when pronounced /t/, and the spelling ð when pronounced /d/.

  3. The notation =#, where # represents any letter, indicates an initial letter, pronounced as the letter name, as in the words T-shirt and Xmas.
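
The first two augmentations amount to a small lookup from a FLOSS ending and its actual pronunciation to the symbol shown in the tables.  A minimal Python sketch, with the hypothetical names TABLE_SYMBOL and table_symbol, might look like this.  Only the four cases named above are covered; passing any other combination through unchanged is an assumption of the sketch, not something the tables are known to do.

```python
# Table symbols for the two inflectional endings, keyed by
# (FLOSS symbol, pronounced sound). Only the four cases described in the
# text are listed here.
TABLE_SYMBOL = {
    ("$", "z"): "$",  # plural ending pronounced /z/
    ("$", "s"): "ß",  # plural ending pronounced /s/
    ("þ", "t"): "þ",  # past tense ending pronounced /t/
    ("þ", "d"): "ð",  # past tense ending pronounced /d/
}


def table_symbol(floss_symbol, sound):
    """Return the augmented symbol; unknown combinations pass through (assumed)."""
    return TABLE_SYMBOL.get((floss_symbol, sound), floss_symbol)


print(table_symbol("$", "s"))
print(table_symbol("þ", "d"))
```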

Note that FLOSS makes some non-phonemic distinctions, based on stress, morphemics and the corresponding British (RP) pronunciation.  I believe this makes it more, not less, useful in this context.  A previous version of these tables also treated the vowel sound of new differently from that of crew, based on the distinction in British pronunciation.  I've concluded that making this distinction was more confusing than helpful, and so it has been removed.

Certain very common words (a, an, and, for, of, to, into) were assumed to have their weak (unstressed) pronunciation.  The word the is represented in both its strong and weak forms.  Because these words are so very common, the effect is to substantially increase the frequency of schwa spellings compared to what would result if the strong forms had been assumed.

The word frequency data from which these tables were compiled can be downloaded here.  The original BNC data has been rearranged and manipulated to a certain extent, with presumably some loss in accuracy.

