On Poetry

Recently, some friends and I talked about an early website I programmed, probably the very first web service I ever made. It was a simple dictionary.

Back then, we had Nokia phones without real web browsers, just something called the WAP protocol and WML markup. Since I struggled with foreign languages as a student (I still do!), I decided to create a mobile site in WAP/WML to help me translate words, show definitions, thesaurus entries, pronunciations, and more.

I don’t remember all the details of how I built it, but I do recall that I had to process hundreds of megabytes of text quickly enough so that the web hosting company, who generously offered me free space, didn’t evict me.

Now, years later, I thought I’d check whether things have changed in linguistic programming. Of course, there are plenty of tools like NLTK these days, but in this article we’re going to build a poet’s assistant from scratch, without any third-party code or libraries, to see what coding a dictionary looks like.

Thesaurus

Of course, every poet needs to find the most suitable, worthy, acceptable, fitting, proper, right word from time to time. For that, they need a thesaurus. A quick search led me to the Moby Project.

They have an impressive file called mthesaur.txt which is “the largest and most comprehensive thesaurus in English in public domain”.

After downloading 23MB of text, let’s take a peek at its contents:

...
cavern,abri,antre,bunker,burrow,cave,cove,dugout,foxhole,grot,grotto,hole,lair,sewer,subterrane,subterranean,subway,tunnel,warren
...
magenta,amethystine,lavender,lilac,livid,mauve,mulberry,orchid,pansy-purple,plum-colored,plum-purple,purple,purplescent,purplish,purply,purpurate,purpure,purpureal,purpurean,purpureous,raisin-colored,violaceous,violet
...

Is that it? Every line is a list of words, separated by commas? Yes. With that, here’s a quick Python thesaurus tool:

import sys
thesaurus = {w[0].lower(): w[1:] for w in (line.strip().split(',') for line in open('mthesaur.txt'))}
print("\n".join(thesaurus[sys.argv[1].lower()]))

It’s almost a one-liner. The script reads the file line by line, splits each line, and stores the first word as the key and the rest as synonyms in a Python dictionary. Then, it looks up the requested word and prints a list of all the synonyms found:

$ python3 thesaurus.py sunbeam | column
broad day       daylight        dusk            midday sun      shine           sunlight
dawn            dayshine        full sun        noonlight       sun spark       sunshine
day             daytide         green flash     noontide light  sunbreak        twilight
day glow        daytime         light of day    ray of sunshine sunburst

Pretty satisfying. But then, this happens:

$ python3 thesaurus.py zebra
...
KeyError: 'zebra'

Surely “zebra” is a word, and it probably has a few synonyms?

$ grep zebra mthesaur.txt | cut -d ',' -f1 
animal
antelope
...
crossbreed
...
horse
...
pig
platypus
...

I’m no zoologist, but some of these seem like very distant cousins of a zebra. Still, this shows that “zebra” is in the thesaurus.

The issue is that in this database a word “A” may list “B” as a synonym, while “B” does not necessarily point back to “A”. This becomes clear when you realize the database has 30,259 records but contains 103,306 unique words. Most words have 20-40 synonyms, so fully interlinking all synonyms would balloon the file: if we connected every word to the full contents of every record it appears in, we’d end up with a dictionary of nearly 2GB! However, if we only reconnect each word to the first word of any record it appears in, the dictionary grows from 23MB to just 34MB. Not too bad, considering that now it includes “zebra”!
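As a sketch of that second option (the merging details and the mthesaur-full.txt output name are my assumptions, not part of the Moby project), the expansion could look like this:

import collections

heads = collections.defaultdict(set)  # word -> head words of the records that mention it
records = {}                          # head word -> its own synonym list

with open('mthesaur.txt') as f:
    for line in f:
        words = line.strip().split(',')
        records[words[0]] = words[1:]
        for w in words:
            heads[w].add(words[0])

# every word becomes a record: its own synonyms (if it had a record),
# plus the head words of all records that mention it
with open('mthesaur-full.txt', 'w') as f:
    for w in sorted(heads):
        syns = set(records.get(w, ()))
        syns.update(h for h in heads[w] if h != w)
        f.write(w + ',' + ','.join(sorted(syns)) + '\n')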

Alternatively, we could avoid modifying the database and instead scan it entirely to find a word, whether it appears at the start or anywhere within a record. For a 23MB file, this shouldn’t take long on modern machines.
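A minimal version of that full-scan approach could look like this:

import sys

query = sys.argv[1].lower()
with open('mthesaur.txt') as f:
    for line in f:
        words = line.strip().split(',')
        if query in words:
            # print the record, minus the query word itself
            print('\n'.join(w for w in words if w != query))
            break  # or keep scanning and merge all matching records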

Dictionary

One of the most well-known databases for word definitions, semantic relationships, and usage examples is WordNet. It’s structured into a set of text files, each dedicated to a specific word category: nouns, verbs, adjectives, and adverbs. These files contain metadata (lines that start with whitespace) and “synsets” (sets of synonyms that share the same meaning).

Each synset is structured like this:

synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...] [frames...]  |  gloss

What we care about are the words in each synset and the gloss that follows the “|” separator. Each synset can contain multiple words sharing the same meaning, so we also need the w_cnt field (a hexadecimal number) to know how many words are listed in that synset, since the word we’re looking for may not be the first one there.

Here’s a Python script that parses WordNet data and prints the definitions and usage examples for a given word:

import sys

with open("wordnet.txt") as f:
    for line in f:
        if line.startswith(" "): continue  # ignore dictionary metadata
        parts = line.split(" | ")  # Split gloss from the rest of the data
        meta = parts[0].split()  # The first part contains record metadata
        gloss = parts[1].strip() if len(parts) > 1 else ""  # Gloss (definition)
        w_cnt = int(meta[3], 16)  # The word count is in hexadecimal
        for i in range(w_cnt):
            # Every second field after the 4th one is a word
            word = meta[4 + 2 * i].lower()
            if word == sys.argv[1].lower():
                lines = [l.strip() for l in gloss.split(";")]
                print("\n".join(lines))
                print()

This script is quite straightforward: it reads the file line by line, skips the metadata, splits the useful data from the gloss, checks how many words are in the synset, and prints the definitions and usage examples if the word matches the input.

Here’s an example of how it works:

$ python3 wordnet.py format
the general appearance of a publication

the organization of information according to preset specifications (usually for computer processing)

divide (a disk) into marked sectors so that it may store data
"Please format this disk before entering data!"

determine the arrangement of (data) for storage and display (in computer science)

set (printed matter) into a specific format
"Format this letter so it can be printed out"

While this is a very basic dictionary tool, it gets the job done by displaying word definitions and examples directly from the WordNet data.

Rhymes

The last tool in our poet’s toolkit is a rhyming dictionary. A popular choice for this is the CMU Pronouncing Dictionary (cmudict), which provides phonetic transcriptions of over 100,000 words. It’s often used for speech recognition and linguistic analysis, but it also works well for finding rhymes.

The CMU dictionary comes as a simple text file where each line contains a word followed by its phonetic transcription. The original file contains the word “déjà” in a legacy (non-UTF-8) encoding, which breaks Unicode parsers, so we should either remove that line, replace it with a proper Unicode spelling, or simply read the file as Latin-1, as the script below does.

To extract rhymes, we reverse the phonemes of each pronunciation and compare the ending sounds of words. This allows us to find exact rhymes like “house” and “mouse” but not false matches like “moose.”

Of course, there are many more types of rhymes: we could compare only the vowels, or the beginnings of words (alliteration), or allow half-rhymes where only the final consonants match.

We will only be looking for perfect rhymes. The CMU dictionary marks stress by appending a digit to vowel phonemes: a “0” means the syllable is unstressed, a “1” marks the primary stress, and a “2” marks a secondary stress. We will strip the unstressed and secondary-stress markers and keep the primary ones in place, so that a stressed “OW1” will match another stressed syllable, but will not match an unstressed “OW”.

Here’s how our rhyming approach can be done:

import sys

cmu = [] # a list of tuples (word, reversed phonemes)

with open('cmudict-0.7b', encoding='latin-1') as f:  # the raw file is not valid UTF-8
    for line in f:
        if line.startswith(';;;'): continue # skip metadata
        word, *pron = line.split()
        # remove markers from words (some words have multiple pronunciations)
        word = word.rstrip("()123456789")
        # reverse and clean up pronunciation
        rpron = list(reversed([p.rstrip("02") for p in pron]))
        cmu.append((word, rpron))

# return the part of the pronunciation until the primary stress (including it)
def base(x):
    out = []
    for i in x:
        out.append(i)
        if i.endswith('1'):  # stop at the stressed syllable
            break
    return out

for w, p in cmu:
    if w.lower() != sys.argv[1].lower(): continue
    # once the word is found: look for rhymes
    p = base(p)
    for u, q in cmu:
        q = base(q)
        # if the "tails" of the pronunciation match and the words are not
        # identical: print a rhyme!
        if p == q and u != w:
            print(u.lower())

Running the script looks like this:

$ python3 rhymes.py house | head
blouse
boathouse
bouse
brouse
...

$ python3 rhymes.py format
doormat

$ python3 rhymes.py orange
(no output)

For the few lines of code we’ve written, our dictionaries provide fairly good results. Of course, we’re doing a very naive search, while a proper dictionary tool would have to deal with much larger databases and would have to optimise the lookup.

For that, one can use a sorted index, where all the words in the dictionary are stored together with their offsets in the main dictionary file. Then, using a binary search, one can find the right word almost immediately.
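For example, here is a sketch of such an index for mthesaur.txt, kept in memory for brevity (a real tool would save it to a file next to the dictionary):

import bisect

# build the index: a sorted list of words with a parallel list of byte offsets
words, offsets = [], []
with open('mthesaur.txt', 'rb') as f:
    offset = 0
    for line in f:
        words.append(line.split(b',', 1)[0].decode())
        offsets.append(offset)
        offset += len(line)
# mthesaur.txt appears to be sorted already; otherwise sort both lists by word

def lookup(word):
    i = bisect.bisect_left(words, word)
    if i == len(words) or words[i] != word:
        return []
    with open('mthesaur.txt', 'rb') as f:
        f.seek(offsets[i])  # jump straight to the record
        return f.readline().decode().strip().split(',')[1:]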

Or an index may contain only the first two letters of each word, acting like a hashmap: we find the offset in the main dictionary by the first two letters of the word and then linearly scan the records until we find the one we are looking for.
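A sketch of that coarser index, assuming the records are sorted so that each two-letter bucket is contiguous:

# map the first two letters of a head word to the offset of its first record
index = {}
with open('mthesaur.txt', 'rb') as f:
    offset = 0
    for line in f:
        prefix = line.split(b',', 1)[0][:2].decode().lower()
        index.setdefault(prefix, offset)
        offset += len(line)

def lookup(word):
    with open('mthesaur.txt', 'rb') as f:
        f.seek(index.get(word[:2].lower(), 0))
        for line in f:
            words = line.decode().strip().split(',')
            if words[0] == word:
                return words[1:]
            if words[0][:2].lower() > word[:2].lower():
                break  # past the two-letter bucket: the word is not there
    return []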

One could also build an inverted index for synonyms, so that words related to any given synonym can be found quickly. For a rhyming dictionary, an inverted index of phonemes could help quickly locate words that end with similar sounds.
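For instance, reusing cmu and base() from rhymes.py above, a sketch of an inverted index of pronunciation tails makes the rhyme lookup itself instant:

import collections

# invert: the rhyming "tail" of a pronunciation -> all words that share it
rhyme_index = collections.defaultdict(list)
for w, p in cmu:
    rhyme_index[tuple(base(p))].append(w)

def rhymes(word):
    out = set()
    for w, p in cmu:  # a word may have several pronunciations
        if w == word.upper():
            out.update(rhyme_index[tuple(base(p))])
    out.discard(word.upper())
    return sorted(w.lower() for w in out)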

Tries are also a helpful data structure for dictionaries: we can use them to auto-complete words, or to look up rhymes by storing the reversed pronunciations in a trie.
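Here is a minimal trie sketch built from plain nested dictionaries; feeding it words gives auto-completion, and feeding it reversed pronunciations would give a rhyme lookup:

trie = {}

def insert(word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node['$'] = True  # end-of-word marker

def walk(node, acc):
    if '$' in node:
        yield acc
    for ch, child in node.items():
        if ch != '$':
            yield from walk(child, acc + ch)

def complete(prefix):
    node = trie
    for ch in prefix:
        if ch not in node:
            return
        node = node[ch]
    yield from walk(node, prefix)

# insert('zebra'); insert('zeal'); list(complete('ze')) -> ['zebra', 'zeal']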

And of course there are useful algorithms for language processing, such as stemming (cutting a word down to its stem, base or root form). With stemming, a search for “zebras” would find the entry for “zebra”, which our simplistic solution would miss.
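Real stemmers like Porter’s are algorithms of their own, but even a naive suffix-stripping sketch handles the “zebras” case:

def stem(word, vocabulary):
    if word in vocabulary:
        return word
    # try stripping a few common suffixes until we hit a known word
    for suffix in ('s', 'es', 'ing', 'ed'):
        if word.endswith(suffix) and word[:-len(suffix)] in vocabulary:
            return word[:-len(suffix)]
    return word

# stem('zebras', thesaurus) == 'zebra'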

Linguistic programming is fun and unusual. If the topic sounds appealing to you, you may find inspiration in building an anagram finder, a crossword generator, a Wordle solver, a word cloud generator or even a basic spell checker. Have fun!

I hope you’ve enjoyed this article. You can follow (and contribute to) this blog on Github, Mastodon, Twitter, or subscribe via RSS.

Sep 18, 2024
