Wheeee programming time

If you haven’t noticed already, I really like portmanteaus.  What probably isn’t yet apparent is how much I like making silly programs.  Recently I stumbled on a piece about Markov chains and decided to rework my old word-generation algorithm, which I’d adapted from a description on a video game forum.  To give you an idea of what this does, here is a list of words I’ve generated from Shakespeare’s Hamlet.  I’ve made no effort to dress this up; these are the first 30 words generated when I ran this code.

methou

treache

temperfect

actice

therly

expectation

collate

prepardon

wonderstanding

controve

repend

ther

anothere

matrong

heave

discomfortiness

scuffs

penetrabless

importunes

meritable

wronge

liver

fathere

accordinant

conce

solding

christ

buildenstern

readed

ealers

You’ll note that a number of these words are actual English words that weren’t present in the original text, illustrating a remarkable tendency of this algorithm to follow normal pronunciation and word-building rules.  (No, “Christ” never appeared in the original text, though “Christian” appears a number of times.)  In other cases, such as the humorous wonderstanding and buildenstern, it’s easy to see how these words would come about via natural overlapping of syllables.  The implementation is pretty standard, but I think it’s an interesting exercise and illustrative for someone new to Python.

I like to think of Markov chains as state machines, flipping to new states with fixed probabilities.  They are a rather abstract subject, but they find applications in fields as diverse as biology and finance thanks to their inherently random nature.  One traditional application of Markov chains is generating gibberish text.  While such an algorithm may not be capable of passing the traditional Turing test, it is often used for slipping past spam filters (a so-called reverse Turing test).
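As a toy illustration (the states and probabilities here are invented for the example), a Markov chain really is just a lookup table of fixed transition probabilities that you sample from repeatedly:

```python
import random

# Made-up two-state "weather" chain: each state lists its possible next
# states along with fixed transition probabilities.
transitions = {
    "sunny": (["sunny", "rainy"], [0.8, 0.2]),
    "rainy": (["sunny", "rainy"], [0.4, 0.6]),
}

state = "sunny"
for _ in range(10):
    states, weights = transitions[state]
    state = random.choices(states, weights=weights)[0]
    print(state, end=" ")
print()
```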

In this instance, I’ve chosen to link individual letters together instead of words.  Like the words in sentences, letters are naturally grouped together in distinct, albeit interrelated units that we can (naively) call syllables.  My first attempt at this algorithm actually broke words into 2- to 3-letter segments, recorded their relative adjacencies, translated this to probabilities, and then navigated the network of “syllables” to attempt to create novel words.

However, that technique had one major flaw that kept me from developing it further.  Size-2 syllables had little trouble forming new words, but size-3 syllables struggled to form any words at all.  This is understandable, since any word they built would have to have a length divisible by 3.  Combining 2- and 3-letter syllables led to the best output, although it was only marginally better than just using the small, 2-letter syllables.

The new algorithm is considerably simpler and produces much better output.  Instead of comparing groups of letters, short fragments of letters (their length set by the order of the Markov chain; more on that in a bit) are each assigned a list of possible successive letters.  In this way, the new word is marched along one letter at a time.  I’m sure this is extremely confusing and not very well explained at this point, so let’s jump into the code.
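The original listing isn’t reproduced here, so below is a minimal sketch of how the class might look, pieced together from the descriptions in the next few paragraphs.  The names Language, order, words, generated, learn, and make_word come from the prose; everything else (the syllables attribute, the regex cleanup) is my own guess, not the actual code.

```python
import random
import re


class Language:
    """Sketch of the generator described below."""

    def __init__(self, order):
        self.order = order        # fragment ("syllable") length; fixed for life
        self.words = set()        # words seen in the source text
        self.generated = set()    # words this object has already produced
        self.syllables = {}       # fragment -> list of possible next letters

    def learn(self, line):
        """Record every word in one line of source text."""
        pad = " " * self.order
        # Simplified cleanup: keep runs of letters and apostrophes only.
        for word in re.findall(r"[a-z']+", line.lower()):
            self.words.add(word)
            padded = pad + word + pad
            # Slide a window of length `order` along the padded word and
            # note which letter follows each fragment.
            for i in range(len(padded) - self.order):
                fragment = padded[i:i + self.order]
                self.syllables.setdefault(fragment, []).append(padded[i + self.order])

    def make_word(self):
        """Build a new word one letter at a time until it ends naturally."""
        pad = " " * self.order
        while True:
            word = pad  # the all-space "start" syllable
            # Keep appending a randomly chosen successor of the last `order`
            # characters until we land back on the all-space "end" syllable.
            while not (word.endswith(pad) and len(word) > self.order):
                word += random.choice(self.syllables[word[-self.order:]])
            word = word.strip()
            # Reject anything taken straight from the source or made before.
            if word not in self.words and word not in self.generated:
                self.generated.add(word)
                return word
```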

Only a few bits of this are truly complicated.  When you instantiate a new Language object, you assign it an order.  The order is related to the size of each syllable and cannot be modified afterwards.  Higher orders generate the most natural-sounding words, but they also place stricter limitations on the range of output.  I’d never go any higher than 5 or any lower than 2.  The words above, generated from Hamlet, were constructed using a fifth-order Markov chain.  For large bodies of text this works well; if I were looking at a list of names, for example, I’d drop it down to 3 or 4.

After assigning the order, we define two empty sets and one empty dictionary (the empty brackets).  The two sets, words and generated, keep track of words from the source and words generated by the code, respectively, so repeats can be avoided and the list of generated words can be quickly referenced.  I used sets in place of list objects since we don’t need to check the entries in any particular order, and membership checks on sets are much faster than on lists.  However, I started out defining these as lists, since they are a little easier to work with; I’d recommend anyone else do the same unless they have a strong understanding of sets and their methods.  The dictionary will connect the syllables (strings of length order) with lists of individual letters (length-1 strings) for piecing words together.
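For what it’s worth, the difference is easy to see: a set answers “is this word in here?” with a hash lookup, while a list has to scan element by element.  A quick, artificial comparison (the numbers and sizes are arbitrary):

```python
import timeit

words_list = [f"word{i}" for i in range(100_000)]
words_set = set(words_list)

# Worst case for the list: the word we ask about isn't there at all.
print(timeit.timeit(lambda: "zzz" in words_list, number=100))  # scans the whole list
print(timeit.timeit(lambda: "zzz" in words_set, number=100))   # one hash lookup
```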

The learn and make_word functions are the most important part of this code.  The learn function is sent a line of text (I originally made the mistake of trying to process the entirety of Rudyard Kipling’s The Jungle Book in one go) and cleans it up for processing.  The line is then split into individual words, and each word is recorded before being broken into fragments via a rather complicated line of code.  The fragments, composed of a syllable and one additional letter, are compared against the current list of known syllables, and the dictionary is updated to reflect the new data.  Not too bad!  I’m sure this code could be optimized further, but it’s already rather fast; the entirety of The Jungle Book takes only a few seconds to learn.

One important note on the “fragment” code: there needed to be some control over the “start” and “end” of words.  An extremely bad way to do this would be to adapt the code to allow different-sized syllables, record the starting and ending syllables separately, and then assemble a separate list of these special syllables so the code knows how to handle them.  What I’ve done instead is append a fixed number of spaces to the beginning and end of each word.  This means there is one master “start” syllable, and letters found at the beginning or end of words are padded by spaces.  It’s a really simple solution, and it allows the algorithm to have fixed starting and stopping positions without adding any additional logic.
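As a concrete (hypothetical) example of what the padding buys you, here is every fragment/next-letter pair a third-order chain would record for the word “cat”, squeezed into a one-liner in the spirit of the “complicated line” mentioned above — illustrative only, not the original code:

```python
order = 3
word = "cat"
padded = " " * order + word + " " * order   # "   cat   "

# Every (fragment, next-letter) pair in one dense line:
fragments = [(padded[i:i + order], padded[i + order])
             for i in range(len(padded) - order)]
print(fragments)
# [('   ', 'c'), ('  c', 'a'), (' ca', 't'), ('cat', ' '), ('at ', ' '), ('t  ', ' ')]
```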

You’ll note that the dictionary containing the syllables will accumulate a lot of redundant entries.  This is intended.  Since we will make a random selection from this list, more common entries will be represented in their respective proportions.  This could create some issues with memory, but I’ve yet to encounter any serious slowdown even when I’ve loaded multiple source texts.  It also means that you can “force” certain sounds to be more common by learning the same source multiple times.  I used this to combine a short list of Aztec names with the top 1000 boys’ names for 2010.
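That proportional weighting falls straight out of a plain random choice: a successor that appears three times in the list is three times as likely to be picked.  A quick sanity check with made-up data:

```python
import random
from collections import Counter

# Suppose some syllable was followed by "e" three times and "a" once.
successors = ["e", "e", "e", "a"]

picks = Counter(random.choice(successors) for _ in range(10_000))
print(picks)  # roughly 3:1 in favour of "e"
```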

Once a list of syllables has been built up, it’s time to start building the words.  The make_word function should be pretty straightforward if you’ve been following up to this point.  The word starts as a string of spaces of length order, which, as you’ll recall, is the fixed “start” syllable.  A loop then adds letters one by one, using the list of syllables and a random choice function, until the “end” syllable is reached.  It then checks that the word is unique, and if not, it repeats the process.
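Tying it together with the sketch from earlier, generating a batch of words from a text file might look something like this (hamlet.txt stands in for whatever source you point it at):

```python
lang = Language(5)   # fifth-order chain, as used for the Hamlet words above
with open("hamlet.txt", encoding="utf-8") as source:
    for line in source:
        lang.learn(line)

for _ in range(30):
    print(lang.make_word())
```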

There are a few options and helper functions for easier input and output; they don’t require any explanation.  I think the main application of this sort of algorithm is in procedural generation of game worlds.  A few projects plan to use such a method for creating realistic (and culturally appropriate) region and NPC names.

I recommend checking out Project Gutenberg for source material, if you want to try making words.  Lists of popular baby names also make for good source material.  I’ve even used a list of all the Pokémon.