AI or ain't: Markov Chains

Eliza was an impressive chatbot for its era. But all of its success came from a carefully crafted set of rules that made it sound like a therapist. Markov chains are more flexible: they don't rely on strict rules, but instead use patterns and statistics to generate responses. Let's explore how Markov chains work and how to build one yourself.

Markov chains

A Markov chain is a mathematical model that describes a sequence of events where the outcome of each event depends solely on the state of the system at the previous event. Markov chains are often used to generate text based on the probability of transitioning from one word to another.
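
The implementation we will build below encodes this a bit differently, but as a minimal hand-made sketch of the idea (the words and probabilities here are invented purely for illustration), a chain is just a table of transition probabilities, and generating text is a random walk over that table:

package main

import (
  "fmt"
  "math/rand"
)

// A tiny hand-made chain: each word maps to the words that may follow it
// and the probability of each transition.
var chain = map[string]map[string]float64{
  "the": {"cat": 0.5, "dog": 0.5},
  "cat": {"sat": 1.0},
  "dog": {"sat": 1.0},
  "sat": {"on": 1.0},
  "on":  {"the": 1.0},
}

// next picks the following word looking only at the current one.
func next(w string) string {
  r := rand.Float64()
  for candidate, p := range chain[w] {
    if r < p {
      return candidate
    }
    r -= p
  }
  return ""
}

func main() {
  w := "the"
  for i := 0; i < 6; i++ {
    fmt.Print(w, " ")
    w = next(w)
  }
  fmt.Println() // e.g. "the dog sat on the cat"
}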

One of the most well-known examples of this technique is Mark V. Shaney, a bot that posted messages to Usenet newsgroups in the late 1980s. Created by Rob Pike et al., it was trained on a corpus from the Tao Te Ching (a classic Taoist text) and produced messages like this one:

People, having a much larger number of varieties, and are very
different from what one can find in Chinatowns accross the country
(things like pork buns, steamed dumplings, etc.) They can be cheap,
being sold for around 30 to 75 cents apiece (depending on size), are
generally not greasy, can be adequately explained by stupidity.
...
So I will conclude by saying that I can well understand that she might
soon have the time, it makes sense, again, to get the gist of my
argument, I was in that (though it's a Republican administration).

It sounds very much like English, and with a certain degree of imagination it might even make sense. Obviously, with some cherry-picking one can assemble fairly meaningful text generated by a Markov chain.

Building a Markov chain

Believe it or not, a typical implementation of a Markov chain is around 30 lines of code, including both the training part and the text generator.

We start with training. Reading the training text line by line (assuming that each line is a sentence of its own), we split each line into tokens (words), much like we did in Eliza, and process those tokens in order.

For each prefix word we store the word that follows it in the chain. It is also convenient to put the first word of every line into a list of possible sentence starters (otherwise the generated text may start mid-sentence and give away the fact that our chain is not that intelligent after all):

package main

import (
  "bufio"
  "fmt"
  "os"
  "strings"
)

func main() {
  chain := map[string][]string{} // prefix word -> words that may follow it
  starts := []string{}           // words that may start a sentence
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    words := strings.Fields(scanner.Text())
    if len(words) < 1 {
      continue
    }
    // The first word of each line is a possible sentence starter.
    starts = append(starts, words[0])
    // For every word, remember the word that follows it.
    for i := 0; i < len(words)-1; i++ {
      chain[words[i]] = append(chain[words[i]], words[i+1])
    }
  }
  fmt.Println(starts)
  for k, v := range chain {
    fmt.Println(k, v)
  }
}

// Let's use the following text from Margaret Atwood as an input:
//
// She's too young, it's too late, we come apart, my arms are held, and the
// edges go dark and nothing is left but a little window, a very little window,
// like the wrong end of a telescope, like the window on a Christmas card, an old
// one, night and ice outside, and within a candle, a shining tree, a family, I
// can hear the bells even, sleigh bells, from the radio, old music, but through
// this window I can see, small but very clear, I can see her, going away from me,
// through the trees which are already turning, red and yellow, holding out her
// arms to me, being carried away.
//
// The only possible starter would be "She's".
// Chain would contain around 80 prefix+candidates pairs:
//
// a       [little very telescope, Christmas candle, shining family,]
// already [turning,]
// an      [old]
// and     [the nothing ice within yellow,]
// apart,  [my]
// are     [held, already]
// ... 
// little  [window, window,]
// ...
// very    [little clear,]
// ...

This means that "a" may be followed by the words "little", "very", "telescope" and so on; "very" may be followed by "little" or "clear"; and "little" may only be followed by "window".

Proper Markov chains store a probability for each candidate word, but to keep things simple we store duplicates of words instead. This has a similar effect: in a candidate list ["A", "A", "A", "B"], a randomly picked word comes out as "A" with probability 0.75 and as "B" with probability 0.25.
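
If we ever want the explicit probabilities back, they are easy to recover from such a list. A small helper, shown here purely for illustration (the generator below does not need it), might look like this:

package main

import "fmt"

// probabilities converts a list of candidate words (which may contain
// duplicates) into a map of word -> probability.
func probabilities(candidates []string) map[string]float64 {
  counts := map[string]int{}
  for _, w := range candidates {
    counts[w]++
  }
  probs := map[string]float64{}
  for w, c := range counts {
    probs[w] = float64(c) / float64(len(candidates))
  }
  return probs
}

func main() {
  fmt.Println(probabilities([]string{"A", "A", "A", "B"}))
  // map[A:0.75 B:0.25]
}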

Training is done. Now, the generator part (it goes at the end of main and needs the math/rand package) is also rather straightforward:

w := starts[rand.Intn(len(starts))] // pick a random sentence starter
fmt.Print(w, " ")
defer fmt.Println()
for {
  candidates := chain[w] // all words that may follow the current one
  if len(candidates) == 0 {
    break // no candidates: the sentence ends here
  }
  next := candidates[rand.Intn(len(candidates))] // pick one at random
  fmt.Print(next, " ")
  w = next
}

We pick a random starter word and print it, then find all of its candidates, pick one at random, find the candidates for that word, pick again, and keep going until we reach a word with no candidates (indicating the end of a sentence). This algorithm produces text like this:

She’s too young, it’s too late, we come apart, my arms to me, being carried away.
She’s too young, it’s too young, it’s too young, it’s too late, we come apart, my arms are held, and ice outside, and nothing is left but through this window on a shining tree, a Christmas card, an old music, but very clear, I can hear the wrong end of a family, I can hear the radio, old one, night and yellow, holding out her arms are already turning, red and yellow, holding out her arms to me, being carried away.

For a chain trained on just one sentence it’s not too bad. How about training it on a larger text, say all of the Simpsons quotes?

You coulda been handed down. And one hundred spaceships...
Sure, boy. I don't have now! No cameras.
Oh, dammit! Run for him! Let's watch. I-I-I-I was just don't forget it. Time for the pull the Springfield Elementary. She's the stomach?
Mr. Simpson? I quit! I have directed three time has eaten an orphanage?
You know, whatever. Just wonderin' if he is secure.
Wow. Mental Patient, Hillbilly, or a terrible trouble. That's a little boy oh boy. Eww. I killed you. I am so much nicer table pockets, and probed, not thank God! Oh gee it's a grandmother?

Not very impressive. But we can make it better by increasing the prefix length. Currently the chain tracks transitions from one word to another, which often leads it off track. If we instead use two consecutive words as a prefix, the text quality should improve significantly.

Let's make the prefix length configurable and try again:

package main

import (
  "bufio"
  "fmt"
  "math/rand"
  "os"
  "strings"
)

func main() {
  n := 2 // prefix length
  chain := map[string][]string{}
  starts := []string{}
  scanner := bufio.NewScanner(os.Stdin)
  // Training
  for scanner.Scan() {
    words := strings.Fields(scanner.Text())
    if len(words) < n {
      continue
    }
    // Pad the line with empty tokens that mark the end of the sentence.
    words = append(words, make([]string, n)...)
    starts = append(starts, strings.Join(words[:n], " "))
    for i := 0; i < len(words)-n-1; i++ {
      prefix := strings.Join(words[i:i+n], " ")
      chain[prefix] = append(chain[prefix], words[i+n])
    }
  }
  // Generator
  w := starts[rand.Intn(len(starts))]
  fmt.Print(w, " ")
  defer fmt.Println()
  for {
    candidates := chain[w]
    if len(candidates) == 0 {
      break
    }
    next := candidates[rand.Intn(len(candidates))]
    if next == "" {
      break // an empty padding token means the training sentence ended here
    }
    fmt.Print(next, " ")
    // Shift the prefix window: drop its first word, append the new one.
    w = strings.Join(append(strings.Fields(w)[1:n], next), " ")
  }
}
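
Assuming the program is saved as markov.go and the corpus is a plain-text file with one sentence per line (the file names here are only examples), it can be run as go run markov.go < simpsons.txt; changing n at the top of main switches the prefix length.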

Here’s the text produced with different prefix lengths trained on “The Simpsons” corpus:

N=1:
> You see, I think a phone. Then how Democrats sold the phonies -- can we lost the following the luck! They couldn't have a Gamecock!
> Uh, uh, I gotta record for my life. Warmest regards, Love me give a veterinarian and maybe he'll never drank my ass. I'll cook dinner? Lasagna would come along with the cartoon. Jack and push some ducklings.
> That's why did you your expense!

N=2:
> Give me back my belly for four-and-a-half-months. When you call me a bigger ball of yarn. That's funny, I prefer a dead boyfriend!
> In the last time for this. I'm not dead, idiot.
> Yeah, someone really kill a man like me. My name is Stacy, but you wouldn't be teachin' there.
> Homer! You promised!

N=3:
> I always keep a Bible close to my Shauna, if someone so much as peels a ladybug decal off her fake fingernails, I'm blaming you.
> I shouldn't have put so much syrup on the pancakes.
> Now, guess how much of this is true!

As the prefix length grows, the text becomes more meaningful, but it also starts to repeat exact phrases from the training text. For English, a prefix length of 2 is generally considered a good compromise between creativity and meaningfulness.

Trying N=2 on a corpus of Paul Graham essays produces results in a similar vein.

All of the code from this post can be found on GitHub.

AI?

Due to its simplicity, the Markov chain model is often used in simple applications such as name generators and word prediction for keyboard input; even Google's PageRank algorithm was a variation of a Markov chain.
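
As a rough sketch of that last point (a toy graph with numbers invented purely for illustration, not how Google computes anything): PageRank treats the web as a Markov chain whose states are pages and whose transitions follow links, with a damping factor modelling occasional random jumps. A page's rank is the long-run probability of landing on it:

package main

import "fmt"

// A toy PageRank sketch: pages are states of a Markov chain, links are
// transitions, and a damping factor models occasional random jumps.
func main() {
  links := map[string][]string{
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"a"},
  }
  const damping = 0.85
  rank := map[string]float64{}
  for page := range links {
    rank[page] = 1.0 / float64(len(links))
  }
  // Power iteration: repeatedly redistribute rank along the links.
  for i := 0; i < 50; i++ {
    next := map[string]float64{}
    for page := range links {
      next[page] = (1 - damping) / float64(len(links))
    }
    for page, outs := range links {
      for _, out := range outs {
        next[out] += damping * rank[page] / float64(len(outs))
      }
    }
    rank = next
  }
  fmt.Println(rank) // the ranks converge to roughly a:0.39, b:0.21, c:0.40
}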

Although Markov chains are a clear step forward compared to Eliza, they are still stochastic parrots. However, the term "AI" (artificial intelligence) often goes hand in hand with "ML" (machine learning), and a Markov chain is an example of one of the most basic machine learning algorithms: it teaches itself some knowledge from data and uses it later. In the next part we'll look at some more advanced ML, such as neural networks.

Next part: Neural Networks

I hope you've enjoyed this article. You can follow and contribute to it on GitHub, Mastodon and Twitter, or subscribe via RSS.

Jan 02, 2024

See also: AI or ain't: Eliza and more.