You can see the actual code file here.

import string
import random
from collections import defaultdict

Removes the punctuation from a given string.

def remove_punction(text):
    return text.translate(str.maketrans("", "", string.punctuation))
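
For example, here is what it does to one line of the sample data shown further down (a quick check, not part of the script itself):

print(remove_punction("Low comedy's got the new shine."))
# Low comedys got the new shine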

Generates the trigrams of an array of elements. For example, if data = [a, b, c, d] then the output will be [[a,b,c], [b,c,d]].

def get_trigrams(data):
    for i in range(len(data) - 2):
        yield data[i:i + 3]
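
A quick illustration of the output (not part of the script):

print(list(get_trigrams(['a', 'b', 'c', 'd'])))
# [['a', 'b', 'c'], ['b', 'c', 'd']]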

Generates the n-grams of an array of elements. For example, if data = [a, b, c, d] and degree = 2 then the output will be [[a,b], [b,c], [c,d]]. This is just a generic version of the get_trigrams function, here for reference.

def get_ngrams(degree, data):
    for i in range(len(data) - degree + 1):
        yield data[i:i + degree]
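
And the same kind of check for the generic version (not part of the script):

print(list(get_ngrams(2, ['a', 'b', 'c', 'd'])))
# [['a', 'b'], ['b', 'c'], ['c', 'd']]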

Given a dictionary of keys and weights, choose a random key based on its weight.

def weighted_choice(states):
    n = random.uniform(0, sum(states.values()))
    for key, val in states.items():
        if n < val:
            return key
        n -= val

If we somehow fall through the loop (for example, because the dictionary is empty), don't make a choice.

    return None
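
As a rough sanity check (hypothetical weights, not part of the script), sampling repeatedly should return each key roughly in proportion to its weight:

counts = defaultdict(int)
for _ in range(10000):
    counts[weighted_choice({'a': 3, 'b': 1})] += 1
print(counts)
# counts['a'] should come out to roughly three times counts['b']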

if __name__ == '__main__':

Load some data from a text file named data.txt: open the file and read its contents into a variable. The file might look something like:

WORK IN PRAGUE\n Low comedy's got the new shine.\n Defiant luster.\n The three film score.\n No claims between updates and rehearsals.\n Always admired.\n

    with open("data.txt") as file:
        data = file.read()

Clean up the data by removing punctuation and replacing newlines (\n) with spaces. Cleaning is typically the hardest part, since you'll encounter a lot of weird stuff out in the wild. Splitting with no arguments (below) collapses runs of whitespace, so we don't end up with empty-string 'words'.

    data = remove_punction(data).replace('\n', ' ')
    parsed_data = data.split()
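
For instance, the first two lines of the sample file above come out as a flat list of words (a quick check, not part of the script):

    print(remove_punction("WORK IN PRAGUE\n Low comedy's got the new shine.\n").replace('\n', ' ').split())
    # ['WORK', 'IN', 'PRAGUE', 'Low', 'comedys', 'got', 'the', 'new', 'shine']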

Create the Markov Chain. I use defaultdict as a simple (error-free) counter. This just means that every key in the dictionary has a default value: if a key isn't in the dictionary, a value is generated for it automatically. The lambda: defaultdict(int) means the default value will be another dictionary whose own default values are zero.

So if you wrote markov['test'] you would get a defaultdict. Going one step further, you could write markov['a']['b'], which would result in 0. This lets you write a counter in the form markov['a']['b'] += 1 without getting a KeyError.

    markov = defaultdict(lambda: defaultdict(int))
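
To make that counting behaviour concrete (a throwaway demo, not part of the script):

    demo = defaultdict(lambda: defaultdict(int))
    demo[('a', 'b')]['c'] += 1
    print(demo[('a', 'b')]['c'])   # 1 -- no KeyError was raised
    print(demo[('x', 'y')]['z'])   # 0 -- the keys are created on the fly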

Now split the input data into trigrams for parsing.

    for (w1, w2, w3) in get_trigrams(parsed_data):

Use lowercase to normalize the data.

        w1 = w1.lower()
        w2 = w2.lower()
        w3 = w3.lower()

Update the tally for this combination.

        markov[(w1, w2)][w3] += 1
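
For instance, if data.txt really were the sample shown earlier, the phrase "The three film score" would contribute tallies like these:

    # markov[('the', 'three')]['film'] == 1
    # markov[('three', 'film')]['score'] == 1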

You can use the following code for a preview of the Markov Chain probabilities: for state in markov: print('%r => %r' % (state, markov[state]))


Use random to pick a random initial state (or set it yourself).

    start_state = random.choice(list(markov.keys()))

You can preview it by writing print("Initial State: %s" % repr(start_state))


Now start 'walking' along the Markov Chain to generate text. Unfortunately, the hard part is knowing when to stop. Here I cap the output at 15 words beyond the starting pair, and if there's a dead end, stop immediately. Naturally, there are much better ways to do this.

    s1, s2 = start_state
    max_length = 15
    result = [s1, s2]
    for _ in range(max_length):
        next_state = weighted_choice(markov[(s1, s2)])
        if next_state is None:
            break
        result.append(next_state)
        s1 = s2
        s2 = next_state

Finally, join the generated words into a regular string.

    generated_text = ' '.join(result) + '.'
    print("Generated Text: %s" % generated_text)

Enjoy! Sometimes Markov Chains are great, sometimes they are finicky, and you probably need a lot of data to get coherent/good results from them. They're definitely a good starting place for content generation, though.