
Markovian

Word generation



Introduction

Markovian is a command-line utility for generating fake words or names. You give it one or more word lists; it analyses their structure, finds some of the underlying rules, and uses those rules to generate new words.

Installation

Precompiled releases

Precompiled releases are available for macOS, Windows and Linux (untested). You can download the most recent release from the releases page.

If you're using Windows you'll need to have the Microsoft Visual C++ Redistributable libraries installed.

Compiling from source

The source code is available on GitHub in the mikeando/markovian repository. You will need the nightly version of the Rust compiler and cargo installed. Then you simply need to run:

cargo build --release

Quick Start

Installing the binary

Download the release for your platform from the releases page and place it in an appropriate location.

Install a few word-lists

We'll use Moby_Names_M_lc.txt, which can be downloaded from the resources directory of the Markovian repository, and theological_angels.txt, one of the many word lists in the MarkovNameGenerator data.

Generate some words

First we'll generate some names from the Moby_Names_M_lc.txt file:

> markovian simple generate --encoding=string --count=5 \
  Moby_Names_M_lc.txt
stergiramessey
barnan
bralph
jerrel
jord

Next we'll use both of the word lists

> markovian simple generate --encoding=string --count=5 \
  Moby_Names_M_lc.txt theological_angels.txt
harutchel
gord
waylinton
cord
benatton

We can also request a specific prefix and/or suffix

> markovian simple generate --encoding=string --count=5 \
  Moby_Names_M_lc.txt theological_angels.txt \
  --prefix=jo
joenjamey
jophield
jon
jos
johnaterrius
> markovian simple generate --encoding=string --count=5 \
  Moby_Names_M_lc.txt theological_angels.txt \
  --suffix=io
alio
meodovio
hatrizio
antonio
allio
> markovian simple generate --encoding=string --count=5 \
  Moby_Names_M_lc.txt theological_angels.txt \
  --prefix=jo --suffix=io
johansalvandrio
jonancio
jorrizio
josidarrio
jonancilio

Speeding it up

On large word-lists the `simple generate` command can be slow, as it needs to process a lot of data. We can make the word generation run very quickly by precomputing everything it needs.

Conceptually, creating the preprocessed data takes three steps:

  1. Determining the input symbols.
  2. Identifying and combining symbols that occur together.
  3. Building the triplet map.

Generating the initial symbol table

The first stage generates a symbol table from the input word lists:

markovian symbol-table generate --encoding=string \
  --output=A.symboltable --input=word-list-1.txt --input=word-list-2.txt

This generates a symbol table file called A.symboltable containing all the letters from the two input word lists. You'll want to use a more descriptive name for the output.

You can see the list of symbols it uses

markovian symbol-table print --input=A.symboltable

For example

> markovian symbol-table generate --encoding=string --output=Moby_initial.symboltable --input=Moby_Names_M_lc.txt 
using 3878 input strings
found 30 symbols
wrote Moby_initial.symboltable 
> markovian symbol-table print --input=Moby_initial.symboltable
encoding: char
max symbol id: 30
0 => START
1 => END
2 => a
3 => r
4 => o
5 => n
...
24 => x
25 => z
26 => v
27 => '
28 =>  
29 => q
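
To make the listing above concrete, here is a minimal Python sketch of how an initial symbol table could be built. It is an illustration only, not Markovian's own code (which is written in Rust). The function name build_symbol_table is hypothetical, and handing out ids in order of first appearance is an assumption rather than documented behaviour.

```python
def build_symbol_table(words):
    """Map START, END and every distinct character to an integer id."""
    # Ids 0 and 1 are reserved, mirroring the START/END entries in the listing.
    table = {"START": 0, "END": 1}
    for word in words:
        for ch in word:
            # Assumption: a new character takes the next free id.
            if ch not in table:
                table[ch] = len(table)
    return table


if __name__ == "__main__":
    table = build_symbol_table(["aaron", "abe"])
    for symbol, symbol_id in table.items():
        print(symbol_id, "=>", symbol)
```

Every later stage works with these ids instead of raw characters, which is what allows several characters to be folded into a single compound symbol afterwards.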

Combining symbols

This step takes an existing symbol table file, looks for symbols that frequently occur together in the input, and combines them into compound symbols.

markovian symbol-table improve A.symboltable --output=B.symboltable word-list-1.txt word-list-2.txt

For example

> markovian symbol-table improve Moby_initial.symboltable --output Moby_50.symboltable resources/Moby_Names_M_lc.txt 
...
> markovian symbol-table print --input=Moby_50.symboltable
encoding: char
max symbol id: 80
0 => START
1 => END
2 => a
3 => r
...
29 => q
30 => er
31 => ar
...
75 => em
76 => ab
77 => do
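
How the symbols to combine are chosen is not described here. One plausible approach, sketched below in Python, is a greedy merge in the style of byte-pair encoding: count every adjacent pair of symbols, fuse the most frequent pair into one symbol, and repeat. The names merge_pair and improve_symbols and the tie-breaking rule are assumptions made for illustration, not Markovian's actual algorithm.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` in `tokens` with one joined symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def improve_symbols(words, steps):
    """Run up to `steps` greedy merges; return the new compound symbols in creation order."""
    tokenised = [list(word) for word in words]
    new_symbols = []
    for _ in range(steps):
        pairs = Counter()
        for tokens in tokenised:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Assumption: pick the most frequent pair, breaking ties alphabetically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_symbols.append(best[0] + best[1])
        tokenised = [merge_pair(tokens, best) for tokens in tokenised]
    return new_symbols


if __name__ == "__main__":
    print(improve_symbols(["peter", "roger", "jerry", "bert"], steps=1))  # ['er']
```

Under this scheme, rerunning the step on its own output naturally produces longer compounds, because an already merged symbol such as er can itself take part in a later merge.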

We can then see how this symbol-table breaks up words using

> markovian symbol-table symbolify --symbol-separator="." Moby_50.symboltable johnathon stephan arnold eric
johnathon => ["j.o.h.n.a.th.on"]
stephan => ["st.e.p.h.an"]
arnold => ["ar.n.ol.d"]
eric => ["er.i.c", "e.ri.c"]

We only show the shortest symbol sequences that produce the given word, but more than one sequence can have the same length. In the example above, eric can be written two ways.
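
Finding every shortest way to spell a word from a set of symbols can be done with a small dynamic program. The Python sketch below is one way to do it, not necessarily what symbolify does internally. The function name shortest_splits is hypothetical, and the symbol set is a hand-picked subset chosen to reproduce the eric example.

```python
def shortest_splits(word, symbols, separator="."):
    """Return every way to write `word` using the fewest symbols from `symbols`."""
    n = len(word)
    max_len = max(len(s) for s in symbols)
    inf = float("inf")
    # best[i] is the fewest symbols needed to spell word[i:].
    best = [inf] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if word[i:j] in symbols and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
    if best[0] == inf:
        return []  # the word cannot be spelled with these symbols

    def expand(i):
        # Follow only the choices that keep the total at the optimum.
        if i == n:
            yield []
            return
        for j in range(min(n, i + max_len), i, -1):  # longest symbol first
            if word[i:j] in symbols and best[j] + 1 == best[i]:
                for rest in expand(j):
                    yield [word[i:j]] + rest

    return [separator.join(parts) for parts in expand(0)]


if __name__ == "__main__":
    symbols = set("ceir") | {"er", "ri"}
    print(shortest_splits("eric", symbols))  # ['er.i.c', 'e.ri.c']
```

Both results use three symbols, so neither is preferred over the other, which is why both are listed.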

At the moment the improve command performs a fixed number (50) of symbol-combining steps. This will become configurable in the future.

If you want to combine more symbols you can rerun this stage on the new symbol table file.

Generating the triplet maps / generator

We create the triplet maps / generator file using

> markovian generator create B.symboltable --output=A.generator word-list-1.txt word-list-2.txt
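
The format of the generator file is not described here, but the idea of a triplet map can be sketched in a few lines of Python. The sketch assumes that a triplet is two consecutive symbols of context plus the symbol that follows them, and that each word is padded with two START symbols and one END symbol. The "^" and "$" markers and the name build_triplet_map are stand-ins invented for this example.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical stand-ins for the START and END symbols


def build_triplet_map(tokenised_words):
    """Count, for each pair of consecutive symbols, which symbols follow it."""
    triplets = defaultdict(Counter)
    for tokens in tokenised_words:
        padded = [START, START, *tokens, END]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            triplets[(a, b)][c] += 1
    return triplets


if __name__ == "__main__":
    # Words are given as symbol lists, e.g. the output of a symbolify-like step.
    words = [list("jon"), list("jos"), ["j", "o", "h", "n"]]
    triplets = build_triplet_map(words)
    print(dict(triplets[("j", "o")]))
```

Precomputing these counts once is what makes later generation fast: producing a word only needs lookups in the map, not another pass over the word lists.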

Generating words with the generator

markovian generator generate A.generator

You can add --prefix=prefix, --suffix=suffix and --count=N to this too.
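
For intuition, here is a self-contained Python sketch of generating words from a triplet map with optional prefix and suffix constraints. It is an illustration under stated assumptions, not the generator's real algorithm: the prefix is handled by seeding the walk, the suffix by simply discarding words that do not end correctly, and every name in the code (generate, generate_many, max_attempts and so on) is hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical stand-ins for the START and END symbols


def build_triplet_map(words):
    """Count which character follows each pair of consecutive characters."""
    triplets = defaultdict(Counter)
    for word in words:
        padded = [START, START, *word, END]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            triplets[(a, b)][c] += 1
    return triplets


def generate(triplets, rng, prefix="", max_len=30):
    """Random-walk the map starting from `prefix`; return None if the walk fails."""
    symbols = [START, START, *prefix]
    while len(symbols) < max_len + 2:
        followers = triplets.get((symbols[-2], symbols[-1]))
        if not followers:
            return None  # this context never occurred in the training words
        choices, weights = zip(*followers.items())
        nxt = rng.choices(choices, weights=weights)[0]
        if nxt == END:
            return "".join(symbols[2:])
        symbols.append(nxt)
    return None  # gave up: the word grew too long


def generate_many(triplets, count, prefix="", suffix="", seed=None, max_attempts=10000):
    """Collect up to `count` words; suffix handling is plain rejection sampling."""
    rng = random.Random(seed)
    words = []
    for _ in range(max_attempts):
        if len(words) == count:
            break
        word = generate(triplets, rng, prefix)
        if word is not None and word.endswith(suffix):
            words.append(word)
    return words


if __name__ == "__main__":
    training = ["jon", "john", "jonah", "joseph", "jordan", "antonio", "mario"]
    triplets = build_triplet_map(training)
    print(generate_many(triplets, count=5, prefix="jo", seed=1))
```

Rejection sampling is the simplest way to honour a suffix, but it wastes work when few words end the right way. A real implementation may well use something smarter.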

Reference

Details about all commands are available using markovian --help

Contact Me

You can contact me with feedback or issues through the issues page on GitHub.

Word Lists

If you know of any other great word-lists I'd love to add them here.