2014-05-07

ngraph: another random text generator.

This is the latest in my series of scripts that generate random text, both for fun and to help create secure passwords. This new set of scripts generates text by frequency-weighted random choice of ngraphs. I sometimes use it to generate word-like things for creating passwords. There may be other uses, for instance creating random text with English characteristics.

ngraph:  Generating text from frequency weighted letter combinations

An ngraph is a group of n consecutive letters occurring in a language. (This is my own definition; there may be another word for it, but I couldn't find it.) A set of ngraph frequencies records the number of times each ngraph is used in a group of texts. So for n=1 we have the frequencies of the single letters, for n=2 the frequencies of the digraphs, for n=3 the trigraphs, and so on. Because it is easy to do on a computer, I include some punctuation (space return - ' , . ; ! ? &). I use these files, among others:
The complete works of Jane Austen and the complete works of William Shakespeare.
Any texts would do. Gutenberg texts have a few oddities that the script is designed to work with.
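As a rough illustration of the counting step described above, here is a minimal Python sketch (not the Perl used in ngraph.pl). The function name count_ngraphs, the lowercasing, and the choice to let ngraphs run across word boundaries are assumptions made for the example; the real script may handle these details differently.

```python
from collections import Counter

# Assumed alphabet: lowercase letters plus the punctuation listed in the post
# (space, return, - ' , . ; ! ? &).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz \n-',.;!?&")

def count_ngraphs(text, n):
    """Return a Counter mapping each group of n characters to how often it occurs."""
    # Lowercase the text and drop any character outside the allowed set.
    cleaned = "".join(ch for ch in text.lower() if ch in ALLOWED)
    # Slide a window of width n across the cleaned text, one character at a time.
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

if __name__ == "__main__":
    sample = "the cat sat on the mat"
    # Show the three most frequent digraphs in the sample sentence.
    print(count_ngraphs(sample, 2).most_common(3))
```

Calling count_ngraphs once per value of n (1, 2, 3, ...) gives one frequency table per ngraph size.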

There are two main scripts (ngraph.pl and ngraph-db.pl) and a subsidiary script (dbfill.sh).

The first script is "ngraph.pl", which has two separate functions.
First, it reads a series of files, presumably containing text in a human language, and builds a set of ngraph frequencies for an "n" you specify.
Then it generates random text based on those ngraphs and their frequencies.
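The generation step can be sketched in Python as follows. This assumes the simplest reading of "frequency weighted random choice": each ngraph is drawn independently, with probability proportional to its count, and the draws are joined end to end. The name generate and the optional rng parameter are hypothetical, and ngraph.pl itself may work differently.

```python
import random

def generate(freqs, count, rng=None):
    """Join `count` ngraphs, each drawn with probability proportional to its frequency."""
    if rng is None:
        rng = random.Random()
    ngraphs = list(freqs)
    weights = [freqs[g] for g in ngraphs]
    # random.choices performs weighted sampling with replacement.
    return "".join(rng.choices(ngraphs, weights=weights, k=count))

if __name__ == "__main__":
    # Toy digraph table; real tables would come from counting a large body of text.
    toy = {"th": 50, "he": 40, "e ": 30, "an": 25, "in": 20}
    print(generate(toy, 10))
```

Passing a seeded random.Random makes a run repeatable, which is handy when experimenting but not something you would want for password material.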

Alternatively, it can just output the set of ngraph frequencies as text or SQL. The reason for this is that reading the files and building the ngraph frequency tables is a resource-intensive process, so I decided to create a database of ngraphs and generate text from that database. I found a database with n = 1 to 5 to be the most useful; above 5 the amount of data gets massive and more actual words are generated.

The second script, "dbfill.sh", is the subsidiary script. It creates the database and populates it with ngraphs using the first script.
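To make the idea of a pre-computed ngraph database concrete, here is a small Python sketch using SQLite. The post does not say which database engine dbfill.sh uses, so SQLite, the table name ngraphs, its columns, and the function fill_db are all assumptions chosen for illustration.

```python
import sqlite3
from collections import Counter

def fill_db(conn, text, max_n=5):
    """Store ngraph counts for n = 1..max_n in a hypothetical `ngraphs` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngraphs ("
        "n INTEGER, gram TEXT, freq INTEGER, PRIMARY KEY (n, gram))"
    )
    for n in range(1, max_n + 1):
        # Count every run of n consecutive characters in the text.
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        conn.executemany(
            "INSERT OR REPLACE INTO ngraphs (n, gram, freq) VALUES (?, ?, ?)",
            [(n, gram, freq) for gram, freq in counts.items()],
        )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    fill_db(conn, "the cat sat on the mat", max_n=3)
    for row in conn.execute(
        "SELECT gram, freq FROM ngraphs WHERE n = 2 ORDER BY freq DESC LIMIT 3"
    ):
        print(row)
```

Once the table is filled, generating text only needs a query per ngraph size rather than a fresh pass over the source texts.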

The third script is "ngraph-db.pl".
This uses the database and generates text based on the ngraphs it contains. Because it has access to a database with ngraphs of, say, n = 1 to 5, it can generate text from randomly sized ngraphs as well as from a single ngraph size.
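Mixing ngraph sizes could look something like the Python sketch below. It assumes that each step first picks a size n uniformly from a range and then makes a frequency-weighted choice among the ngraphs of that size. That selection rule, the function name generate_mixed, and the in-memory tables dictionary (standing in for the database) are assumptions for the example, not a description of how ngraph-db.pl actually chooses.

```python
import random

def generate_mixed(tables, count, min_n, max_n, rng=None):
    """Join `count` ngraphs whose sizes vary between min_n and max_n.

    `tables` maps each size n to a dict of {ngraph: frequency}.
    """
    if rng is None:
        rng = random.Random()
    pieces = []
    for _ in range(count):
        # Pick a size uniformly, then a weighted ngraph of that size.
        n = rng.randint(min_n, max_n)
        freqs = tables[n]
        grams = list(freqs)
        weights = [freqs[g] for g in grams]
        pieces.append(rng.choices(grams, weights=weights, k=1)[0])
    return "".join(pieces)

if __name__ == "__main__":
    # Toy tables; real ones would be loaded from the ngraph database.
    tables = {
        1: {"e": 12, "t": 9, "a": 8, " ": 18},
        2: {"th": 5, "he": 4, "in": 3},
        3: {"the": 3, "and": 2, "ing": 2},
    }
    print(generate_mixed(tables, 12, 1, 3))
```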

The generated text can include words, but it mostly consists of word-like things that are a bit memorable but not actual words. I never use the generated text directly as a password; instead I pick and choose bits and let parts of the text inspire a password.

The scripts are available under the GPL here.

Here is a sample output:
$ ngraph-db.pl -W -c 1000 -2 -g 1-4
