Analysing the Bible
The computer is a good tool in many areas, but within its defining field, computation, it is great. With over a million computations per second, even a large and heavy book (in its physical manifestation) can be sorted in the blink of an eye. A while ago I tried to sort the King James Version of the Bible.
Inspiration
During the last few years you may have encountered Jonathan Feinberg’s Wordle. This visualisation of word frequency in text has been popular for conveying writing patterns and showing established key terms, especially in texts where users have expressed themselves in single words (describe this BRAND with five adjectives). Which words we use when we express ourselves is important: the statistical frequency can give us an indication of important topics, trends, values etc., and it can also convey how languages change over time.
The Process
I chose to apply a relatively new language, both in terms of computer history and my computer skills: Python. Python is a flexible language, which is said to come with “batteries included”; in other words, much functionality is available in the standard library. Python also comes with an interactive interpreter, and many different frameworks are supported through ports. The logic of my little program is quite simple. It can very crudely be divided into four steps: 1) read the text file 2) for each word, create a counter if no previous occurrence is found, otherwise increment the existing counter 3) sort the occurrences by frequency 4) print each word with its frequency, separating word and frequency with a comma and entries with a newline, followed by the total number of words.
[cc lang="python"]
#!/usr/bin/env python3
import operator
import sys

if len(sys.argv) < 2:
    print("Error: Please provide a textfile as argument")
    sys.exit(1)
textfile = sys.argv[1]

# Punctuation, digits and line breaks are each mapped to a single space.
intab = ",.;:#[]()?!0123456789&<>-'\n\t\""
transtab = str.maketrans(intab, " " * len(intab))

try:
    with open(textfile, "r") as infile:
        linestring = infile.read()
except OSError:
    print("Error: Could not open file.")
    sys.exit(1)

linestring = linestring.translate(transtab).lower()
# split(" ") keeps an empty string between consecutive spaces,
# so the empty string is counted as a "word" as well.
items = linestring.split(" ")

# Count the occurrences of each word.
words = {}
for item in items:
    words[item] = words.get(item, 0) + 1

# Sort ascending by frequency, so the most frequent words come last.
sorted_words = sorted(words.items(), key=operator.itemgetter(1))

with open(textfile + "out.txt", "w") as f, open("testfile.test", "w") as t:
    for k, v in sorted_words:
        print(k, v)
        t.write(k + " " + str(v) + "\n")
        f.write(k + "," + str(v) + "\n")

# len(words) is the number of distinct words, not the running total.
print("The number of distinct words in " + textfile + " is " + str(len(words)))
[/cc]
The code is more complex than the steps explained above. It takes the file path of the text as an argument following the program name in the terminal, and it also prints simple error messages in case anything does not work.
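For comparison, the same counting steps can be sketched more compactly with `collections.Counter` from the standard library. This is only a minimal sketch, not the program that produced the results below: the function name `count_words` and the sample sentence are made up for illustration, and it strips every character in `string.punctuation` and splits on any whitespace, which differs slightly from the listing above.

```python
import string
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Replace punctuation and digits with spaces, similar to the listing
    # above but using the predefined character sets in the string module.
    unwanted = string.punctuation + string.digits
    table = str.maketrans(unwanted, " " * len(unwanted))
    # split() without arguments splits on any run of whitespace.
    return Counter(text.translate(table).lower().split())


if __name__ == "__main__":
    sample = "In the beginning God created the heaven and the earth."
    # most_common(n) returns the n most frequent (word, count) pairs.
    for word, freq in count_words(sample).most_common(3):
        print(f"{word},{freq}")
```

`Counter` takes care of both the "create or increment" step and the sorting, since `most_common()` returns the entries ordered from most to least frequent.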
Findings
The Swiss linguist Ferdinand de Saussure (1857 – 1913) divided language into langue and parole, French for language and speech, where the former is the impersonal, social structure of signs and the latter the personal phenomenon of language as speech acts. An example can be found in the game of chess. The simple structures defining the rules of the game can easily be understood, but the usage of these rules is what gives the game its complexity. Let us use this distinction while analysing the outfile of the program above.
Parole: The Bible is an interesting text. For the last two thousand years the book has been taken as law and as a life guide by many millions of people, and even today religious texts are used as legislation in a few countries in the world, and as a rule for how some live and organise their lives. The whole tradition of hermeneutics began with the study of the interpretation of religious texts, and wars have been fought over the analysis and the subsequent execution of actions described explicitly or implicitly. Our little test does not rely on semantic interpretation, but see what you will interpret from these words:
Love: 318
Hate: 87
Jesus: 990
God: 4531
Satan: 57
Jerusalem: 816
Langue: When Samuel Morse tried to make an efficient language for transferring messages over the wire in the 19th century, he looked to the English language and its use to find out how a message can be sent efficiently. To do this he went to typographers to see which letters they kept the most type for. Morse code (which became so popular that we can use it today as a generic name) is constructed with a short dot corresponding to ‘e’, and a long dash corresponding to ‘t’. These are the most frequent letters in the English language. So how does one write ‘z’ or ‘y’, letters that are less frequently used? ‘Y’ is dash-dot-dash-dash, and ‘z’ is dash-dash-dot-dot. You may at this point guess which words this little program found most often. Here are the 20 most frequent words:
them,6514
him,6695
not,6727
is,7119
be,7188
they,7490
lord,7990
a,8438
his,8563
i,8868
unto,9041
for,9130
shall,9851
he,10517
in,12891
that,13229
to,14048
of,35312
and,52167
the,64926
Some of the most frequent entries were removed since they had no semantic value. Before counting and sorting, several characters were replaced with whitespace and everything was lowercased.
This shows us that the most frequent words are in fact the small words with a more structuring function: prepositions, articles, conjunctions. We can also see that the word ‘lord’ is on the “top-20” list, and this may be related to the subject role the lord plays in many biblical sentences, e.g. the lord said, the lord told etc.
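To look past the structural words, one option is to filter them out before ranking. The sketch below uses a handful of the counts from the list above; the `FUNCTION_WORDS` set and the `content_words` helper are illustrative assumptions, not part of the original program, and a real analysis would need a much fuller stop-word list.

```python
# A few (word, frequency) pairs taken from the top-20 list above.
counts = {
    "the": 64926, "and": 52167, "of": 35312, "to": 14048,
    "shall": 9851, "unto": 9041, "lord": 7990, "not": 6727,
}

# Hypothetical, deliberately tiny stop-word list for illustration only.
FUNCTION_WORDS = {"the", "and", "of", "to", "shall", "unto", "not"}


def content_words(word_counts, stop_words):
    """Return (word, count) pairs not in stop_words, most frequent first."""
    kept = [(w, n) for w, n in word_counts.items() if w not in stop_words]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(content_words(counts, FUNCTION_WORDS))
```

With this small sample only ‘lord’ survives the filter, which matches the observation that it is the one word on the list carrying subject matter rather than structure.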
Program-wise, there is still potential for improvement in the program I wrote. There seems to be a parsing error causing a small group of the occurrences to be printed in a non-standard format: they are written with a comma before the word.
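One plausible candidate, offered as an assumption rather than a confirmed diagnosis, is the tokenising step: splitting on a single space leaves an empty string wherever two spaces meet, and leaves other whitespace such as carriage returns attached to words. A minimal sketch of the difference, using a made-up sample string:

```python
# Made-up sample with a double space and a stray carriage return.
text = "the lord  said\r unto them"

# Splitting on a single space keeps the empty string between the two
# spaces and leaves the carriage return glued to "said".
naive = text.split(" ")

# Splitting without an argument treats any run of whitespace
# (spaces, tabs, newlines, carriage returns) as one separator.
robust = text.split()

print(naive)
print(robust)
```

An empty-string token would be written to the output file as a line starting with a comma, so switching to `split()` may be worth trying first.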
Yesterday I received the book Visualizing Data by Ben Fry, one of the creators of Processing, so hopefully I will get some visual representations of data up and running soon.
If you want a copy of the counted and sorted file, that can be found here.
The Article Picture is named Bibles, and is the property of GeoWombats. The picture is licensed with Creative Commons and acquired through Flickr. Please refer here for more information.