Methodology
Search engines hold in their databases the sum of everything written on the Web, and in the near future, with the ever-increasing development and global pervasiveness of surveillance technologies, they may well be able to store everything said by human beings.
The question is then what to do with this huge amount of information. Exploiting it produced one of the most impressive post-9/11 IPOs, namely Google's in 2004, thanks to its famous AdWords system, after which Google became the most famous brand in the world. This exploitation relies on extremely sophisticated mathematical tools.
Recently there have been major breakthroughs in graph theory, such as the theory of scale-free networks developed by Albert-László Barabási around the year 2000. Since then, scientists have applied these revivified theories to language and quantitative linguistics, renewing a field that had made little progress since Claude Shannon founded information and communication theory in the late 1940s.
Step 1: Extracting massive amounts of textual information from Google
The Dadameter deals with language at large scale, and it needs some of the information Google holds in its belly. More precisely, what we are looking for is, for any two words, the number of pages that contain both of them. These numbers allow us to compute the correlation between two different words, i.e. the probability of going from one word to the other if we consider language as a global hypergraph of words and pages. These numbers are actually given by Google, as shown in the picture.
We start by building a list of words. For the present case we started with a list of about 7000 common English words (along with some non-English words and a few surnames). We then had to extract these figures for about 24.5 million pairs of words (7000 × 7000 / 2). Since we were mainly interested in words that look a bit like each other, this quantity was reduced to about 800 000 pairs.
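To fix ideas, here is a minimal Python sketch of this extraction step. The function name cooccurrence_counts and the result_count callback are hypothetical: result_count stands in for the engine-specific lookup of the number of result pages for a query (in practice we scanned Google's result pages), which is not shown here.

from itertools import combinations

def cooccurrence_counts(words, result_count):
    """Gather the page counts f(a) and f(a, b) for every pair of words."""
    f = {w: result_count(w) for w in words}              # f(a): pages containing a
    f_pair = {(a, b): result_count(f'"{a}" "{b}"')       # f(a, b): pages containing both
              for a, b in combinations(words, 2)}        # n x (n - 1) / 2 pairs
    return f, f_pair

# toy usage with a fake engine, so that the sketch runs end to end
fake_counts = {"billard": 120, "pillard": 45, '"billard" "pillard"': 12}
f, f_pair = cooccurrence_counts(["billard", "pillard"],
                                lambda q: fake_counts.get(q, 0))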
Step 2: Analyzing this information using recent breakthroughs in graph theory in order to understand the large-scale structure of language in terms of homophony and semantics
The information extracted from Google as described above allows us to classify our pairs of words according to semantic relatedness. As you will see, there are mathematical formulae in the framework of graph theory that allow us to do this. But first, remember that our starting point is Raymond Roussel's work, so let us look at the other aspect of the question: homophony, which is simpler and better understood.
1 - in terms of homophony: Damerau-Levenshtein distance
The question of homophony within information theory has been investigated since the 1950s. The Damerau-Levenshtein distance measures the amount of difference between two sequences (written words) and is therefore a measure of their homophonic resemblance. (This quantity does not capture the whole question of homophony, since it stays at the level of written letters rather than real phonetics, but it will be enough for us and it matches Roussel's method quite well, although he also made broader use of homophony.) For instance, the Damerau-Levenshtein distance between «billard» and «pillard» is low (it is 1 according to the definition below). The Damerau-Levenshtein distance between «kitten» and «sitting» is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten --> sitten (substitution of 's' for 'k')
sitten --> sittin (substitution of 'i' for 'e')
sittin --> sitting (insert 'g' at the end)
Precisely: in information theory and computer science, the Damerau–Levenshtein distance is a «distance» (string metric) between two strings, i.e. finite sequences of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
The first part of our work was then to compute the Damerau–Levenshtein distances for all 7000 × 7000 / 2 pairs. That was the easy part.
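As an illustration, the following Python function implements the common optimal-string-alignment variant of this distance. It is a sketch that reproduces the examples above, not the exact code we ran.

def damerau_levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                  # delete the whole prefix of a
    for j in range(n + 1):
        d[0][j] = j                                  # insert the whole prefix of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[m][n]

assert damerau_levenshtein("kitten", "sitting") == 3
assert damerau_levenshtein("billard", "pillard") == 1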
2 - in terms of semantic relatedness: Google Similarity distance
Recently there has been significant progress in the fields of graph theory and quantitative linguistics. A 2005 paper by Rudi L. Cilibrasi and Paul M.B. Vitányi defined a (pseudo-)distance called the Google Similarity Distance (or Normalized Google Distance) that reflects the semantic relatedness of two words, using information extracted from search engines (cf. Step 1). The distance between two words is smaller when the words are closely related, i.e. when they tend to appear more often in the same webpages.
This Normalized Google Distance is actually an improvement on a simpler and more understandable formula defining the similarity between two words a and b:

s(a, b) = f(a, b) / √(f(a) × f(b))

i.e. the number of pages containing both a and b, divided by the square root of (number of pages containing a × number of pages containing b). Then d(a, b) = 1 − s(a, b) is the dissimilarity between the two words.
This distance depends on the figures that we extracted from Google. After scanning about 800 000 pages of search results, we were able to obtain all these quantities for our 7000 words and 800 000 pairs of words. The results are impressive and seem to match semantic intuition quite well, as argued in the paper by Cilibrasi and Vitányi.
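For reference, here is how these quantities can be computed in Python once the page counts are known. The page counts in the example are made up; the ngd function follows the formula from Cilibrasi and Vitányi's paper, where n is the total number of pages indexed by the engine.

import math

def similarity(f_a, f_b, f_ab):
    """s(a, b) = f(a, b) / sqrt(f(a) * f(b)), as defined above."""
    return f_ab / math.sqrt(f_a * f_b) if f_a and f_b else 0.0

def ngd(f_a, f_b, f_ab, n):
    """Normalized Google Distance; n is the total number of indexed pages."""
    if f_ab == 0:
        return float("inf")                          # the words never co-occur
    la, lb, lab = math.log(f_a), math.log(f_b), math.log(f_ab)
    return (max(la, lb) - lab) / (math.log(n) - min(la, lb))

# made-up page counts, purely illustrative
d = 1.0 - similarity(120_000, 80_000, 9_000)         # dissimilarity d(a, b)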
3 - in terms of equivocation: clustering coefficient
An even more complex concept is the clustering coefficient of a graph. It is interesting for our study because, on a semantic graph, it helps define the degree of polysemy of a word, or equivocation (as used by Roussel in his method). Intuitively, if a word belongs to two clusters it tends to have two different meanings; if a word belongs to only one cluster, its meaning will most certainly be univocal.
Recent papers have studied the clustering coefficient on a semantic network by relating it to graph curvature. In our case we had to consider the problem on a hypergraph, and we in fact relied on a 2007 paper dealing with weighted graphs. Technically, we take the minimum of this coefficient for a given pair of words.
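To fix ideas, here is the standard unweighted local clustering coefficient (the quantity that the weighted generalizations we relied on extend), sketched in Python on a hypothetical toy semantic graph:

def local_clustering(adj, v):
    """Watts-Strogatz local clustering coefficient of vertex v: the fraction
    of pairs of neighbours of v that are themselves connected."""
    neigh = adj[v]
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for a in neigh for b in neigh if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

# toy semantic graph as an adjacency dict: "bank" sits between two clusters
# (river/water and money/loan), so its coefficient is low
adj = {
    "bank": {"river", "money", "loan"},
    "river": {"bank", "water"},
    "water": {"river"},
    "money": {"bank", "loan"},
    "loan": {"bank", "money"},
}
print(local_clustering(adj, "bank"))                 # 1/3: a rather equivocal word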
Here, the results become very difficult to check and to interpret. This coefficient is a function of the entire neighbourhood of a word, and computing its full value was simply impossible for us. We still lack a lot of information, and we had to make some approximations whose validity is not yet well settled. However, we expect that our study sheds some light on the subject.
At least qualitatively, it was important to take this aspect into account, since it is part of Roussel's method and since it allowed us to conceptualize our approach in terms of a map in an unexpected and interesting way. If you look at the Dadamap section, the Utilitarianism zone and the Equivocation line are directly related to this clustering coefficient (or curvature) approach.
Step 3: Visualizing these structures with maps, graphs or global indexes and interpreting the results
We are now in possession of huge lists of numbers, and we need to interpret and visualize this information. This is a difficult problem, but not a hopeless one. There are two basic techniques here: self-organizing maps and graph visualization.
1) A self-organizing map is a standard and very helpful technique for producing low-dimensional views of high-dimensional data while preserving the topological properties of the initial high-dimensional space. It is generated with a neural network that gathers together pixels of similar colours. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map. Here is an example of the generation of such a map with a very small set of pairs. At the beginning, coloured pixels are randomly picked, and they then organize themselves. In the Dadamap section you will discover our interpretation of the different zones, which is the main result of our study.
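As an illustration, the classic colour demo of a Kohonen map fits in a few lines of Python. All parameter values below (grid size, number of steps, learning rate, neighbourhood width) are arbitrary choices for the demo, not the settings of our actual map.

import numpy as np

def train_som(data, grid=(20, 20), steps=2000, lr0=0.5, seed=0):
    """Train a minimal Kohonen self-organizing map on the vectors in data."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))      # random initial map
    sigma0 = max(h, w) / 2.0                         # initial neighbourhood radius
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1.0          # shrinking neighbourhood
        x = data[rng.integers(len(data))]            # pick a random sample
        # best-matching unit: the cell whose weight vector is closest to x
        bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (h, w))
        # pull the winning cell and its grid neighbours towards the sample
        grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

# random RGB triples self-organize into smooth patches of similar colours
som = train_som(np.random.default_rng(1).random((500, 3)))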
2) Finally, our data were used to build an interactive graph, which is described in the R.R.Engine section.