Dadameter
Christophe Bruno
Dadameter ,Coworkers & Funding
It was coproduced by the Rencontres ParisBerlinMadrid 2008 for contemporary art and new cinema. Concept: Christophe Bruno  programming: Valeriu LacatusuA project by Christophe Bruno (2002  2008)  Programming Valeriu Lacatusu
Description
Global index of the decay of the aura of language, the Dadameter aims at measuring our distance to Dada. It was inspired by the work of the french writer Raymond Roussel.
The project is a satire about the recent transmutation of language into a global market ruled by Google et al. and uses the most uptodate technologies of control to draw cartographies of language at large scale.
Imagine a world of transparency where all desires are negotiated on a global market and their risks are evaluated and optimized.
The Dadameter is a tool for the profiling of language at large scale and the historical tracking of artistic and literary movements. It brings about a new alliance between art, science and global finance.
We use uptodate 2.0 trend analysis and data visualization technologies, neural networks, graph theory or quantitative structural linguistics in order to be able to predict the next artistic craze.
The project is a satire about the recent transmutation of language into a global market ruled by Google et al. and uses the most uptodate technologies of control to draw cartographies of language at large scale.
Imagine a world of transparency where all desires are negotiated on a global market and their risks are evaluated and optimized.
The Dadameter is a tool for the profiling of language at large scale and the historical tracking of artistic and literary movements. It brings about a new alliance between art, science and global finance.
We use uptodate 2.0 trend analysis and data visualization technologies, neural networks, graph theory or quantitative structural linguistics in order to be able to predict the next artistic craze.
Keywords

aesthetics
Technology & Material
Software
Methodology
Search engines hold in their database the sum of everything that is being written on the Web, and, in the next future, with the ever increasing development of surveillance technologies and global pervasiveness, they might just be able to store everything that is being said by human beings.
The question is then what to do with this huge amount of information. The exploitation of this information provided one of the most impressive post911 IPO, namely Google's, thanks to its famous Adwords system, in 2004, after what they became the most famous brand in the world. It is done thanks to extremely sophisticated mathematical tools.
Recentely there have been huge breakthroughs in graph theory, like the theory of scalefree networks by the mathematician AlbertLászló Barabási around the year 2000. Since then, scientists have been using these revivified theories in the framework of language and quantitative linguistics, renewing a field that had not made much progress since the advent of information and communication theory by Claude Shannon in the 1950's.
Step 1: Extracting massive amount of textual information from Google
The Dadameter deals with language at large scale and it needs some of these information Google has in her belly. More precisely, what we are looking for are the number of pages that contain two different words, for any two words. These numbers allow to compute the correlations between two different words, ie. the probalility to go from one word to the other if we consider language as a global hypergraph of words and pages. These numbers are actually given by Google as shown in the picture.
We start by building a list of words. For the present case we started with a list of about 7000 current english words (and also some nonenglish as well as a few surnames). We have then to extract these figures for about 2.5 millions couples of words (7000 x 7000 / 2). Since we were mainly interested in words that look like a bit each other, this quantity is reduced to about 800 000 couples
Step 2: Analyzing this information using recent breakthroughs in the field of graph theory in order to understand the large scale structure of language in terms of homophony and semantics
The information extracted from Google as described above allow us to classify our couples of words according to semantic relatedness. As you will see there are mathematical formulae that allow to do that in the framework of graph theory. But first, remember that our starting point is Raymond Roussel's work and let's have a look at the other aspect of the question: homophony, which is quite easier and well known.
1  in terms of homophony: DamerauLevenshtein distance
The question of homophony within information theory has been investigated since the 1950's. DamerauLevenshtein distance allows to measure the amount of difference between two sequences (written words) and therefore is a measure of their homophonic resemblance (not all the question of homophony is described by this quantity since it stays on the level of written letters and not real phonetics but it will be enough for us and matches quite well with the method of Roussel  although he also made a broader use of homophony). For instance the DamerauLevenshtein distance between «billard» and «pillard» is low (it is 1 according to the definition below). The DamerauLevenshtein distance between «kitten» and «sitting» is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten > sitten (substitution of 's' for 'k')
sitten > sittin (substitution of 'i' for 'e')
sittin > sitting (insert 'g' at the end)
Precisely, In information theory and computer science, Damerau–Levenshtein distance is a «distance» (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters.
A first part of our work was then to compute all the Damerau–Levenshtein distances between the 7000 x 7000 / 2 couples. That was the easy part.
2  in terms of semantic relatedness: Google Similarity distance
Very recentely there have been important progress in the field of graph theory and quantitative linguistics. A recent paper from 2005 by Rudi L. Cilibrasi and Paul M.B. Vit´anyi, has defined a (pseudo)distance called Google Similarity Distance (or Normalized Google Distance) that reflects the semantic relatedness of two words, using information extracted from search engines (cf Step 1). The distance between two words is smaller when both words are closely related, ie. when they tend to appear more often in the same webpages.
This Normalized Google Distance is actually an improvement of a simpler and more understandable formula defining the similarity between two words a and b:
s(a, b) = f(a,b) / sqrt (f(a) x f(b)) = Number of pages containing both a and b divided by the squareroot of (Number of pages containing a × Number of pages containing b).
Then d(a, b) = 1 — s(a,b) is the disimilarity between both words.
This distance depends on the figures that we have extracted from Google. After having scanned about 800 000 pages of search results, we were able to get all these quantities for our 7000 words and 800 000 couples of words. Results are impressive and seem to match quite well with semantic intuition, as it is argued in the paper by Cilibrasi and Vit´anyi.
3  in terms of equivocation: clustering coefficient
An even more complex concept is the clustering coefficient on a graph. It is interesting for our study because on a semantic graph, it helps at defining the degree of polysemy of a word, or equivocation (as it is used by Roussel in his method). Intuitively if a word belong to two clusters it tends to have two different meanings. If a word belong to only one cluster, its meaning will most certainly be univocal.
Recent papers have studied clustering coefficient on a semantic network, by relating it to the graph curvature. In our case we had to consider the problem of an hypergraph and we used in fact a paper from 2007 dealing with weighted graphs. Technically we take the minimum of this coefficient for a given couple of words.
Here, results become very difficult to check and to interprete. This coefficient is function of the entire neighborhood of a word and calculating its full value was just impossible for us. We still lack a lot of information, and we had to make some approximations whose validity is not very well settled yet. However we expect that our study shed some lights on the subject.
At least qualitatively it was important to take this aspect into account since it is part of Roussel's method and since it allowed us to conceptualize our approach in terms of map in an unexpected and interesting way. If you look at the Dadamap section, the Utilitarianism zone and the Equivocation line, are directely related to this clustering coefficient (or curvature) approach.
Step 3: Visualizing these structures with maps, graphs or global indexes and interpreting the results
We are now in possession of huge lists of numbers and we need to interprete and visualize this information. This is a diffcult problem but not hopeless. There are two basic technics here: selforganizing maps and graphs visualization.
1) A selforganizing map map is a standard and very helpful technique used for visualizing lowdimensional views of highdimensional data by preserving the topological properties of the highdimensional initial space. It is generated with a neural network that gathers together pixels wich are of similar colours. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map. Here is an example of a generation of such a map with a very small set of couples. At the beginning, coloured pixels are randomly picked and they organize themselves. In the Dadamap section you will discover our interpretation of the different zones, which is the main result of our study.
2) Finally our data have been used for building an interactive graph. It is described in the R.R.Engine section.
Search engines hold in their database the sum of everything that is being written on the Web, and, in the next future, with the ever increasing development of surveillance technologies and global pervasiveness, they might just be able to store everything that is being said by human beings.
The question is then what to do with this huge amount of information. The exploitation of this information provided one of the most impressive post911 IPO, namely Google's, thanks to its famous Adwords system, in 2004, after what they became the most famous brand in the world. It is done thanks to extremely sophisticated mathematical tools.
Recentely there have been huge breakthroughs in graph theory, like the theory of scalefree networks by the mathematician AlbertLászló Barabási around the year 2000. Since then, scientists have been using these revivified theories in the framework of language and quantitative linguistics, renewing a field that had not made much progress since the advent of information and communication theory by Claude Shannon in the 1950's.
Step 1: Extracting massive amount of textual information from Google
The Dadameter deals with language at large scale and it needs some of these information Google has in her belly. More precisely, what we are looking for are the number of pages that contain two different words, for any two words. These numbers allow to compute the correlations between two different words, ie. the probalility to go from one word to the other if we consider language as a global hypergraph of words and pages. These numbers are actually given by Google as shown in the picture.
We start by building a list of words. For the present case we started with a list of about 7000 current english words (and also some nonenglish as well as a few surnames). We have then to extract these figures for about 2.5 millions couples of words (7000 x 7000 / 2). Since we were mainly interested in words that look like a bit each other, this quantity is reduced to about 800 000 couples
Step 2: Analyzing this information using recent breakthroughs in the field of graph theory in order to understand the large scale structure of language in terms of homophony and semantics
The information extracted from Google as described above allow us to classify our couples of words according to semantic relatedness. As you will see there are mathematical formulae that allow to do that in the framework of graph theory. But first, remember that our starting point is Raymond Roussel's work and let's have a look at the other aspect of the question: homophony, which is quite easier and well known.
1  in terms of homophony: DamerauLevenshtein distance
The question of homophony within information theory has been investigated since the 1950's. DamerauLevenshtein distance allows to measure the amount of difference between two sequences (written words) and therefore is a measure of their homophonic resemblance (not all the question of homophony is described by this quantity since it stays on the level of written letters and not real phonetics but it will be enough for us and matches quite well with the method of Roussel  although he also made a broader use of homophony). For instance the DamerauLevenshtein distance between «billard» and «pillard» is low (it is 1 according to the definition below). The DamerauLevenshtein distance between «kitten» and «sitting» is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten > sitten (substitution of 's' for 'k')
sitten > sittin (substitution of 'i' for 'e')
sittin > sitting (insert 'g' at the end)
Precisely, In information theory and computer science, Damerau–Levenshtein distance is a «distance» (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters.
A first part of our work was then to compute all the Damerau–Levenshtein distances between the 7000 x 7000 / 2 couples. That was the easy part.
2  in terms of semantic relatedness: Google Similarity distance
Very recentely there have been important progress in the field of graph theory and quantitative linguistics. A recent paper from 2005 by Rudi L. Cilibrasi and Paul M.B. Vit´anyi, has defined a (pseudo)distance called Google Similarity Distance (or Normalized Google Distance) that reflects the semantic relatedness of two words, using information extracted from search engines (cf Step 1). The distance between two words is smaller when both words are closely related, ie. when they tend to appear more often in the same webpages.
This Normalized Google Distance is actually an improvement of a simpler and more understandable formula defining the similarity between two words a and b:
s(a, b) = f(a,b) / sqrt (f(a) x f(b)) = Number of pages containing both a and b divided by the squareroot of (Number of pages containing a × Number of pages containing b).
Then d(a, b) = 1 — s(a,b) is the disimilarity between both words.
This distance depends on the figures that we have extracted from Google. After having scanned about 800 000 pages of search results, we were able to get all these quantities for our 7000 words and 800 000 couples of words. Results are impressive and seem to match quite well with semantic intuition, as it is argued in the paper by Cilibrasi and Vit´anyi.
3  in terms of equivocation: clustering coefficient
An even more complex concept is the clustering coefficient on a graph. It is interesting for our study because on a semantic graph, it helps at defining the degree of polysemy of a word, or equivocation (as it is used by Roussel in his method). Intuitively if a word belong to two clusters it tends to have two different meanings. If a word belong to only one cluster, its meaning will most certainly be univocal.
Recent papers have studied clustering coefficient on a semantic network, by relating it to the graph curvature. In our case we had to consider the problem of an hypergraph and we used in fact a paper from 2007 dealing with weighted graphs. Technically we take the minimum of this coefficient for a given couple of words.
Here, results become very difficult to check and to interprete. This coefficient is function of the entire neighborhood of a word and calculating its full value was just impossible for us. We still lack a lot of information, and we had to make some approximations whose validity is not very well settled yet. However we expect that our study shed some lights on the subject.
At least qualitatively it was important to take this aspect into account since it is part of Roussel's method and since it allowed us to conceptualize our approach in terms of map in an unexpected and interesting way. If you look at the Dadamap section, the Utilitarianism zone and the Equivocation line, are directely related to this clustering coefficient (or curvature) approach.
Step 3: Visualizing these structures with maps, graphs or global indexes and interpreting the results
We are now in possession of huge lists of numbers and we need to interprete and visualize this information. This is a diffcult problem but not hopeless. There are two basic technics here: selforganizing maps and graphs visualization.
1) A selforganizing map map is a standard and very helpful technique used for visualizing lowdimensional views of highdimensional data by preserving the topological properties of the highdimensional initial space. It is generated with a neural network that gathers together pixels wich are of similar colours. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map. Here is an example of a generation of such a map with a very small set of couples. At the beginning, coloured pixels are randomly picked and they organize themselves. In the Dadamap section you will discover our interpretation of the different zones, which is the main result of our study.
2) Finally our data have been used for building an interactive graph. It is described in the R.R.Engine section.
Bibliography

Bruno, Christophe. »Google et audelà: Interview by Ariel Kyrou.« Multitudes 36 (2009).

Bruno, Christophe. »WJSpots1.« digital mcd: Musique et cultures digitales (May 2009).

Bruno, Christophe. »From Dada to Google.« Digimag 42 (2009).

Bruno, Christophe. »Video interview by Cyril Thomas.« .

Bruno, Christophe. »L`Abécédaire de Christophe Bruno.« Poptronics Boot04 (August 2007).