This post is part of our ongoing series of guest articles which explore web analytics and using Woopra. If you would like to contribute to this series, contact Lorelle at Woopra.com with your proposal.
As a social scientist I like to dream of my ideal data set. Every scientist does so once in a while, I imagine, wondering what questions could be answered if unlimited time, funds, and technological capacities were available. Wouldn't a rocket scientist want to gather some of the soil on every known planet? I think a cognitive scientist would love to experiment on more people than are usually available. What would a present-day physicist want? A Larger Hadron Collider (LHC) perhaps? Recently, a communication researcher saw such a dream actually come true, when he gained access to data on all mobile phone calls made in one country over a period of several weeks, resulting in 7,000,000 records.
The data needs of a social scientist like me are considerably more modest, but difficult enough to fulfill as they are. Performing surveys takes a big bite out of the budget available to many researchers, while the use of existing survey data (e.g. the large-scale World Values Survey) restricts the researcher in what (type of) questions can be answered, for only the predefined survey questions are available to the researcher.
Six Degrees of Separation
A famous theorem of the social sciences, whose study is continuously hampered by the limited availability of data, is often referred to as the 'Six degrees of separation'. Based on a 1929 story by Frigyes Karinthy, the Six Degrees of Separation theorem states that all people in the world are interconnected by chains of mutual friends which have an average length of 6. For instance, let's assume that I don't know my Prime Minister personally, but that I do know my girlfriend will attend an art reception, accompanied by a local politician. This local guy personally knows a member of parliament. I don't know that member of parliament, but we can suppose that he knows the Prime Minister, or at least has met him on occasion. In this example, there are four degrees of separation between me and the Prime Minister: (Me -> girlfriend -> local politician -> member of parliament -> Prime Minister).
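To make the counting in this example concrete, here is a minimal sketch in Python. It assumes that 'knowing someone' is mutual and that the whole network fits in a simple list of pairs; the function name and the data layout are made up for illustration, and the friendships are just the ones from the example above.

```python
from collections import deque

def degrees_of_separation(friendships, start, target):
    """Return the number of links on the shortest chain, or None if there is none."""
    # Build an undirected map of who knows whom (assumption: knowing is mutual).
    neighbours = {}
    for a, b in friendships:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search visits people in order of increasing distance,
    # so the first time we reach the target we have the shortest chain.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        person, distance = queue.popleft()
        if person == target:
            return distance
        for other in neighbours.get(person, ()):
            if other not in seen:
                seen.add(other)
                queue.append((other, distance + 1))
    return None

# The chain from the example in the text.
friendships = [
    ("me", "girlfriend"),
    ("girlfriend", "local politician"),
    ("local politician", "member of parliament"),
    ("member of parliament", "prime minister"),
]
print(degrees_of_separation(friendships, "me", "prime minister"))  # 4
```

Averaging this number over many pairs of people would give the 'average length' the theorem talks about, which is exactly where the need for data on who knows whom comes in.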
Personally, I do not find the basic six degrees of separation theorem that interesting by itself. However, it becomes all the more interesting when we connect it to issues of social inequality. We then come to realize that the number of degrees of separation is unequally distributed over social strata. This can have consequences for the amount of social mobility some people are able to achieve, especially since knowing people helps you to get a job, to get motivated, to feel confident about yourself, and so on. Thereby, the basic 'six degrees of separation' theorem is transformed into questions such as 'Who knows whom?', 'Who knows what from whom?', and 'What effect does knowing certain people have on your career?' In general, these kinds of questions cover the consequences of social capital, still a heavily researched topic.
How do we investigate 'who knows whom', exactly? When we want to connect this question to the consequences social capital has for people, the social scientist also needs to have information on a variety of characteristics of people, such as background and, for instance, career development. But it does not need to be all about the careers people have; another interesting and important research question would be to investigate what specific groups of people know and think about the lives of other groups of people. All in all, this clearly places high demands on the quality of the available survey data.
Six Blogs of Separation?
How do you obtain such high-quality data? There is no national or international register of friendships, or anything similar, and if there were, it would be the subject of great debate over privacy concerns. So where do we find the connections? The data on mobile phone calls I referred to above comes close, but does not enable the researcher to study characteristics of the individual callers, other than their mobile-phone behavior.
The same goes for the activities Google undertakes to map the Internet. Google already maps the way the information on the Internet is inter-connected (referring to the cell phones: all cell phones are potentially interconnected). Knowing the paths is not the same as knowing who 'walks' on these paths. When interested in social structures, that is the point where it might become very interesting. Both Google and the cell-phone data-set have what the other lacks: one has information on the existing connections, the other information on the usage of these connections. Neither has actual information on background and other characteristics of those who use these connections.
For sociological questions to be answered using data collected on the Internet, more is needed. Technological advances, including the increase in computational power, storage space and software development, allow for gathering increasingly detailed information on the users of the Internet. A very interesting development can be seen in applications of software that gather the visitor-statistics of Internet sites and weblogs.
A newcomer to this is Woopra, which combines a plugin for websites with software on the site administrator's computer, thereby giving the user an amazing amount of detailed information on the website's visitors. Visitors can even be tagged manually, or automatically when the visitor is a member of the visited blog or when she or he leaves a comment. Furthermore, it is known where visitors come from, what brought them to the blog (a link on another site, search engine, direct access), which pages the visitors looked at, and to where the visitors left if they exited through an external link. Of all this, a detailed history is recorded so that this information can be analyzed retrospectively.
Imagine the possibilities! With a little work, it should be possible to combine the information gathered on several blogs (users of Woopra are already able to share their stats pages with other Woopra users). If the number of participating blogs that share this information is large enough, it becomes possible to trace the paths of individual visitors throughout different blogs. This can be done based on tagged visitors, but also by matching the exit and entry information of visitors of different blogs.
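Matching exit and entry information could work roughly as in the sketch below. Everything about the data layout is an assumption made for illustration: the field names ('blog', 'exit_to_blog', 'referrer_blog', 'time'), the example host names and timestamps, and the 30-second window are all hypothetical, and this is not Woopra's actual export format. The idea is simply that an exit from blog A towards blog B, followed shortly by an entry on blog B referred by blog A, probably belongs to the same visitor.

```python
from datetime import datetime, timedelta

def match_exits_to_entries(exits, entries, max_gap=timedelta(seconds=30)):
    """Pair each exit with the earliest unused entry that plausibly continues it."""
    ordered_entries = sorted(entries, key=lambda record: record["time"])
    used = set()      # indices of entries that are already paired
    matches = []
    for exit_rec in sorted(exits, key=lambda record: record["time"]):
        for index, entry_rec in enumerate(ordered_entries):
            if index in used:
                continue
            gap = entry_rec["time"] - exit_rec["time"]
            # The hop must agree on both ends: where the visitor left from
            # and where the visitor arrived.
            same_hop = (exit_rec["blog"] == entry_rec["referrer_blog"]
                        and exit_rec["exit_to_blog"] == entry_rec["blog"])
            if same_hop and timedelta(0) <= gap <= max_gap:
                used.add(index)
                matches.append((exit_rec, entry_rec))
                break
    return matches

# Made-up records from two hypothetical blogs.
exits = [
    {"blog": "blog-a.example", "exit_to_blog": "blog-b.example",
     "time": datetime(2008, 6, 1, 12, 0, 0)},
]
entries = [
    {"blog": "blog-b.example", "referrer_blog": "blog-a.example",
     "time": datetime(2008, 6, 1, 12, 0, 5)},
    {"blog": "blog-b.example", "referrer_blog": "search.example",
     "time": datetime(2008, 6, 1, 12, 0, 6)},
]
matches = match_exits_to_entries(exits, entries)
print(len(matches))  # 1
```

A match of this kind is only a guess: on busy blogs several visitors may follow the same link within the same window, so tagged visitors would remain the more reliable basis.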
The basic starting point would be to investigate the number of blogs that are interconnected and the number of interconnected blogs needed to connect one blog to another. We would then arrive at the 'six blogs of separation' theorem. This is already being done by Google, but by using information gathered by Woopra-like systems, we can not only look at the existence of connections, but simultaneously at the usage of those connections as well. When we categorize blogs, which can be done manually or automatically based on keywords, we can investigate questions such as:
- Do types of blogs link to similar types of blogs?
- To what extent do visitors of one type of blog visit other types of blogs?
- What are the paths through different types of blogs that individuals follow?
For these last questions, information gathered by different blogs should indeed be merged.
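As an illustration of the first question, the sketch below counts links between categories of blogs and computes the share of links that stay within one category. The blog names, the categories and the links are invented, and the two function names are hypothetical; with real data the categories would come from the manual or keyword-based classification mentioned above.

```python
from collections import Counter

def link_counts_by_category(links, category_of):
    """Count links for every (source category, target category) pair."""
    counts = Counter()
    for source, target in links:
        counts[(category_of[source], category_of[target])] += 1
    return counts

def same_category_share(links, category_of):
    """Return the fraction of links that connect two blogs of the same category."""
    if not links:
        return None
    same = sum(1 for source, target in links
               if category_of[source] == category_of[target])
    return same / len(links)

# Invented example: four blogs in two categories, four links between them.
category_of = {"a": "politics", "b": "politics", "c": "cooking", "d": "cooking"}
links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]

counts = link_counts_by_category(links, category_of)
print(counts[("politics", "politics")])         # 2
print(same_category_share(links, category_of))  # 0.75
```

The same tabulation could be run on visitor paths instead of links, replacing each link by one observed move of a visitor from one blog to the next, which addresses the second and third questions.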
The possibilities are just starting to emerge here. Bloggers can generally be contacted easily via their blog. This would allow researchers to send links to web surveys to gain information on the background of the bloggers, and some other characteristics. Some of these bloggers, if they are willing to cooperate, could get the buzz going by writing about the survey, referring more and more people to it. This would open up an enormous number of new research questions:
- To what end do people visit blogs?
- What do people gain from it?
- Do people find a better job when they read (specific) blogs, and does this hold for different strata in society?
- Do people from different social groups get connected by reading each other's blogs, or are social cleavages represented in the blogosphere as well?
- What do people learn from other (groups of) people by reading their blogs?
We Can, But Should We?
It is, I hope, clear that the possibilities that are opened up by recent developments are enormous. However, with great possibilities come great responsibilities. Great care should be taken regarding the privacy of people. Is it acceptable to use information on an unwitting blog visitor in such an investigation?
The people from Woopra argue that it is up to the users of Woopra to think about to what end they will use the data they collect. A researcher should therefore take great care when using a research design such as the one introduced here. As long as all information is reported on anonymously, I think that a distinction can be made between describing the ways blogs connect to each other (including perhaps the aggregated use of these connections) and the disentangling of the use of these connections to the level of the individual visitor. Clearly, this can be discussed (yes, this is an invitation) and will require much more thought when this design is put to actual use. Aggregated web use has been analyzed ever since the coming of the web, and individual bloggers already know the numbers detailing their own visitors. But when individual people are related to those numbers, and when we investigate exactly 'who went where, when', the privacy of these visitors should be protected and their informed consent is required. Since the social researcher will need information on these people anyway (using the survey), the opportunity exists to ask for their informed consent.
A Dream Come True!
Clearly, a lot of work has to be done to ascertain the validity of this type of research. This blog post should not be read as a concrete research proposal, for many hurdles have to be overcome. I did not make a real effort to go into all the difficulties and problems of Internet-based surveying and sampling, but then again, that wasn't really the purpose of this contribution. What this contribution does show is how technological advances create the possibility of new and exciting research questions. Should anyone be willing to perform this type of research, please write me down as a co-author ;-). With a little work and thought, it might well prove to be a dream come true!
Rense Nieuwenhuis is an (aspiring) sociologist. His present interests lie in social inequality, mobility, social capital, statistics, methodology, and many other aspects of social science. On his blog 'Curving Normality' he writes on issues regarding social sciences in general, R-Project, the applied philosophy of science, and uncertainty associated with the scientific method. In many of his posts, he attempts to connect these issues to present societal issues.