The BIG Problem with Big Data

Over the past month, three undergraduate students in the Fields Undergraduate Research Program have been studying how information (news articles, videos, photos, etc.) propagates through the online social network Digg.com: Mark Freeman (Harvard University), James McVittie (University of Toronto) and Iryna Sivak (Taras Shevchenko National University), working under the supervision of Professor Jianhong Wu (York University). A model of interest was first proposed and then used to predict the theoretical behaviour of the Digg network as well as the propagation of information from user to user. As in most applied modelling projects, data was then needed to check whether the model assumptions were reasonable and to validate the prediction accuracy of the model. The problem that arose was the sheer amount of data available.

In most cases, statisticians and applied mathematicians hope for a large data set in order to minimize the variance of predictions and to make inferences from large samples; however, over the past couple of years, the amount of data available in fields ranging from genetics to the modelling of social networks has become staggering. For this particular project, the data was gathered from the website of Kristina Lerman (University of Southern California) (link provided below) in two separate files: digg_votes and digg_friends. The digg_friends file contains over 1.7 million entries identifying which of the 71,367 users are connected with which other users, the time at which each connection was made and the type of connection (directed or mutual). The digg_votes file contains over 3 million entries identifying which users “digged” (voted for) each particular story and the time at which each vote was made. The two datasets then had to be combined to determine which user(s) began the propagation of each piece of information (i.e. the source), the number of steps each voting user is away from the source, and the time difference between the source’s voting time and each user’s voting time, before finally running a least-squares optimization of the model over the voting times; a sketch of this pipeline is given below. Thus, it may appear that dealing with data at this scale is a clumsy and time-consuming procedure, but this is not always the case.
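To make the compilation step concrete, the snippet below sketches the pipeline in pandas and SciPy under loudly stated assumptions: the file name and column layout for digg_votes are guesses at the download's format, the logistic curve stands in for the project's actual propagation model, and the hop-distance calculation over digg_friends (a breadth-first search in the friendship graph) is omitted for brevity.

```python
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit

# Assumed column layout for digg_votes; adjust to match the downloaded file.
votes = pd.read_csv("digg_votes.csv", names=["vote_time", "voter_id", "story_id"])

# Treat the earliest voter on each story as the source of that cascade.
sources = (votes.sort_values("vote_time")
                .groupby("story_id").first().reset_index()
                .rename(columns={"vote_time": "source_time", "voter_id": "source_id"}))
votes = votes.merge(sources, on="story_id")

# Delay between the source's vote and every later vote on the same story.
votes["delay"] = votes["vote_time"] - votes["source_time"]

# Illustrative cumulative-vote model N(t) = K / (1 + a * exp(-b t)); the
# project's actual propagation model is not reproduced here.
def logistic(t, K, a, b):
    return K / (1.0 + a * np.exp(-b * t))

# Least-squares fit of the model parameters to one story's voting times.
story = votes[votes["story_id"] == votes["story_id"].iloc[0]].sort_values("delay")
t = story["delay"].to_numpy(dtype=float)
n = np.arange(1, len(t) + 1, dtype=float)
params, _ = curve_fit(logistic, t, n, p0=[n[-1], 1.0, 1e-4], maxfev=10000)
print("fitted K, a, b:", params)
```

In the real dataset the same fit would be repeated over thousands of stories, which is exactly where the computational methods described next become useful.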

Recent breakthroughs in computer science and computational statistics have produced efficient ways of dealing with data containing millions of entries. For example, with the use of the GPU (Graphics Processing Unit), matrix operations and calculations can be completed orders of magnitude faster than on the CPU alone. That is, rather than performing a single calculation very quickly as the CPU does, the GPU carries out many computations simultaneously, each at a slower rate. Additionally, with the use of parallel programming, a computer can distribute work across a network of machines, allowing multiple processes to run at the same time over many CPUs. Therefore, with these and other specialized methods, datasets with millions of entries can be analyzed thoroughly and results obtained efficiently over short periods of time; a small CPU-versus-GPU timing sketch follows.
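As a rough illustration of the GPU point, the sketch below times the same matrix product on the CPU with NumPy and, where a CUDA-capable card and the CuPy library are available, on the GPU; the matrix size and the choice of CuPy are assumptions made for illustration, not tools used in the project.

```python
import time
import numpy as np

# Plain CPU matrix multiply with NumPy.
n = 4000
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.time()
c_cpu = a @ b
print("CPU time: %.3f s" % (time.time() - start))

# The same product on the GPU with CuPy, if a CUDA-capable card is available.
try:
    import cupy as cp
    a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)
    start = time.time()
    c_gpu = a_gpu @ b_gpu
    cp.cuda.Stream.null.synchronize()  # wait for the GPU kernel to finish
    print("GPU time: %.3f s" % (time.time() - start))
except ImportError:
    print("CuPy not installed; skipping the GPU comparison")
```

The parallel-programming idea is analogous: the work is split into independent pieces that run at the same time on many CPUs or machines.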

From Left to Right: Iryna Sivak, Mark Freeman, James McVittie

Digg Datasets:
http://www.isi.edu/integration/people/lerman/load.html?src=http://www.isi.edu/~lerman/downloads/digg2009.html

 
