
Finding a leader in a crowd

This image shows the relationship between users (blue) and documents (red). The visualization was generated from a sample of 40,000 revisions of Wikipedia, only a portion of a larger analysis that resulted in a graph with 1,027,452 vertices and 10,652,296 edges. Image courtesy of David Braun, Purdue University.

Commentators suggest that Wikipedia and other collaborative network-driven projects, such as the Linux computer operating system, could be an emerging socioeconomic paradigm no less radical and disruptive than the Industrial Revolution.

Wikipedia is meant to be a new way of doing things in a world of ubiquitous electronic information and social networks, one that may be changing the conduct of everything from scientific research to political campaigns.

Sociological commentary and predictions aside, however, do Wikipedia and other “crowd-sourced” efforts really function so differently? Purdue communications researcher Sorin Adam Matei and his team are testing the concept by analyzing Wikipedia articles and revisions produced between 2001 and 2008 – a computationally demanding task. They’re finding that Wikipedia may have more in common with an old-fashioned factory floor than we think.

In theory, the collaborative online encyclopedia’s entries are created, edited and maintained with little in the way of traditional, hierarchical organizational and leadership structures. The production of Wikipedia has been characterized as an emergent system, like an ant colony, resulting from collective actions of individual participants with the “wisdom of the crowd” yielding a viable outcome.

Among other things, Matei’s team is looking at how article production is distributed across users and users’ contributions relative to each other over time. The research includes visualizations of patterns to make them easier to discern.

This visualization shows relationships between users who worked on a sample of the Wikipedia articles created between 2001 and 2008, highlighting their “betweenness centrality” — a measure of how often a user lies on the shortest paths connecting other users in the collaboration network. All of the articles are being examined in a Purdue study. Image courtesy of David Braun, Purdue University.
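The article doesn’t say which tool computed the centrality scores, but the standard method for unweighted graphs is Brandes’ algorithm. A minimal, self-contained sketch (a toy illustration, not the study’s actual pipeline):

```python
from collections import deque

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.
    `graph` maps each vertex to a list of its neighbors."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        # Stage 1: BFS from s, counting shortest paths (sigma).
        stack, queue = [], deque([s])
        pred = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0); sigma[s] = 1
        dist = dict.fromkeys(graph, -1); dist[s] = 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:                # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:     # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Stage 2: back-propagate pair dependencies in reverse BFS order.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair is counted in both directions, so halve the scores.
    return {v: c / 2 for v, c in bc.items()}

# A toy collaboration chain: "bob" and "carol" bridge the two ends.
g = {
    "alice": ["bob"], "bob": ["alice", "carol"],
    "carol": ["bob", "dave"], "dave": ["carol"],
}
print(betweenness_centrality(g))  # bob and carol score 2.0; the ends score 0.0
```

At Wikipedia scale, with over a million vertices and ten million edges, this all-pairs computation is exactly what makes the analysis computationally demanding.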

Early results suggest that Wikipedia isn’t as communal, egalitarian and free of division of labor as thought. Hierarchies featuring bosses and workers, elites and the not-so-elite, have developed. This may, in fact, be necessary when humans organize to produce something as complex as an encyclopedia, despite the essentially democratic nature of network technologies that can, theoretically, allow anyone to participate equally.

“We need to reconsider the way we think about these environments,” said Matei. “There’s a tendency for the collaboration to become centralized, to become dominated by specific voices, which leads to much more structure than we would imagine.”

Purdue is creating an online repository of Wikipedia data and an analysis tool available to any researcher. Matei already has developed a small application, called Visible Effort, for calculating entropy on websites using MediaWiki, the software behind Wikipedia and many other wiki sites.
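The article doesn’t give Visible Effort’s exact formula, but entropy measures of collaboration typically apply Shannon entropy to each user’s share of the edits on a page: low entropy means a few editors dominate, high entropy means effort is spread evenly. A hypothetical sketch of that idea:

```python
import math

def contribution_entropy(edit_counts):
    """Shannon entropy (in bits) of a page's per-user edit distribution.
    Returns 0.0 when a single user made every edit, and log2(n) when
    n users contributed exactly equally."""
    total = sum(edit_counts)
    shares = [c / total for c in edit_counts if c > 0]
    return -sum(p * math.log2(p) for p in shares)

print(contribution_entropy([25, 25, 25, 25]))  # 2.0: four equal contributors
print(contribution_entropy([97, 1, 1, 1]))     # low: one dominant editor
```

Tracking a value like this per article over time is one way to see centralization emerge: as a few voices come to dominate an article, its entropy falls.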

David Braun, a research computing specialist based at Purdue’s Rosen Center for Advanced Computing, is working with Matei and his team to help enable their research. Braun drew on a variety of resources to make their research possible.

The computations involved in studying the Wikipedia network require no communication between compute nodes, so the workload can tolerate high latency between machines, making it well-suited to a grid. Braun began the process with DAGMan, a meta-scheduler for Condor.

“The way I structured it was I structured it by articles,” Braun explained. “So, each Condor job was based on a set of articles out of the wiki.”
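A workflow like this, where independent jobs each process one batch of articles, can be described to DAGMan with a short DAG file pointing at a shared Condor submit description. The file names and arguments below are illustrative, not the project’s actual setup:

```
# analysis.dag -- one independent Condor job per batch of articles
JOB  batch0  process_articles.sub
JOB  batch1  process_articles.sub
VARS batch0  articles="articles_0000.xml"
VARS batch1  articles="articles_0001.xml"

# process_articles.sub -- the shared submit description, roughly:
#   executable = analyze_revisions
#   arguments  = $(articles)
#   queue
```

Because no job depends on another, DAGMan can dispatch every batch concurrently, letting Condor soak up whatever opportunistic cores the pool has free.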

Braun used TeraGrid to submit the jobs to the Purdue Condor pool, an opportunistic computational resource that can also be accessed through Open Science Grid and DiaGrid. To process the entirety of Wikipedia’s articles and their history of revisions, they accessed approximately 200 cores for 24 hours, processing about four terabytes of information.

Visualizations of the results (see images above) were generated using the network visualization program Gephi. But those visualizations are incomplete.

“That was just a subset of nodes,” Braun said. “Imagine what that’s going to be like when we scale it up.”

—Greg Kline, Information Technology at Purdue, with edits by Miriam Boon, iSGTW


