An extensible toolkit for community detection in networks

Biological interaction networks are often organized into groups (also called clusters, modules, or communities) of related genes and proteins carrying out specific biological functions. Community detection has numerous applications for systems that can be described as graphs, for example metabolic, neural, social (e.g. Facebook) and technological networks.

Jmod is an open-source Java library that can be easily integrated in third-party software applications to perform module detection in networks. It provides a repository of community detection methods - including a novel methods which we have developed - which is aimed to be further extended. Jmod is also available as a standalone application with both graphical and commandline user interfaces.

The second goal of this project is to provide an intuitive and complete environment for developing novel community detection methods. Jmod implements several benchmarks and metrics for evaluating the performance of these methods. A variety of additional tools allow researchers to focus on the development of novel methods and spend less time on common aspects of community detection (e.g., reading network structures, implementing standard metrics, etc.).

One of the latest features enables to take snapshots of the community structure during community detection (see video below). This feature has proven that it can provides valuable insights into the behavior of the methods.

Jmod has been developed during my PhD thesis. T Schaffter, From genes to organisms: Bioinformatics System Models and Software, 2014.

Community detection methods

The repository of methods implemented in Jmod includes the following algorithms:

  • Newman's spectral algorithm
  • Genetic algorithm-based method
  • Brute force approach
  • Moving vertex method (MVM)
  • Global moving vertex method (gMVM)

MVM and gMVM are refinement techniques that can be used to further improve the performance of existing community detection methods.

In order to apply the above method to a large range of networks, the following network parsers are available in Jmod:

  • TSV format (tab-separated values)
  • GML format
  • DOT format (Graphviz)
  • NET format (Pajek)

Each of these formats is described in detail here.

Benchmarks for profiling community detection algorithms

The performance of community detection methods is profiled using different benchmark graphs, which can be generated using Jmod. The evaluation of inference methods usually requires the generation of hundreds of benchmark graphs. Every benchmark generators implemented in Jmod benefit from multiple processors.

Benchmark generators implemented:

Detecting functional modules in protein interaction maps

Here we applied the improved version of Newman's algorithm to identify the functional modules in the largest connected component of the Drosophila Protein Interaction Map (Guruharsha et al., 2011) which includes 1817 nodes and 10522 interactions. We identified fifty-seven modules using the GA-based community detection method that we have developed. Nodes are painted in different colors dependent on which module they have been predicted to belong to.

As an example, all thirty-one proteins that are known to be part of the Snap/SNARE complex have been identified. In addition, the community detection method applied suggests that two additional proteins (CG7133 and Sgt) also participate to this complex.

This largest connected component can be downloaded here.

Network datasets

  • Mark Newman's repository: includes graphs of many different types (biological, neural, social, technological, etc.) in GML format.
  • Stanford Large Network Dataset Collection: a substantial collection of data sets describing very large networks, including social networks, communications networks, and transportation networks.
  • Laszlo Barabasi's datasets: data compiled by Prof. Albert-Laszlo Barabasi and collaborators at the University of Notre Dame, including web data and biochemical networks.
  • Linton Freeman's datasets: data compiled by Prof. Linton Freeman at University of California. The dataset includes small (~10 nodes) and much larger cognitive networks.
  • Alex Arenas's datasets: data compiled by Prof. Alexandre Arenas and collaborators at Universidad Rovira i Virgili, including metabolic network data and the network from their study of the collaboration patterns of jazz musicians.

Do you know another one?

Copyright © 2019  Thomas Schaffter          Website: Thomas Schaffter