The exponential growth of biological network database has increasingly rendered the global network similarity search (NSS) computationally intensive. Given a query network and a network database, it aims to find out the top similar networks in the database against the query network based on a topological similarity measure of interest. With the advent of big network data, the existing search methods may become unsuitable since some of them could render queries unsuccessful by returning empty answers or arbitrary query restrictions. Therefore, the design of NSS algorithm remains challenging under the dilemma between accuracy and efficiency.Results:
We propose a global NSS method based on regression, denotated as NSSRF, which boosts the search speed without any significant sacrifice in practical performance. As motivated from the nature, subgraph signatures are heavily involved. Two phases are proposed in NSSRF: offline model building phase and similarity query phase. In the offline model building phase, the subgraph signatures and cosine similarity scores are used for efficient random forest regression (RFR) model training. In the similarity query phase, the trained regression model is queried to return similar networks. We have extensively validated NSSRF on biological pathways and molecular structures; NSSRF demonstrates competitive performance over the state-of-the-arts. Remarkably, NSSRF works especially well for large networks, which indicates that the proposed approach can be promising in the era of big data. Case studies have proven the efficiencies and uniqueness of NSSRF which could be missed by the existing state-of-the-arts.Availability and Implementation:
The source code of two versions of NSSRF are freely available for downloading at https://github.com/zhangjiaobxy/nssrfBinary and https://github.com/zhangjiaobxy/nssrfPackage.Contact:
Supplementary data are available at Bioinformatics online.