Massive numbers of electronic health records (EHRs) are currently being collected globally, including structured data in the form of diagnoses, medications, and laboratory test results, and unstructured data contained in clinical narratives. This opens unprecedented possibilities for research and ultimately patient care. However, actual use of these databases in a multi-center study is severely hampered by a variety of challenges; for example, each database has a different structure and uses different terminology systems. In an ideal world, a harmonized approach would be available by which data and results from different databases could be combined to answer a specific research question. Standardized data models and common analytical tools should become a de facto standard. Any multi-database study should be fully transparent and thus reproducible following a pre-defined protocol. Observational Health Data Sciences and Informatics (OHDSI, see http://www.ohdsi.org) is an international collaborative that holds the promise of making this EHR-based big data analysis a reality [1]. The OHDSI network contains longitudinal data on over 600 million patients observed for multiple years, comprising over 5 billion clinical observations. The data is stored in a common data model (CDM) that aims to achieve both syntactic and semantic interoperability. Syntactic interoperability refers to the common underlying data structure, that is, the grammar in which data is described, which enables standardized extraction tools. Even when the same grammar is used, however, the meaning might differ. Semantic interoperability refers to the common understanding that is required to interchange information, i.e., all the data sources are mapped to a standardized terminology system. Processing of personal data is legitimate for scientific purposes if adequate safeguards are provided and followed. 
Therefore, retaining control is of paramount importance to the data providers if long-term sustainability and maximum impact are to be achieved. To comply with the unavoidable local governance and privacy constraints, a distributed network approach is followed in which the CDMs are analyzed locally and aggregated results are shared centrally. A good example of this approach is the recently published paper on characterizing treatment pathways at scale using the OHDSI network [2]. The aim of this paper was to assess the diversity of populations and the variance in care for type 2 diabetes, depression, and hypertension. The treatment pathways revealed that the world is moving toward more consistent therapy but that significant heterogeneity remains among practice types and nations. Results showed that almost 25% of hypertension patients followed a treatment pathway that was unique within the cohort. Most important from a methodological perspective, this study demonstrated the power of a data network in which standardization is one of the primary objectives. Within the active OHDSI community, several working groups have been formed that focus on a wide range of research topics. The goal of the patient-level prediction workgroup is to establish a standardized process for developing accurate and well-calibrated patient-centered predictive models. These models should make predictions for multiple outcomes that are of interest to patients and be applicable to observational healthcare data from any patient subpopulation of interest. This should enable large-scale comparisons of methods and modelling techniques, for large sets of outcomes, utilizing the full EHR. 
Effective exploitation of the massive sets of health data demands novel methodology and an interdisciplinary approach, for several reasons: longitudinal data is by nature sparse and irregularly spaced; dimensionality reduction methods need to be assessed and developed to deal with the massive amount of data; leveraging the temporal information in the EHR could improve the performance of prediction models; and purely data-driven approaches risk producing incomprehensible and suboptimal models because they completely ignore the background knowledge already available about the outcome and its etiology. The goal of the Patient-Level Prediction workgroup is to research and develop solutions for patient-level prediction modelling in massive sets of real-world EHR data, considering the challenges described above. Transparency and proper validation are clear requirements for building clinically relevant prediction models [3]. Fortunately, the OHDSI data network facilitates external validation in the many databases that are standardized to the CDM. The workgroup has built a “PatientLevelPrediction” R package that runs against the CDM. The package provides a framework for the generation and evaluation of a diverse set of models, e.g., logistic regression models and random forests, for any type of outcome in a pre-defined patient cohort. The invited talk will focus on the challenges of creating the global OHDSI distributed EHR network and will demonstrate its power in the context of treatment pathway characterization and patient-level predictive modelling.