A common data model for harmonization in the Nordic Pregnancy Drug Safety Studies (NorPreSS)

It is necessary to carry out large observational studies to generate robust evidence about the safety of drugs used during pregnancy. In the Nordic countries, nationwide population-based health registers that document all births and dispensed prescribed drugs are valuable resources for such studies. A common data model (CDM) is a data harmonization and structuring tool that enables a unified and streamlined analytic approach for studies including data from multiple countries or databases. We describe a CDM developed for the Nordic Pregnancy Drug Safety Studies (NorPreSS), including details on data sources and structure of the data tables. We also provide an overview of the advantages and disadvantages of the approach (e.g. sharing of data analysis programs versus extra initial work to create CDM datasets from raw data). This is an open access article distributed under the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


I. INTRODUCTION
Studies of drug safety in pregnancy typically involve both rare exposures and rare outcomes, so very large studies are needed to generate robust evidence. In the Nordic countries, nationwide population-based health registers that include all births, dispensed prescribed drugs, and diagnoses from specialist care are valuable resources for such studies. International or multi-database studies that aim to have a single protocol and analytic approach must account for source data heterogeneity. One way of doing this is by data harmonization. In this paper, we describe the common data model (CDM) approach to data harmonization that the Nordic Pregnancy Drug Safety Studies (NorPreSS) consortium, which started in 2017, carried out to facilitate studies based on data from five Nordic countries. In part I we explain what a CDM is and the rationale for using one. In part II we describe the data sources used in NorPreSS. In part III we describe how the NorPreSS CDM is designed and populated, by giving the structure of the various data tables. In part IV we sum up with pros and cons of the CDM approach.

What is a Common Data Model?
In a workshop report from a 2017 meeting at the European Medicines Agency, a CDM was defined as "…a mechanism by which raw data are standardised to a common structure, format and terminology independently from any particular study in order to allow a combined analysis across several databases/datasets" (EMA 2018). Gini et al. (2020) called this "a general CDM". A general CDM is made prior to and independent of any study protocol. The US FDA Sentinel CDM (https://www.sentinelinitiative.org/methods-data-tools/sentinel-common-data-model/) (Platt et al., 2018) and the Observational Medical Outcomes Partnership (OMOP) CDM (https://www.ohdsi.org/data-standardization/the-common-data-model/) (Kent et al., 2020) are examples of general CDMs. Both of these CDMs can be applied to pregnancy research (Matcho et al., 2018; Sentinel Operations Center 2019; https://www.sentinelinitiative.org/methods-data-tools/sentinel-common-data-model/mother-infant-linkage-table).
Our NorPreSS CDM is rather general but constructed with specific objectives in mind (to investigate immediate and long-term outcomes of drug use in pregnancy). It is a set of definitions for the structure of databases and data elements (basic units of information) that specifies how raw data elements in the existing source data are translated into identically structured tables.

Designing, populating, and applying the CDM
The workflow from local raw data files via CDM datasets to final study results in NorPreSS is depicted in Figure 1. The CDM is designed based on which data sources and variables are available in the different countries and necessary for the aims of the collaboration. The first step after the design is to transform the country-specific data into the CDM format and populate the CDM, i.e. create the actual CDM datasets. The next step is to apply the CDM to specific research questions by creating study-specific datasets according to study protocols. This typically involves mapping from CDM variables to concepts such as exposure, outcome, and confounder, and is sometimes aided by "concept dictionaries" (or preconfigured rules systems) (Schneeweiss et al., 2020). A drug concept dictionary could, for example, be a look-up table containing all drug codes that represent an exposure, outcome, or confounder. We have not applied concept dictionaries in the NorPreSS project due to the inherent homogeneity of coding systems in the Nordic data. Further, due to the diverse study aims within our collaboration, we chose to leave these concept specifications to each study. When the CDM datasets are transferred to a central location, they can be analyzed together. When it is not possible to combine the data in a central location, common analysis scripts can be distributed to generate country-specific estimates that are later combined for a final study result.
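Although NorPreSS did not use concept dictionaries, the idea of a drug look-up table can be illustrated with a minimal Python sketch. The dictionary contents, the concept labels, and the record layout below are hypothetical and chosen only for illustration; a real study would define them in its protocol.

```python
# Hypothetical drug concept dictionary: ATC code prefixes mapped to study concepts.
CONCEPT_DICTIONARY = {
    "N06A": "antidepressant",  # ATC group N06A: antidepressants
    "N03A": "antiepileptic",   # ATC group N03A: antiepileptics
}


def map_to_concept(atc_code, dictionary=CONCEPT_DICTIONARY):
    """Return the concept whose ATC prefix matches the code, or None."""
    # Try the longest prefixes first so that more specific entries win.
    for prefix in sorted(dictionary, key=len, reverse=True):
        if atc_code.startswith(prefix):
            return dictionary[prefix]
    return None


# Hypothetical dispensing records in a CDM-like layout.
dispensings = [
    {"person_id": "m1", "atc": "N06AB04"},
    {"person_id": "m2", "atc": "A10BA02"},
    {"person_id": "m3", "atc": "N03AX09"},
]
concepts = [map_to_concept(d["atc"]) for d in dispensings]
```

Keeping the mapping outside the CDM, as in this sketch, mirrors the choice described above of leaving concept specifications to each study.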

Rationale for creating a CDM
Transforming data from different countries or healthcare providers into a uniform data structure allows for harmonized protocols, shared analysis programs, common result templates and pooling of individual-level data (in cases where data are allowed to cross national borders). In the NorPreSS project, the Finnish, Icelandic, Norwegian and Swedish data are transformed locally according to the CDM and are transferred to and stored at the Norwegian Institute of Public Health, where only authorized persons who work on the project can access the data. This facilitates quality checks by allowing side-by-side comparison of the contents of the populated CDM datasets from the four countries, allows pooling of the individual-level data, and reduces analytic personnel time since one programmer can analyze data from four countries. Data transfer to Norway requires data transfer agreements between the local data controllers in Finland, Iceland and Sweden and the central data processor (Norway/the Norwegian Institute of Public Health). The Danish data are kept at Statistics Denmark due to legal regulations and are analyzed by Danish researchers. A common protocol and CDM yield uniform results tables that facilitate combining results by meta-analytic techniques, where aggregated data from different sites can be combined in a central place (Selmer et al., 2016). The NorPreSS CDM was developed by collaborators from Finland, Iceland, Norway, and Sweden, and later applied to the Danish data.
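As an illustration of how uniform results tables from two sites could be combined, the following Python sketch applies inverse-variance fixed-effect pooling to log odds ratios. This is one common meta-analytic technique and is an assumption here; the text does not state which method NorPreSS studies use. All estimates and standard errors are made-up numbers.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of (log estimate, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical site-level results: (log odds ratio, standard error).
site_results = [
    (math.log(1.20), 0.10),  # e.g. the pooled four-country dataset
    (math.log(1.10), 0.20),  # e.g. the Danish dataset
]
log_or, se = pool_fixed_effect(site_results)
# Approximate 95% confidence interval on the odds ratio scale.
ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
```

The more precise site receives the larger weight, so the pooled estimate lies closer to it.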

II. DATA SOURCES
Similar health and social registers exist in Denmark, Finland, Iceland, Norway, and Sweden, with some important differences, including variable names. To increase data privacy, some of the register holders provide the time since a reference date instead of actual calendar dates; this applies, for example, to the Norwegian linked register data when studies include the Norwegian Prescription Database. Accordingly, actual dates for the other countries were also converted to reference dates. Data from Finland were based on the Drugs and Pregnancy project, an ongoing study and infrastructure for drug safety studies in Finland with regular data updates for new

Medical birth registers (MBR)
Medical birth registers (MBRs) contain information on maternal, pregnancy, birth, and infant characteristics (Langhoff-Roos et al., 2014). All countries include live births and stillbirths from 22 weeks with a unique record for each child. The MBR in Norway additionally includes pregnancies from 12 weeks, including late pregnancy terminations, which require approval by a special medical assessment board. In Finland, terminations of pregnancy for fetal anomaly (TOPFAs) are available from the Register of Induced Abortions and the Register of Congenital Malformations.
Several maternal conditions (e.g., asthma, epilepsy, diabetes) are captured in the MBR in Norway in a series of binary variables based on check boxes from standard antenatal care forms. There is also space to record ICD codes for other conditions. The MBRs in Iceland and Sweden include ICD codes for maternal conditions. Since 2004, the Finnish MBR also records some maternal conditions. However, for the entire study period, information on several conditions, including epilepsy and asthma, was appended from the Special Refund Entitlement Register and the Care Register for Health Care in Finland.
Infant outcomes, particularly major congenital anomalies, are an important focus of NorPreSS studies and are recorded in the MBRs. Norway's MBR includes major anomalies identified up to one year after birth, with specific ICD-10 diagnoses as part of a complex string variable that indicates the source of information, code, and coding system. Iceland's MBR typically includes congenital anomalies diagnosed during the delivery hospitalization, and Sweden's MBR includes anomalies diagnosed within the first three months of life. Finland has a separate Register of Congenital Malformations with validated diagnoses, classified according to the Atlanta/CDC modification of ICD-9 for classification of major congenital anomalies (ICD-9A). In Denmark, infant outcomes and records of TOPFAs are identified from the Danish National Patient Register (Bliddal et al. 2018, Broe et al. 2020).

Prescribed drug registers (PDR)
Prescribed drug registers (PDRs) include all prescribed drugs dispensed from pharmacies (Furu et al., 2010). In Finland during the study period, this only included drugs eligible for reimbursement. However, in the future, it may be possible to consider all prescriptions and dispensations, regardless of reimbursement status (Aarnio et al. 2019). Each PDR includes a record for each drug product dispensed, including the dispensing date, ATC code, strength, quantity, and number of defined daily doses (DDDs) dispensed, or a Nordic Article Number from which this information can be obtained. None include the prescribed dose in a structured format. The Norwegian Prescription Database (NorPD) includes the indication for reimbursement of reimbursed prescriptions. Indications for drug reimbursement are also tied to specific prescriptions in Finland, but only when the dispensed drug has been reimbursed in a special reimbursement category for chronic illness. Codes for the indication for which a drug is prescribed are not available in the PDRs in Denmark, Iceland, or Sweden.

Primary and specialized healthcare registers
Each country has a national patient register (NPR) that includes visits to specialist healthcare services and hospitals. They record admission and discharge dates for inpatient stays, the main diagnosis for each hospitalization, and other secondary diagnoses. Outpatient visits in NPRs include the visit date and any diagnoses recorded in association with that visit. Diagnoses are given as ICD-10 codes and procedures as Nordic Medico-Statistical Committee (NOMESCO) Classification of Surgical Procedures (NCSP) codes. Unique to Denmark, outpatient diagnoses are tied to a contact that may span a long time period and cover several visits. In Iceland, dates were only given as month and year, to reduce the risk of identifying individual study subjects. However, admission length was provided for inpatients. In Iceland, data were also obtained from the Centre for Child Development and Behavior and the State Diagnostic and Counselling Centre, two outpatient specialist registers on child neurodevelopment.
Primary care registers were also available and included in the CDM for Norway (available from 2006 and more complete from 2008), Iceland, and Finland (available from 2011 and with more complete coverage from 2013). They include the visit date together with International Classification of Primary Care (ICPC) codes and some ICD codes (in Iceland, only ICD codes).

Cause of death registers and national statistics
Cause of death registers record the death date and underlying and contributing causes of death. The national statistical agencies in each country, e.g. Statistics Norway, collect similar information on migration in and out of the country, socioeconomic position indicators including educational attainment, national academic assessments, and sick leave. In Iceland, academic assessment data were provided by the Directorate of Education, rather than Statistics Iceland. In Finland and Sweden, sick leave data were provided by the Social Insurance agencies. We included those data in our CDM to define the study population or censor individuals, to adjust for confounding, or as outcomes (e.g. child academic performance) in specific NorPreSS studies.

III. DESIGNING THE CDM
The NorPreSS CDM is based on a CDM developed in the Cancer Risk and Insulin Analogues (CARING) project. The CARING CDM had three basic datasets: a study population dataset (one record per individual, with fixed person characteristics), a drug exposure dataset (one record per dispensed drug) and a clinical event dataset (one record per diagnosis or procedure). In NorPreSS, the unit of analysis is typically the pregnancy, requiring a pregnancy dataset linking each pregnancy to a mother and a child dataset linking each child/fetus to a pregnancy.
Researchers in each country create a set of identically structured data tables (Figure 2). Each file name ends in a 2-letter code indicating the country (_xx). The data tables are stored as CSV files that can be imported into any statistical software. The full CDM specification is provided as Supplementary Information.
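Because the tables are identically structured and differ only in the country suffix, pooling can in principle be done by stacking the files. The sketch below uses in-memory stand-ins for two files; the file stems, country codes, and column subset are illustrative assumptions, and the sketch is not the project's actual import code.

```python
import csv
import io

# Hypothetical in-memory stand-ins for two CDM files, e.g. preg_no.csv and preg_se.csv.
files = {
    "preg_no": "preg_id,mother_id\np1,m1\np2,m2\n",
    "preg_se": "preg_id,mother_id\np1,m9\n",
}


def pool_tables(named_files, table):
    """Stack identically structured country files for one CDM table."""
    pooled = []
    expected_columns = None
    for name, text in sorted(named_files.items()):
        stem, _, country = name.rpartition("_")
        if stem != table:
            continue
        reader = csv.DictReader(io.StringIO(text))
        # Refuse to pool files whose column layout deviates from the first one.
        if expected_columns is None:
            expected_columns = reader.fieldnames
        elif reader.fieldnames != expected_columns:
            raise ValueError(f"{name}: columns differ from the other countries")
        for row in reader:
            row["source_country"] = country  # record where each row came from
            pooled.append(row)
    return pooled


pooled_preg = pool_tables(files, "preg")
```

Tagging each row with its country keeps identifiers distinguishable after pooling, since the same preg_id value may occur in more than one country.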

Unique identifiers
Since the study population focuses on pregnancies rather than individuals, we use separate identifiers for the pregnancy (preg_id), the mother (mother_id), and the child (child_id). Information about the father/partner was not strictly necessary for our collaborative studies and was thus not available for every country. The preg_id variable was provided by the MBR in Iceland but is created by the researchers in other countries by grouping MBR records into distinct pregnancies. The mother and child IDs are encrypted IDs originally based on the personal identification numbers (PINs) assigned to all persons within the Nordic countries at birth or immigration. Because no PIN is registered for pregnancies ending in stillbirth or abortion, the researchers have to generate child IDs in these instances to include such pregnancies in our studies.
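The text does not describe the exact grouping rule used to create preg_id. As one plausible sketch, MBR child records from the same mother whose delivery dates lie within a few days of each other could be assigned to the same pregnancy. The 7-day window, the deliv_day variable (delivery date as days since a reference date), and the ID format are assumptions made for illustration.

```python
def assign_preg_ids(mbr_records, max_gap_days=7):
    """Give MBR child records of the same mother with close delivery dates one preg_id."""
    ordered = sorted(mbr_records, key=lambda r: (r["mother_id"], r["deliv_day"]))
    result = []
    counter = 0
    previous = None
    for rec in ordered:
        # Start a new pregnancy when the mother changes or the gap is too large.
        new_pregnancy = (
            previous is None
            or rec["mother_id"] != previous["mother_id"]
            or rec["deliv_day"] - previous["deliv_day"] > max_gap_days
        )
        if new_pregnancy:
            counter += 1
        result.append({**rec, "preg_id": f"preg{counter}"})
        previous = rec
    return result


# Hypothetical MBR records with one record per child.
records = [
    {"mother_id": "m1", "child_id": "c1", "deliv_day": 100},
    {"mother_id": "m1", "child_id": "c2", "deliv_day": 100},  # twin of c1
    {"mother_id": "m1", "child_id": "c3", "deliv_day": 900},  # later pregnancy
    {"mother_id": "m2", "child_id": "c4", "deliv_day": 100},
]
with_ids = assign_preg_ids(records)
```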

Pregnancy data
Maternal and pregnancy characteristics are primarily identified from the MBRs and are collected into the table preg_xx which includes the unique pregnancy ID (preg_id) and mother ID (mother_id

Event data
The event data are derived from drug dispensing records in PDRs, diagnoses and procedures in MBR, NPR, and primary care data, cause of death registers, and national statistics data. For the tables including event data for both mother and child, we use a generic personal identifier (person_id), which corresponds to the mother_id or child_id in the preg_xx and child_xx tables. If a person born in the study period also gives birth in the study period, the same person_id will occur both as a child_id and a mother_id. With future data management in mind, we split each dataset into one for children and one for mothers (e.g. drug_child_xx and drug_mother_xx), with no or very few individuals appearing in both. With the exception of cause of death, these tables may contain multiple records per person_id.
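The split into mother and child event tables can be sketched as a filter on person_id. The IDs and records below are hypothetical; the example includes one person who appears both as a child and as a mother, who therefore ends up in both tables.

```python
def split_events(events, mother_ids, child_ids):
    """Split a generic event table into mother and child tables by person_id."""
    mother_events = [e for e in events if e["person_id"] in mother_ids]
    child_events = [e for e in events if e["person_id"] in child_ids]
    return mother_events, child_events


# Hypothetical IDs: "p2" was born in the study period and later gave birth.
mother_ids = {"p1", "p2"}
child_ids = {"p2", "p3"}
drug = [
    {"person_id": "p1", "atc": "N06AB04"},
    {"person_id": "p2", "atc": "R03AC02"},
    {"person_id": "p3", "atc": "J01CA04"},
]
drug_mother, drug_child = split_events(drug, mother_ids, child_ids)
```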
The table drug_xx contains one record for each drug dispensed in the PDR. We include the variables available in every PDR and, additionally, variables for indications from Norway and Finland. We also include a variable to indicate the version of the ATC system that was used in the dataset. This is typically the year the data were delivered from the PDR. The WHO Collaborating Centre for Drug Statistics Methodology takes a conservative approach, with infrequent changes to existing ATC codes and DDDs. However, awareness of these changes and their potential impacts is important. With the ATC version in the drug dataset, ATC codes and DDDs can be standardized to the most current system if future updates are appended to existing data. For example, since the Finnish data are part of the ongoing Drugs and Pregnancy project, different versions of the ATC coding system exist in this dataset, which is regularly updated by appending new cross-sectional data extractions to existing project data.
The table hosp_xx contains one record for each hospital or specialist care contact, both inpatient and outpatient in the NPR. It provides a way to efficiently count the number of hospitalizations or outpatient specialist visits in a period of time.
We gather the diagnoses from multiple data sources into one data table, diagnoses_xx (Table 1). The date assigned to a diagnosis code varies according to the source of the diagnosis. For NPRs, the date corresponds to either the outpatient visit date or inpatient admission date. For Denmark, we create a record for an outpatient diagnosis for every visit within a contact, not just the first or last visit date. For MBRs and the Finnish Register of Congenital Malformations, we assign the delivery date (or procedure date for terminations) and for the Special Refund Entitlement Register the start date of the special reimbursement entitlement as date of diagnosis. For primary care, it is the visit date. We include an identifier for whether the diagnosis was the main diagnosis for an inpatient admission or other diagnosis. Data for all procedures are in a table, procedures_xx, which is prepared using almost the same method to assign dates, but including a precise procedure date when available.
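The source-dependent date rules described above can be expressed as a small look-up, as in the following sketch. The source labels and field names are hypothetical, and dates are shown as days since a reference date.

```python
# Which date field becomes the diagnosis date, by source (names are hypothetical).
DATE_FIELD_BY_SOURCE = {
    "npr_inpatient": "admission_date",
    "npr_outpatient": "visit_date",
    "primary_care": "visit_date",
    "mbr": "delivery_date",
    "malformation_register": "delivery_date",
    "special_refund": "entitlement_start_date",
}


def diagnosis_date(record):
    """Pick the date to store with a diagnosis code, following the rules in the text."""
    source = record["source"]
    # Terminations recorded in birth or malformation registers use the procedure date.
    if source in ("mbr", "malformation_register") and record.get("termination"):
        return record["procedure_date"]
    return record[DATE_FIELD_BY_SOURCE[source]]


examples = [
    {"source": "npr_inpatient", "admission_date": 120, "discharge_date": 125},
    {"source": "mbr", "delivery_date": 300},
    {"source": "malformation_register", "termination": True, "procedure_date": 210},
]
dates = [diagnosis_date(r) for r in examples]
```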
We create a dataset containing one record for each residence period per person in the source country, residence_xx, based on the national population register. It contains a unique ID for each residence period, person_id, a start and end date for the period, and the type of end: either end of data extraction, emigration, or death.
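One typical use of such a table is to check whether a person was resident for an entire observation window, for example when defining the study population. The sketch below assumes hypothetical variable names (period_id, start_day, end_day, end_type) and dates expressed as days since a reference date.

```python
def resident_throughout(periods, person_id, start_day, end_day):
    """True if a single residence period of the person covers [start_day, end_day]."""
    return any(
        p["person_id"] == person_id
        and p["start_day"] <= start_day
        and end_day <= p["end_day"]
        for p in periods
    )


# Hypothetical residence periods: the person emigrates and later returns.
periods = [
    {"period_id": 1, "person_id": "m1", "start_day": 0, "end_day": 400,
     "end_type": "emigration"},
    {"period_id": 2, "person_id": "m1", "start_day": 900, "end_day": 2000,
     "end_type": "end_of_data"},
]
covered = resident_throughout(periods, "m1", 100, 380)
interrupted = resident_throughout(periods, "m1", 300, 1000)
```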
The table death_xx is based on the cause of death register with one record per person_id. All countries could provide a main cause of death (recorded in the variable death_diag) and death date and some had additional contributing causes with potentially several diagnoses recorded in the same variable, death_othdiag.

Other data
For our planned studies, we defined several data tables for specific outcomes or covariates. A table for collecting information on maternal socioeconomic characteristics contains one record for each mother and year of delivery. We have so far only included highest level of education completed. Finally, to evaluate sick leave after pregnancy and child academic performance as study outcomes, we defined tables for these data with a structure that could be populated in the same way in all countries.

IV. CONCLUSIONS
Harmonizing data in a CDM facilitates international collaboration in studies where data contain roughly the same information but are organized differently in the various countries/databases. A CDM can lie anywhere between completely general and custom-built for one specific study protocol. The NorPreSS CDM is rather general but constructed with specific objectives in mind. Those objectives were to investigate immediate and long-term outcomes of drug use and discontinuation in pregnancy (e.g. congenital anomalies, neurodevelopmental disorders, maternal sick leave). Although extra work is required to create the harmonized CDM datasets from local raw data, this is outweighed by advantages at later stages, including:
• Fewer resources are needed for data management and analysis, since analysis scripts developed by one researcher can be applied to all countries' data
• Knowledge of the data increases before the analysis phase
• Transparency and consistency in data management and analysis improve
• Analyses are easy to expand, e.g. with sensitivity analyses, and new research questions that fit within the approved uses of the data are easy to address
The recent studies carried out within our collaboration before we had pooled and harmonized individual-level data were based on sharing aggregated data tables (Reutfors et al., 2020). This approach was labor intensive, requiring the analysis to be run in each country with a unique program adapted to the local data structure. We had to define the frequency dataset needed for all analyses up front, which limited our ability to change definitions and categorizations and to perform sensitivity analyses. Our current approach is more streamlined and flexible. We can develop the analysis scripts based on pooled data from four countries. Combining two instead of five datasets in the final step (Figure 1) reduces the chances of having zero or very few outcomes among the exposed in one of them.
Thus, for rare events, pooling individual data is preferable both for statistical and data privacy reasons (Selmer et al., 2016). One should, however, be aware that the process of transforming data from local raw data to CDM datasets may introduce errors that are not so easily detected. A comparison of the different CDM variables by country and other quality checks should be done to look for anomalies. The translation process may also result in large files and computing capacity issues. Another potential disadvantage is losing some granular information that is available in only some countries. For example, if age or education is grouped in broad categories in one or more countries, the CDM adopts that broad categorization for all countries. Replacing exact calendar dates with reference dates creates challenges in interpreting data on an absolute time scale. We thus built in variables that anchor the observations in calendar time, according to year/quarter of birth. Further, even though the indication for a specific prescription was only fully available in the Norwegian Prescription Database, we accommodated this in the CDM since it was valuable for our studies.
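The replacement of calendar dates by reference dates, together with a coarse calendar anchor, can be sketched as follows. The reference date and the example dates are hypothetical; the text does not state which reference date is used.

```python
from datetime import date


def to_reference_day(calendar_date, reference_date):
    """Express a calendar date as the number of days since a reference date."""
    return (calendar_date - reference_date).days


def birth_year_quarter(birth_date):
    """Coarse calendar anchor: (year, quarter) of birth."""
    return birth_date.year, (birth_date.month - 1) // 3 + 1


reference = date(2010, 1, 1)    # hypothetical reference date
birth = date(2012, 5, 17)       # hypothetical date of birth
dispensing = date(2012, 1, 10)  # hypothetical dispensing date

birth_day = to_reference_day(birth, reference)
dispensing_day = to_reference_day(dispensing, reference)
anchor = birth_year_quarter(birth)
```

Intervals between events are preserved under this transformation (birth_day minus dispensing_day equals the number of calendar days between the two dates), while the year/quarter anchor retains only coarse information about absolute calendar time.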
The NorPreSS CDM is being used for multiple studies that are in progress or under peer review (Cesta et al., 2020a; Halfdanarson et al., 2020; Kjerpeseth et al., 2020). It was used in one study published in 2020 assessing the risk of major anomalies in children with prenatal exposure to modafinil, a drug primarily indicated for narcolepsy (Cesta et al., 2020b). The study, based on data from two countries, went from concept to publication in 5 months, demonstrating our ability to rapidly assess pressing drug safety questions once the data have been harmonized in a CDM developed for such a purpose.

Funding
The study was partly supported by the NordForsk Nordic Program on Health and Welfare (Nordic Pregnancy Drug Safety Studies, project No. 83539).

Tables
Table S1. Pregnancy-related variables, preg_xx. The dataset contains one record per pregnancy resulting in a birth. A separate file, preg_xx_TOPFA, contains one record for each pregnancy with a termination of pregnancy for fetal anomaly and includes at minimum source_country, preg_id, mother_id, deliv_date, mother_age_cat, and others as available.
Apgar 1 minute NUM 0-10 Apgar apgar5