A pediatric ECG database with disease diagnosis covering 11643 children

Data acquisition

This study was reviewed and approved by the Ethics Committee for Scientific Research and Clinical Trials at the First Affiliated Hospital of Zhengzhou University (2024-KY-0221-003 and 2025-KY-0369-001). A waiver of signed informed consent was granted, and the release of desensitized ECG data for public scientific research was approved.

The raw ECG data of the ZZU pECG dataset come from children hospitalized at the First Affiliated Hospital of Zhengzhou University from January 2018 to May 2024. The ECG acquisition device is the MedEx MECG-300, with a 24-bit A/D converter, amplitudes recorded in mV, an acquisition time of 5-120 seconds, and a sampling frequency of 500 Hz. To collect ECG data, the physician first issues an order for an ECG examination of the hospitalized child, then connects the leads to the child’s body, completes data collection after a period of time, and uploads the recording to the MedEx ECG system. A junior physician in the electrocardiographic room then makes a conclusive diagnosis, which is reviewed by a senior physician. If the two agree, the consensus is taken as the final diagnosis; if they disagree, the reviewing physician reports the case and discusses it before the final decision is made.

To focus the dataset on pediatric ECG and pediatric cardiovascular diseases, we consulted doctors with senior professional titles in cardiovascular medicine, cardiovascular surgery, pediatric medicine, pediatric surgery, pediatric intensive care, and the electrocardiographic room, and set the following rules:

1. Age. In this study, pediatric subjects are defined as those aged 14 years or below. We also provide the original age of every child in days. For the conversions and statistics in this manuscript, one year is defined as 365 days and one month as 30 days; the age in days remains the most accurate unit.

3. ECG leads. It is often impractical to place all six chest leads on a very young child’s chest because of limited cooperation and incomplete physical development³¹. Based on the advice of pediatric cardiologists, children aged 7 years and above should have a complete 12-lead ECG (I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6). For children under the age of 7, ECGs lacking leads V2, V4, and V6 are also accepted, with the chest leads represented by V1, V3, and V5. The ZZU pECG dataset contains 1856 ECG records with 9 leads.

Table 2 Overview of disease type and category.

4. Disease diagnosis labels. The original medical records contain hundreds of diseases. We focus on the cardiovascular disease labels in Table 2 and keep the other disease labels without counting them separately. The ICD codes for the other diseases can be obtained from the attribute dictionary file and the ICD-10 version²⁸.
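The age and lead rules above can be expressed as a short sketch. The function names are hypothetical, and the threshold for "under 7 years" is assumed to be 7 × 365 days, following the 365-day convention stated in the age rule; the dataset itself stores age in days.

```python
DAYS_PER_YEAR = 365   # conversion convention stated in the text
DAYS_PER_MONTH = 30   # conversion convention stated in the text

LIMB_LEADS = ["I", "II", "III", "aVR", "aVL", "aVF"]
FULL_12 = LIMB_LEADS + ["V1", "V2", "V3", "V4", "V5", "V6"]
REDUCED_9 = LIMB_LEADS + ["V1", "V3", "V5"]


def age_in_years(age_days):
    """Convert an age in days to years using the 365-day convention."""
    return age_days / DAYS_PER_YEAR


def age_in_months(age_days):
    """Convert an age in days to months using the 30-day convention."""
    return age_days / DAYS_PER_MONTH


def leads_acceptable(age_days, leads):
    """Return True if the lead set satisfies the age-dependent rule.

    A complete 12-lead ECG is always acceptable. A 9-lead ECG
    (V2, V4, V6 missing) is acceptable only for children under 7 years
    (assumed here to mean fewer than 7 * 365 days).
    """
    present = set(leads)
    if present >= set(FULL_12):
        return True
    return age_days < 7 * DAYS_PER_YEAR and present >= set(REDUCED_9)
```

For example, `leads_acceptable(400, REDUCED_9)` accepts a 9-lead record from a child of about 13 months, while the same lead set is rejected for a child of 8 years.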

The data acquisition process is as follows.

1. Retrieve all hospitalized cases of children aged 0-14 from January 2018 to June 2024 using the Neusoft hospital information system. Each inpatient case contains the main medical information of the patient during the hospitalization period. The patient ID, date of birth, gender, admission date, discharge date, and disease diagnosis (excluding outpatient diagnosis) are extracted from the case and exported as an Excel file.

2. Based on the patient ID in the Excel file, retrieve all ECG records from the MedEx ECG system and export them as XML files. Each file contains the patient ID, ECG values, ECG acquisition date, and ECG diagnostic statements.

3. Design a tool to extract the information other than ECG values from the XML files and convert it into CSV files.

4. Match the patient IDs in the Excel and CSV files, remove ECG records whose acquisition date falls outside the admission and discharge dates of the same patient, and ensure that all remaining ECG records come from the patient’s hospitalization period.

5. According to the disease diagnosis in the Excel file, screen and retain all cases of myocarditis, cardiomyopathy, Kawasaki disease, and congenital heart disease. The specific disease information is shown in Table 2.

6. Extract the date of birth from the Excel file and the ECG acquisition date from the XML file, and calculate the patient’s age in days. The advantage is that date arithmetic in the Python library has built-in leap year logic, so the calculated age is as accurate as possible.

7. Through the above steps, we collected all pediatric ECG records that meet the requirements, together with the patient ID, gender, age, acquisition date, ECG diagnostic statement, and disease diagnosis. Finally, we selected 3516 ECG records with cardiovascular disease labels and 10674 ECG records with other disease labels.
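The matching in step 4 can be sketched as follows. This is a minimal illustration, not the authors' tool: the record layout, the field names, and the choice to treat the admission and discharge dates as inclusive bounds are all assumptions.

```python
from datetime import date


def within_any_stay(acquisition_date, stays):
    """True if the date falls inside at least one (admission, discharge) pair.

    Both bounds are treated as inclusive (an assumption).
    """
    return any(adm <= acquisition_date <= dis for adm, dis in stays)


def keep_inpatient_ecgs(ecg_records, admissions):
    """Keep only ECG records acquired during one of the patient's stays.

    ecg_records: list of dicts with hypothetical keys
                 "patient_id" and "acquisition_date".
    admissions:  dict mapping patient_id to a list of
                 (admission_date, discharge_date) tuples.
    """
    kept = []
    for rec in ecg_records:
        stays = admissions.get(rec["patient_id"], [])
        if within_any_stay(rec["acquisition_date"], stays):
            kept.append(rec)
    return kept


# Illustrative data only; not taken from the dataset.
admissions = {"A1": [(date(2020, 3, 1), date(2020, 3, 10))]}
ecgs = [
    {"patient_id": "A1", "acquisition_date": date(2020, 3, 5)},   # inside the stay
    {"patient_id": "A1", "acquisition_date": date(2020, 4, 2)},   # after discharge
    {"patient_id": "B2", "acquisition_date": date(2020, 3, 5)},   # no inpatient case
]
inpatient_ecgs = keep_inpatient_ecgs(ecgs, admissions)
```

Records from patients without a matching inpatient case are dropped as well, since they have no stay to fall into.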

The MedEx ECG system also records the patient’s age; however, that age is based on the admission date, not the ECG acquisition date. The younger the patient, the larger the relative error this introduces. It is therefore most reasonable and accurate to calculate the patient’s age from the date of birth on the medical record homepage and the acquisition date of the ECG record.
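As a minimal sketch of this calculation, Python's standard `datetime` module subtracts two dates and returns the exact number of elapsed days, leap days included. The function name and the example dates are illustrative assumptions.

```python
from datetime import date


def age_in_days(date_of_birth, acquisition_date):
    """Age at ECG acquisition, in whole days."""
    if acquisition_date < date_of_birth:
        raise ValueError("acquisition date precedes date of birth")
    return (acquisition_date - date_of_birth).days


# February has 29 days in 2020 (a leap year) and 28 days in 2019.
leap = age_in_days(date(2020, 2, 1), date(2020, 3, 1))      # 29
non_leap = age_in_days(date(2019, 2, 1), date(2019, 3, 1))  # 28

# An age taken at admission understates the age at acquisition
# by the number of days between the two events.
born = date(2020, 2, 1)
admitted = date(2020, 3, 1)
acquired = date(2020, 3, 11)
gap = age_in_days(born, acquired) - age_in_days(born, admitted)  # 10
```

For the infant in the last example, a 10-day gap is roughly a quarter of the age at acquisition, which is why the discrepancy matters most for the youngest patients.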

Data processing

In real-world data collection, noise such as Gaussian noise, baseline wander, power line interference, and muscle artefacts corrupts the ECG waveform throughout acquisition and transmission³². The MedEx MECG-300 machine used in this study has a built-in filtering function, and experienced physicians or nurses collect the ECG signals while the patients are in a resting state. The quality of the resulting ECG signals is generally good and meets the requirements for use, without the need for additional filtering and denoising. Only ECG records with lead detachment, or lacking key information such as patient ID, age, diagnostic statements, or ECG acquisition date, need to be excluded. The diagnostic statements of the original ECGs are in Chinese and were entered manually by doctors in the electrocardiographic room. Because wording varies between doctors, conclusions of the same category can be phrased very differently and require conversion to a unified coding. The specific steps are as follows:

1. Organize all diagnostic statements and codes in the AHA and CHN standards.

2. With the help of a physician, collect as many brief and colloquial variants of these diagnostic statements as possible.

3. Based on the statements obtained in the previous step, design a tool that uses wildcard characters to extract key characters and converts all ECG diagnostic statements into the corresponding diagnostic statements and codes of the AHA and CHN standards. Diagnostic statements that have no corresponding code in the AHA and CHN standards are retained in text form.

4. Two cardiologists checked the diagnostic statements of all ECGs. ECG records that were converted correctly and met the AHA or CHN standards were retained. ECG records whose diagnostic statements were converted incorrectly were corrected to comply with the AHA or CHN standards. ECG records that could not be corrected, or that met neither the AHA nor the CHN standard after correction, were discussed and given a secondary diagnosis, followed by a comprehensive judgment on whether to retain the record.
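The wildcard-based conversion in step 3 might look like the sketch below. It is an illustration under stated assumptions, not the authors' tool: the three rules, the Chinese variants, and the `CODE_...` placeholders are hypothetical and do not reproduce actual AHA or CHN codes. Matching uses `fnmatchcase` from Python's standard library.

```python
from fnmatch import fnmatchcase

# (wildcard pattern, standardized statement, placeholder code)
# All entries are hypothetical examples, not the real mapping table.
RULES = [
    ("*窦性*心动过速*", "sinus tachycardia", "CODE_SINUS_TACHY"),
    ("*窦速*", "sinus tachycardia", "CODE_SINUS_TACHY"),  # colloquial short form
    ("*窦性心律*", "sinus rhythm", "CODE_SINUS_RHYTHM"),
]


def convert_statement(raw):
    """Map a free-text diagnostic statement to standardized codes.

    Returns every distinct code whose pattern matches. If nothing
    matches, the original text is retained, as described in step 3.
    """
    text = raw.strip()
    matches = []
    seen = set()
    for pattern, statement, code in RULES:
        if code not in seen and fnmatchcase(text, pattern):
            matches.append({"statement": statement, "code": code})
            seen.add(code)
    if matches:
        return {"raw": raw, "matches": matches, "retained_text": None}
    return {"raw": raw, "matches": [], "retained_text": text}
```

A purely pattern-based approach can over-match when one statement contains another as a substring, which is one reason the manual review in step 4 remains necessary.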

The AHA standard has six parts: primary diagnostic terms, secondary diagnostic terms, modifying vocabulary, concise comparative terms, specifications for the individual and combined application of the above, and commonly used combination terms. The core part is the primary diagnostic terminology. Unlike the AHA standard, the CHN standard has only primary diagnostic statements, so its comprehensiveness and scope of application are narrower than those of the AHA standard, and it serves only as a supplement to the ECG diagnostic statements for the Eastern population. During the encoding conversion, three types of codes from the AHA standard were used: primary diagnostic statements (nondescriptive statements that convey clinical meaning without additional statements); secondary diagnostic statements (additional statements that can be used to expand the specificity and clinical relevance of both descriptive and other primary diagnostic statements, divided into two groups, “suggests” and “consider”); and modifiers (which do not change the meaning of the core statement but refine it). The conversion to the CHN standard uses only the primary diagnostic statements.

To protect patient privacy, we desensitized the data and regenerated the patient IDs, ensuring that each new ID remains consistent with the patient’s ECG records, disease diagnosis, and other information. We also applied a random integer offset (the specific numerical range is not disclosed) to the acquisition date. If a patient was hospitalized multiple times with ECG acquisition, or multiple ECG records were collected during one hospitalization, the order of the ECG examinations remains unchanged.
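One way such a desensitization step could work is sketched below. The actual procedure and offset range are not disclosed, so everything here is an assumption: the offset is drawn once per patient (which keeps the order and spacing of that patient's ECG records), `max_offset_days` is a hypothetical parameter, and the new ID format is invented for illustration.

```python
import random
from datetime import date, timedelta


def desensitize(records, max_offset_days=30, seed=None):
    """Replace patient IDs and shift acquisition dates.

    records: list of dicts with hypothetical keys
             "patient_id" and "acquisition_date".
    One random day offset is drawn per patient, so the relative
    order of that patient's records is preserved.
    """
    rng = random.Random(seed)
    new_ids = {}
    offsets = {}
    out = []
    for rec in records:
        pid = rec["patient_id"]
        if pid not in new_ids:
            new_ids[pid] = "P{:05d}".format(len(new_ids) + 1)
            offsets[pid] = rng.randint(-max_offset_days, max_offset_days)
        shifted = rec["acquisition_date"] + timedelta(days=offsets[pid])
        out.append({**rec, "patient_id": new_ids[pid], "acquisition_date": shifted})
    return out


# Illustrative data only.
raw = [
    {"patient_id": "H123", "acquisition_date": date(2021, 5, 1)},
    {"patient_id": "H123", "acquisition_date": date(2021, 5, 4)},
    {"patient_id": "H999", "acquisition_date": date(2022, 1, 9)},
]
clean = desensitize(raw, seed=0)
```

In practice the new IDs should not be assigned in an order that reveals the original IDs; shuffling the patients before assignment would be one option.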

The disease diagnosis labels of hospitalized children are obtained from the medical record homepage; this information has already been reviewed and does not require secondary verification. We converted all disease diagnoses to ICD-10 codes under the guidance of experienced clinical doctors and disease coding personnel, based on the ICD-10 version: 2019 released by WHO²⁸. Because ICD-10 codes are not granular enough to give every disease a unique label, some cardiovascular disease codes in Table 2 are duplicated, and we added unique identifiers to distinguish them. For example, the ICD codes for Fulminant Myocarditis and Viral Myocarditis are (F) I40.0 and (V) I40.0, respectively.

After the above steps, 14190 ECG records containing patient information and disease diagnosis were finally obtained. Using the ECG record as the statistical unit, the numbers of ECG records with cardiovascular disease and non-cardiovascular disease were 3716 and 10474, respectively, as shown in Table 2. The per-label counts of cardiovascular diseases cannot simply be summed to obtain the total, because some ECG records carry more than one cardiovascular disease label.
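The following sketch shows why per-label counts overcount records in a multi-label setting. The four records and the label set are illustrative assumptions and are not taken from Table 2.

```python
from collections import Counter


def count_labels(records, cvd_labels):
    """Count records per cardiovascular label and per record group.

    records:    list of label lists, one list per ECG record.
    cvd_labels: set of labels treated as cardiovascular.
    Returns (per-label counts, records with any CVD label,
             records with no CVD label).
    """
    per_label = Counter()
    n_cvd = 0
    for labels in records:
        hits = set(labels) & cvd_labels
        per_label.update(hits)
        if hits:
            n_cvd += 1
    return per_label, n_cvd, len(records) - n_cvd


# Illustrative label set and records only.
CVD = {"(F) I40.0", "(V) I40.0", "Q21.0", "Q21.1", "M30.3"}
records = [
    ["Q21.0", "Q21.1"],      # two cardiovascular labels on one record
    ["(V) I40.0", "J18.9"],  # one cardiovascular and one other label
    ["J18.9"],               # no cardiovascular label
    ["M30.3"],
]
per_label, n_cvd, n_other = count_labels(records, CVD)
label_sum = sum(per_label.values())
```

Here three records carry a cardiovascular label, but the per-label counts add up to four, because the first record is counted once under each of its two labels.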