Processing XMLs

The process_xml module offers a robust codebase for validating and processing XML files. It establishes an API by mapping XML content to the MetaData, WaveForms, and MedianBeats class attributes. By utilizing user-supplied configuration files, both input and output data can be fully customized using ECGprocess or user-defined functions and methods.

In the following we will illustrate the core functionality of the module. First we import the relevant functions and classes, as well as some example XML data.

[1]:

import ecgprocess.process_xml as pro_xml
import ecgprocess.utils.config_tools as config_utils
from tempfile import NamedTemporaryFile
from ecgprocess.example_data.examples import (
    config_file,
    list_xml_paths,
)

# the XML file and XSD schema paths
path = list_xml_paths()['example_1']
schema = list_xml_paths()['example_1_schema']

Loading an XML file

We will start by loading an XML file and then see how its content can be used to define a custom configuration file.

[2]:

reader = pro_xml.ECGXMLReader()
parsed_xml = reader(path, verbose=False)
parsed_xml.tags[0:10]

[2]:

['ObservationType',
 'ObservationDateTime.Hour',
 'ObservationDateTime.Minute',
 'ObservationDateTime.Second',
 'ObservationDateTime.Day',
 'ObservationDateTime.Month',
 'ObservationDateTime.Year',
 'UID.DICOMStudyUID',
 'ClinicalInfo.ReasonForStudy',
 'ClinicalInfo.Technician.FamilyName']

Creating a configuration file based on the parsed XML file tags

The XML reader class flattens the XML file content into a single dictionary, where hierarchical XML tags are concatenated into individual dictionary keys. The data can be accessed directly through parsed_xml.raw_data, and the dictionary keys (as shown above) are available through the tags attribute.
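
To make the flattening idea concrete, here is a minimal sketch in plain Python. It is not the ECGprocess implementation: the helper name flatten, the toy nested input, the "@lead" attribute, and the "_0"/"_1" suffix rule for repeated tags are assumptions, chosen only to reproduce the style of the keys shown above.

```python
def flatten(node, prefix=""):
    """Flatten nested dicts/lists into one dict with dot-joined keys."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            name = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(value, name))
    elif isinstance(node, list):
        # Assumption: repeated sibling tags get a positional suffix,
        # which yields keys such as WaveformData_0 and WaveformData_1.
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}_{index}"))
    else:
        # A leaf value: store it under the fully concatenated key.
        flat[prefix] = node
    return flat


# Toy input (made up for illustration, not read from the example file).
nested = {
    "StripData": {
        "Resolution": {"@units": "uVperLsb", "#text": "5"},
        "WaveformData": [
            {"@lead": "I", "#text": "-4,-2,-2"},
            {"@lead": "II", "#text": "22,23,23"},
        ],
    },
    "PatientInfo": {"Gender": "MALE"},
}

flat = flatten(nested)
tags = list(flat)
print(tags)
```

Each leaf value ends up under exactly one key, so a configuration file only needs to name the full dotted path of the content it wants.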

We will use these tags to create a configuration file, mapping XML content to the class attributes MetaData, WaveForms, and MedianBeats. We will create a dictionary with lists containing tab-delimited strings, where the left-hand side (LHS) is the internal name used by ECGprocess and the right-hand side (RHS) is the XML tag name. In the current example we write this dictionary to file and immediately read it in again. In real applications one would typically have a configuration file stored on disk and re-use it across multiple analyses.
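
As a rough illustration of the LHS/RHS convention, the sketch below splits tab-delimited entries into a plain dictionary. The function parse_entries is hypothetical and is not part of ECGprocess; the real parsing is done by ConfigParser, whose internals are not shown here.

```python
def parse_entries(lines):
    """Split 'internal name<TAB>XML tag' strings into a dict."""
    mapping = {}
    for line in lines:
        internal, sep, tag = line.partition("\t")
        if not sep:
            # No tab found: the entry does not follow the LHS/RHS layout.
            raise ValueError(f"expected a tab-delimited entry, got {line!r}")
        mapping[internal.strip()] = tag.strip()
    return mapping


entries = [
    "I\tStripData.WaveformData_0.#text",
    "age\tPatientInfo.Age.#text",
]
mapping = parse_entries(entries)
print(mapping)
```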

[3]:

config_xml = {
        "WaveForms": [
            "I\tStripData.WaveformData_0.#text",
            "II\tStripData.WaveformData_1.#text",
        ],
        "MetaData": [
            "unique identifier\tUID.DICOMStudyUID",
            "number of leads\tRestingECGMeasurements.MedianSamples.NumberOfLeads",
            "resolution unit (waveforms)\tStripData.Resolution.@units",
            "resolution (waveforms)\tStripData.Resolution.#text",
            "resolution unit (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.@units",
            "resolution (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.#text",
            "sampling frequency (original)\tRestingECGMeasurements.MedianSamples.SampleRate.#text",
            "sampling frequency unit\tRestingECGMeasurements.MedianSamples.SampleRate.@units",
            "sampling number (waveforms)\tStripData.ChannelSampleCountTotal",
            "sampling number (medianbeats)\tRestingECGMeasurements.MedianSamples.ChannelSampleCountTotal",
            "age\tPatientInfo.Age.#text",
            "gender\tPatientInfo.Gender",
            "birthday day\tPatientInfo.BirthDateTime.Day",
            "birthday month\tPatientInfo.BirthDateTime.Month",
            "birthday year\tPatientInfo.BirthDateTime.Year",
            "sysbp unit\tPatientVisit.SysBP.@units",
            "diabpb unit\tPatientVisit.DiaBP.@units",
            "sysbp\tPatientVisit.SysBP.@text",
            "diabpb\tPatientVisit.DiaBP.@text",
            "pacemaker\tPatientInfo.PaceMaker",
        ]
    }
with NamedTemporaryFile("w") as tmp_file:
    _ = config_file(path=tmp_file.name, text=config_xml)
    parser_xml = config_utils.ConfigParser(tmp_file.name)()
# adding the mapper
parser_xml.map(mapper=config_utils.DataMap())
print(parser_xml)

ConfigParser
[WaveForms]
        I                                StripData.WaveformData_0.#text
        II                               StripData.WaveformData_1.#text

[MetaData]
        unique identifier                UID.DICOMStudyUID
        number of leads                  RestingECGMeasurements.MedianSamples.NumberOfLeads
        resolution unit (waveforms)      StripData.Resolution.@units
        resolution (waveforms)           StripData.Resolution.#text
        resolution unit (medianbeats)    RestingECGMeasurements.MedianSamples.Resolution.@units
        resolution (medianbeats)         RestingECGMeasurements.MedianSamples.Resolution.#text
        sampling frequency (original)    RestingECGMeasurements.MedianSamples.SampleRate.#text
        sampling frequency unit          RestingECGMeasurements.MedianSamples.SampleRate.@units
        sampling number (waveforms)      StripData.ChannelSampleCountTotal
        sampling number (medianbeats)    RestingECGMeasurements.MedianSamples.ChannelSampleCountTotal
        age                              PatientInfo.Age.#text
        gender                           PatientInfo.Gender
        birthday day                     PatientInfo.BirthDateTime.Day
        birthday month                   PatientInfo.BirthDateTime.Month
        birthday year                    PatientInfo.BirthDateTime.Year
        sysbp unit                       PatientVisit.SysBP.@units
        diabpb unit                      PatientVisit.DiaBP.@units
        sysbp                            PatientVisit.SysBP.@text
        diabpb                           PatientVisit.DiaBP.@text
        pacemaker                        PatientInfo.PaceMaker

[4]:

### Mapping the XML content to the API entry points
extract = parsed_xml.extract(config=parser_xml)
### showing the API content
# Notice that the lead names which are excluded by the config file are set to `None`. This is ensured by the `DataMap`
# class, which makes sure privileged data such as the leads are always present.
print(f'Metadata:\n{extract.MetaData}\nWaveforms:\n{extract.WaveForms}\nMedianBeats:\n{extract.MedianBeats}')

Metadata:
{'unique identifier': '1.2.840.113619.2.235.305770679234075180681238120', 'number of leads': 12, 'resolution unit (waveforms)': 'uVperLsb', 'resolution unit (medianbeats)': 'uVperLsb', 'resolution (waveforms)': 5, 'resolution (medianbeats)': 5, 'sampling frequency (original)': 500, 'sampling frequency unit': 'Hz', 'sampling number (waveforms)': 5000, 'sampling number (medianbeats)': 600, 'age': 53, 'gender': 'MALE', 'birthday day': 1, 'birthday month': 1, 'birthday year': 1965, 'sysbp unit': 'mmHg', 'diabpb unit': 'mmHg', 'sysbp': None, 'diabpb': None, 'pacemaker': 'no', 'duration (sec)': 10.0, 'sampling frequency (processed)': 500}
Waveforms:
{'I': array([ -4,  -2,  -2, ..., -17, -22, -30], shape=(5000,)), 'II': array([ 22,  23,  23, ...,  -3,  -6, -10], shape=(5000,)), 'III': None, 'V1': None, 'V2': None, 'V3': None, 'V4': None, 'V5': None, 'V6': None, 'aVF': None, 'aVL': None, 'aVR': None}
MedianBeats:
{'I': None, 'II': None, 'III': None, 'V1': None, 'V2': None, 'V3': None, 'V4': None, 'V5': None, 'V6': None, 'aVF': None, 'aVL': None, 'aVR': None}
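
The `None` placeholders in the output above can be pictured with a small sketch. This is only an assumption about the general idea (start from a full set of lead names, then overwrite the ones that were mapped); the names STANDARD_LEADS and with_all_leads are hypothetical and do not come from DataMap.

```python
# The twelve standard lead names, as they appear in the output above.
STANDARD_LEADS = (
    "I", "II", "III", "aVR", "aVL", "aVF",
    "V1", "V2", "V3", "V4", "V5", "V6",
)


def with_all_leads(extracted):
    """Return a dict holding every standard lead, None where missing."""
    filled = {lead: None for lead in STANDARD_LEADS}
    filled.update(extracted)  # mapped leads overwrite the placeholders
    return filled


waveforms = with_all_leads({"I": [-4, -2, -2], "II": [22, 23, 23]})
print(waveforms)
```

Downstream code can then rely on every lead key being present and only needs to check for `None`.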

Using the config file to map multiple strings to a single dictionary entry

It is not uncommon for XMLs containing ECG metadata to record strings across multiple tags. In these cases the number of XML tags is often not predefined, and extracting them through a rigid config file is not ideal.

To address this, config file entries starting with the [STARTSWITH] string are used to match multiple XML entries and string-join their content. Let us use this to map the diagnosis and conclusion text to dictionary keys.

[5]:

config_xml = {
        "WaveForms": [
            "I\tStripData.WaveformData_0.#text",
            "II\tStripData.WaveformData_1.#text",
        ],
        "MetaData": [
            "unique identifier\tUID.DICOMStudyUID",
            "number of leads\tRestingECGMeasurements.MedianSamples.NumberOfLeads",
            "resolution unit (waveforms)\tStripData.Resolution.@units",
            "resolution (waveforms)\tStripData.Resolution.#text",
            "resolution unit (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.@units",
            "resolution (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.#text",
            "sampling frequency (original)\tRestingECGMeasurements.MedianSamples.SampleRate.#text",
            "sampling frequency unit\tRestingECGMeasurements.MedianSamples.SampleRate.@units",
            "sampling number (waveforms)\tStripData.ChannelSampleCountTotal",
            "sampling number (medianbeats)\tRestingECGMeasurements.MedianSamples.ChannelSampleCountTotal",
            "age\tPatientInfo.Age.#text",
            "gender\tPatientInfo.Gender",
            "birthday day\tPatientInfo.BirthDateTime.Day",
            "birthday month\tPatientInfo.BirthDateTime.Month",
            "birthday year\tPatientInfo.BirthDateTime.Year",
            "sysbp unit\tPatientVisit.SysBP.@units",
            "diabpb unit\tPatientVisit.DiaBP.@units",
            "sysbp\tPatientVisit.SysBP.@text",
            "diabpb\tPatientVisit.DiaBP.@text",
            "pacemaker\tPatientInfo.PaceMaker",
            "diagnosis\t[STARTSWITH]Interpretation.Diagnosis.DiagnosisText",
            "conclusion\t[STARTSWITH]Interpretation.Conclusion.ConclusionText",
        ]
    }
with NamedTemporaryFile("w") as tmp_file:
    _ = config_file(path=tmp_file.name, text=config_xml)
    parser_xml_2 = config_utils.ConfigParser(tmp_file.name)()
# adding the mapper
parser_xml_2.map(mapper=config_utils.DataMap())
print(parser_xml_2)

ConfigParser
[WaveForms]
        I                                StripData.WaveformData_0.#text
        II                               StripData.WaveformData_1.#text

[MetaData]
        unique identifier                UID.DICOMStudyUID
        number of leads                  RestingECGMeasurements.MedianSamples.NumberOfLeads
        resolution unit (waveforms)      StripData.Resolution.@units
        resolution (waveforms)           StripData.Resolution.#text
        resolution unit (medianbeats)    RestingECGMeasurements.MedianSamples.Resolution.@units
        resolution (medianbeats)         RestingECGMeasurements.MedianSamples.Resolution.#text
        sampling frequency (original)    RestingECGMeasurements.MedianSamples.SampleRate.#text
        sampling frequency unit          RestingECGMeasurements.MedianSamples.SampleRate.@units
        sampling number (waveforms)      StripData.ChannelSampleCountTotal
        sampling number (medianbeats)    RestingECGMeasurements.MedianSamples.ChannelSampleCountTotal
        age                              PatientInfo.Age.#text
        gender                           PatientInfo.Gender
        birthday day                     PatientInfo.BirthDateTime.Day
        birthday month                   PatientInfo.BirthDateTime.Month
        birthday year                    PatientInfo.BirthDateTime.Year
        sysbp unit                       PatientVisit.SysBP.@units
        diabpb unit                      PatientVisit.DiaBP.@units
        sysbp                            PatientVisit.SysBP.@text
        diabpb                           PatientVisit.DiaBP.@text
        pacemaker                        PatientInfo.PaceMaker
        diagnosis                        [STARTSWITH]Interpretation.Diagnosis.DiagnosisText
        conclusion                       [STARTSWITH]Interpretation.Conclusion.ConclusionText

[6]:

### Mapping the XML content to the API entry points
extract = parsed_xml.extract(config=parser_xml_2)
### showing the metadata diagnosis and conclusion - note the addition of the [DELIM] keyword.
print(f"Metadata diagnosis:\n{extract.MetaData['diagnosis']},\nand metadata conclusion:\n{extract.MetaData['conclusion']}.")

Metadata diagnosis:
Sinus bradycardia[DELIM]Otherwise normal ECG[DELIM]---[DELIM]Arrhythmia results of the full-disclosure ECG[DELIM]QRS Complexes: 4,
and metadata conclusion:
Sinus bradycardia[DELIM]Otherwise normal ECG[DELIM]---[DELIM]Arrhythmia results of the full-disclosure ECG[DELIM]QRS Complexes: 4.
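
The prefix matching and joining seen above can be sketched as follows. This is an illustration under stated assumptions, not the ECGprocess code: the function resolve, the toy flattened dictionary, and the "_0"/"_1" key suffixes are hypothetical; only the [STARTSWITH] and [DELIM] keywords are taken from the example.

```python
STARTSWITH = "[STARTSWITH]"
DELIM = "[DELIM]"


def resolve(tag, flat):
    """Look up a config tag in a flattened dict of XML content."""
    if tag.startswith(STARTSWITH):
        prefix = tag[len(STARTSWITH):]
        # Collect every value whose key begins with the prefix,
        # keeping the order in which the keys were stored.
        parts = [str(v) for k, v in flat.items() if k.startswith(prefix)]
        return DELIM.join(parts) if parts else None
    # Plain entries are exact key lookups; missing keys give None.
    return flat.get(tag)


# Toy flattened content (made up for illustration).
flat = {
    "Interpretation.Diagnosis.DiagnosisText_0": "Sinus bradycardia",
    "Interpretation.Diagnosis.DiagnosisText_1": "Otherwise normal ECG",
    "PatientInfo.Gender": "MALE",
}

diagnosis = resolve("[STARTSWITH]Interpretation.Diagnosis.DiagnosisText", flat)
gender = resolve("PatientInfo.Gender", flat)
print(diagnosis)
print(gender)
```

Because the joined string keeps an explicit delimiter, it can later be split back into its individual entries with `diagnosis.split("[DELIM]")`.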

Validating an XML file and applying a minimal amount of data augmentation

The XML reader class can additionally validate the XML file using an XML schema. XML validation is highly recommended to ensure the content matches expectations and to identify files with an unanticipated structure.

Furthermore, if the augmented leads are omitted from the source XML file, we can calculate them. The ECG signals can also be resampled to a standardized frequency of 500 Hz.

[7]:

reader = pro_xml.ECGXMLReader(augment_leads=True, resample_500=True)
parsed_xml = reader(path, schema=schema, verbose=False)
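
For context, missing limb leads can be derived from leads I and II using the standard Einthoven and Goldberger relations. The sketch below shows those textbook formulas on plain Python lists; it is not the ECGprocess implementation, the function name derive_limb_leads is hypothetical, and whether `augment_leads=True` computes exactly this set of leads is an assumption. Resampling is not covered here.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Derive III, aVR, aVL and aVF sample-by-sample from leads I and II."""
    if len(lead_i) != len(lead_ii):
        raise ValueError("leads I and II must have the same number of samples")
    pairs = list(zip(lead_i, lead_ii))
    return {
        "III": [ii - i for i, ii in pairs],         # III = II - I
        "aVR": [-(i + ii) / 2 for i, ii in pairs],  # aVR = -(I + II) / 2
        "aVL": [i - ii / 2 for i, ii in pairs],     # aVL = I - II / 2
        "aVF": [ii - i / 2 for i, ii in pairs],     # aVF = II - I / 2
    }


# First three samples of leads I and II from the output shown earlier.
lead_i = [-4, -2, -2]
lead_ii = [22, 23, 23]
derived = derive_limb_leads(lead_i, lead_ii)
print(derived)
```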