Processing XMLs
The process_xml module offers a robust codebase for validating and processing XML files. It establishes an API by mapping XML content to the MetaData, WaveForms, and MedianBeats class attributes. By utilizing user-supplied configuration files, both input and output data can be fully customized using ECGprocess or user-defined functions and methods.
In the following we illustrate the core functionality of the module. First we import the relevant functions and classes, as well as some example XML data.
[1]:
import ecgprocess.process_xml as pro_xml
import ecgprocess.utils.config_tools as config_utils
from tempfile import NamedTemporaryFile
from ecgprocess.example_data.examples import (
config_file,
list_xml_paths,
)
# the XML file and XSD schema paths
path = list_xml_paths()['example_1']
schema = list_xml_paths()['example_1_schema']
Loading an XML file
We start by loading an XML file and inspecting its tags, which we can then use to define a custom configuration file.
[2]:
reader = pro_xml.ECGXMLReader()
parsed_xml = reader(path, verbose=False)
parsed_xml.tags[0:10]
[2]:
['ObservationType',
'ObservationDateTime.Hour',
'ObservationDateTime.Minute',
'ObservationDateTime.Second',
'ObservationDateTime.Day',
'ObservationDateTime.Month',
'ObservationDateTime.Year',
'UID.DICOMStudyUID',
'ClinicalInfo.ReasonForStudy',
'ClinicalInfo.Technician.FamilyName']
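The tag list can be searched with ordinary list operations to find the entries worth adding to a configuration file. The snippet below is a minimal sketch that copies the ten tags printed above into a plain list so it runs without the example data; the output key "reason for study" is a hypothetical name chosen for illustration, and the tab-separated "key, then XML tag" layout follows the configuration entries used later in this section.

```python
# Tags copied from the output above; in the notebook this would be parsed_xml.tags.
tags = [
    "ObservationType",
    "ObservationDateTime.Hour",
    "ObservationDateTime.Minute",
    "ObservationDateTime.Second",
    "ObservationDateTime.Day",
    "ObservationDateTime.Month",
    "ObservationDateTime.Year",
    "UID.DICOMStudyUID",
    "ClinicalInfo.ReasonForStudy",
    "ClinicalInfo.Technician.FamilyName",
]

# Find every tag that belongs to one branch of the XML tree.
clinical = [tag for tag in tags if tag.startswith("ClinicalInfo.")]

# Pair output keys with XML tags (the key names are hypothetical), keeping
# only tags that are actually present in the file.
wanted = {
    "unique identifier": "UID.DICOMStudyUID",
    "reason for study": "ClinicalInfo.ReasonForStudy",
}
entries = [f"{key}\t{tag}" for key, tag in wanted.items() if tag in tags]

print(clinical)
print(entries)
```

Checking the tags against the file before writing the configuration helps catch misspelled paths early.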
Using the config file to map multiple strings to a single dictionary entry
It is not uncommon for XMLs containing ECG metadata to record a string across multiple tags. In these cases the number of XML tags is often not predefined, and extracting them through a rigid config file is not ideal.
To address this, config file entries starting with the [STARTSWITH] string are used to match multiple XML entries and string-join their content. Let us use this to map the diagnosis and conclusion text to dictionary keys.
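The idea behind prefix matching can be sketched in a few lines of plain Python. This is not the ECGprocess implementation: the function `resolve_entry`, the flattened dictionary `flat_xml`, and the numbered tag names in it are all assumptions made for illustration. The [DELIM] separator is the keyword that appears in the extracted output further down.

```python
STARTSWITH = "[STARTSWITH]"


def resolve_entry(target, flat_xml, sep="[DELIM]"):
    """Return the text for one config entry (hypothetical helper).

    Entries prefixed with [STARTSWITH] collect every tag beginning with the
    given prefix and join the texts; other entries are looked up directly.
    """
    if target.startswith(STARTSWITH):
        prefix = target[len(STARTSWITH):]
        matches = [text for tag, text in flat_xml.items() if tag.startswith(prefix)]
        return sep.join(matches)
    return flat_xml.get(target)


# Hypothetical flattened XML content: tag path -> text.
flat_xml = {
    "UID.DICOMStudyUID": "1.2.3",
    "Interpretation.Diagnosis.DiagnosisText_0.#text": "Sinus bradycardia",
    "Interpretation.Diagnosis.DiagnosisText_1.#text": "Otherwise normal ECG",
}

diagnosis = resolve_entry(
    "[STARTSWITH]Interpretation.Diagnosis.DiagnosisText", flat_xml
)
uid = resolve_entry("UID.DICOMStudyUID", flat_xml)
print(diagnosis)
print(uid)
```

Because the match is on a prefix, the same entry works whether a file holds two diagnosis lines or twenty.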
[5]:
config_xml = {
"WaveForms": [
"I\tStripData.WaveformData_0.#text",
"II\tStripData.WaveformData_1.#text",
],
"MetaData": [
"unique identifier\tUID.DICOMStudyUID",
"number of leads\tRestingECGMeasurements.MedianSamples.NumberOfLeads",
"resolution unit (waveforms)\tStripData.Resolution.@units",
"resolution (waveforms)\tStripData.Resolution.#text",
"resolution unit (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.@units",
"resolution (medianbeats)\tRestingECGMeasurements.MedianSamples.Resolution.#text",
"sampling frequency (original)\tRestingECGMeasurements.MedianSamples.SampleRate.#text",
"sampling frequency unit\tRestingECGMeasurements.MedianSamples.SampleRate.@units",
"sampling number (waveforms)\tStripData.ChannelSampleCountTotal",
"sampling number (medianbeats)\tRestingECGMeasurements.MedianSamples.ChannelSampleCountTotal",
"age\tPatientInfo.Age.#text",
"gender\tPatientInfo.Gender",
"birthday day\tPatientInfo.BirthDateTime.Day",
"birthday month\tPatientInfo.BirthDateTime.Month",
"birthday year\tPatientInfo.BirthDateTime.Year",
"sysbp unit\tPatientVisit.SysBP.@units",
"diabpb unit\tPatientVisit.DiaBP.@units",
"sysbp\tPatientVisit.SysBP.@text",
"diabpb\tPatientVisit.DiaBP.@text",
"pacemaker\tPatientInfo.PaceMaker",
"diagnosis\t[STARTSWITH]Interpretation.Diagnosis.DiagnosisText",
"conclusion\t[STARTSWITH]Interpretation.Conclusion.ConclusionText",
]
}
with NamedTemporaryFile("w") as tmp_file:
    _ = config_file(path=tmp_file.name, text=config_xml)
    parser_xml_2 = config_utils.ConfigParser(tmp_file.name)()
# adding the mapper
parser_xml_2.map(mapper=config_utils.DataMap())
print(parser_xml_2)
ConfigParser
[WaveForms]
I StripData.WaveformData_0.#text
II StripData.WaveformData_1.#text
[MetaData]
unique identifier UID.DICOMStudyUID
number of leads RestingECGMeasurements.MedianSamples.NumberOfLeads
resolution unit (waveforms) StripData.Resolution.@units
resolution (waveforms) StripData.Resolution.#text
resolution unit (medianbeats) RestingECGMeasurements.MedianSamples.Resolution.@units
resolution (medianbeats) RestingECGMeasurements.MedianSamples.Resolution.#text
sampling frequency (original) RestingECGMeasurements.MedianSamples.SampleRate.#text
sampling frequency unit RestingECGMeasurements.MedianSamples.SampleRate.@units
sampling number (waveforms) StripData.ChannelSampleCountTotal
sampling number (medianbeats) RestingECGMeasurements.MedianSamples.ChannelSampleCountTotal
age PatientInfo.Age.#text
gender PatientInfo.Gender
birthday day PatientInfo.BirthDateTime.Day
birthday month PatientInfo.BirthDateTime.Month
birthday year PatientInfo.BirthDateTime.Year
sysbp unit PatientVisit.SysBP.@units
diabpb unit PatientVisit.DiaBP.@units
sysbp PatientVisit.SysBP.@text
diabpb PatientVisit.DiaBP.@text
pacemaker PatientInfo.PaceMaker
diagnosis [STARTSWITH]Interpretation.Diagnosis.DiagnosisText
conclusion [STARTSWITH]Interpretation.Conclusion.ConclusionText
[6]:
### Mapping the XML content to the API entry points
extract = parsed_xml.extract(config=parser_xml_2)
### showing the metadata diagnosis and conclusion - note the addition of the [DELIM] keyword.
print(f"Metadata diagnosis:\n{extract.MetaData['diagnosis']},\nand metadata conclusion:\n{extract.MetaData['conclusion']}.")
Metadata diagnosis:
Sinus bradycardia[DELIM]Otherwise normal ECG[DELIM]---[DELIM]Arrhythmia results of the full-disclosure ECG[DELIM]QRS Complexes: 4,
and metadata conclusion:
Sinus bradycardia[DELIM]Otherwise normal ECG[DELIM]---[DELIM]Arrhythmia results of the full-disclosure ECG[DELIM]QRS Complexes: 4.
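Since the joined text keeps the [DELIM] keyword between the original XML entries, it can be split back into individual statements with standard string methods. The following is a small sketch using the diagnosis string printed above; dropping the "---" separator line is an optional clean-up step added here for illustration.

```python
DELIM = "[DELIM]"

# The diagnosis text as printed above.
diagnosis = (
    "Sinus bradycardia[DELIM]Otherwise normal ECG[DELIM]---[DELIM]"
    "Arrhythmia results of the full-disclosure ECG[DELIM]QRS Complexes: 4"
)

# Recover one statement per original XML entry.
statements = [part.strip() for part in diagnosis.split(DELIM)]

# Optionally drop entries that are only a visual separator.
findings = [part for part in statements if part.strip("-")]

print(statements)
print(findings)
```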
Validating an XML file and applying a minimal amount of data augmentation
The XML reader class can additionally validate the XML file against an XML schema. XML validation is highly recommended to ensure the content matches expectation and to identify files with an unanticipated structure.
Furthermore, if the augmented leads are omitted from the source XML file, we can calculate them, and we can resample the ECG signals to a standardized frequency of 500 Hz.
[7]:
reader = pro_xml.ECGXMLReader(augment_leads=True, resample_500=True)
parsed_xml = reader(path, schema=schema, verbose=False)
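To make the two augmentation steps concrete, the sketch below derives the missing limb leads from leads I and II using the standard Einthoven and Goldberger relations, and resamples a signal by linear interpolation. It is an illustration under stated assumptions, not a description of what ECGprocess does internally: the function names are hypothetical, and the library may well use a different resampling method.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Derive leads III, aVR, aVL and aVF from leads I and II."""
    derived = {"III": [], "aVR": [], "aVL": [], "aVF": []}
    for i, ii in zip(lead_i, lead_ii):
        derived["III"].append(ii - i)         # Einthoven: III = II - I
        derived["aVR"].append(-(i + ii) / 2)  # Goldberger: aVR = -(I + II) / 2
        derived["aVL"].append(i - ii / 2)     # Goldberger: aVL = I - II / 2
        derived["aVF"].append(ii - i / 2)     # Goldberger: aVF = II - I / 2
    return derived


def resample_linear(signal, source_hz, target_hz=500):
    """Resample by linear interpolation (a simple stand-in method)."""
    n_out = round(len(signal) * target_hz / source_hz)
    last = len(signal) - 1
    resampled = []
    for k in range(n_out):
        pos = k * source_hz / target_hz  # position in source sample units
        lo = min(int(pos), last)
        hi = min(lo + 1, last)           # hold the final sample at the edge
        frac = pos - lo
        resampled.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return resampled


# Toy signals in millivolts, assumed to be sampled at 250 Hz.
lead_i = [0.0, 0.2, 1.0, 0.2]
lead_ii = [0.0, 0.4, 1.6, 0.4]

leads = derive_limb_leads(lead_i, lead_ii)
upsampled = resample_linear(lead_i, source_hz=250)

print(leads["III"])
print(len(upsampled))
```

A quick sanity check on the derived leads is that aVR, aVL and aVF sum to zero at every sample.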