Processing ECG data to tabular format

By mapping ECG data (from DICOM, XML, or an alternative format) to a reader API, we can process multiple files into a tabular format. Here we show how ECGprocess can be used to flexibly combine MetaData, WaveForms, and MedianBeats data and write them, for example, to .tsv, .npz, or tfrecord files.

In the following we illustrate the core functionality of the module. First we import the relevant functions and classes, as well as some example data.

[1]:
import os
import sys  # needed further down for `file=sys.stdout`
import tempfile
import pandas as pd
import numpy as np
from ecgprocess.tabular import (
    ECGTable,
)
from ecgprocess.constants import (
    CoreData as Core,
)
from ecgprocess.process_dicom import(
    ECGDICOMReader,
)
from ecgprocess.process_xml import(
    ECGXMLReader,
)
from ecgprocess.example_data.examples import (
    parsed_config,
    list_dicom_paths,
)
from ecgprocess.utils.general import (
    list_tar,
)
from ecgprocess.utils.ecg_tools import(
    signal_resolution,
)
# #### Relevant paths
dicom_path_list = [
    list_dicom_paths()['example_dicom_1'],
    list_dicom_paths()['example_dicom_2'],
]
# #### parsed config files
parser_dicom = parsed_config()['parsed_dicom1']

Mapping ECG data to pandas.DataFrame

As an illustration we will map multiple ECG files to pandas.DataFrames.

[2]:
reader = ECGDICOMReader()
ecgtable = ECGTable(reader, path_list=dicom_path_list)
table = ecgtable().get_table(parsed_config=parser_dicom,)
# rotating the table to improve presentation
table.MetaData.T
[2]:
                                  2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.221835449869280278116555339248567804126
unique identifier                 2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.2218354498692802781165553...
number of leads                   12                                            12
resolution unit (waveforms)       uV                                            uV
resolution unit (medianbeats)     uV                                            uV
resolution (waveforms)            4.88                                          5.0
resolution (medianbeats)          4.88                                          5.0
sampling frequency (original)     500.0                                         500.0
sampling frequency unit           None                                          None
sampling number (waveforms)       5000                                          5000
sampling number (medianbeats)     600                                           512
Softwave version                  [1.02 SP03, MUSE_9.0.9.18167]                 None
Manufacturer                      GE Healthcare                                 GE
Model name                        MV360                                         MAC55
wave_channel_sens_0               4.88                                          5.0
wave_channel_sens_1               4.88                                          5.0
wave_channel_sens_2               4.88                                          5.0
wave_channel_sens_3               4.88                                          5.0
wave_channel_sens_4               4.88                                          5.0
wave_channel_sens_5               4.88                                          5.0
wave_channel_sens_6               4.88                                          5.0
wave_channel_sens_7               4.88                                          5.0
wave_channel_sens_8               4.88                                          5.0
wave_channel_sens_9               4.88                                          5.0
wave_channel_sens_10              4.88                                          5.0
wave_channel_sens_11              4.88                                          5.0
wave_channel_correctionfactor_0   1.0                                           1.0
wave_channel_correctionfactor_1   1.0                                           1.0
wave_channel_correctionfactor_2   1.0                                           1.0
wave_channel_correctionfactor_3   1.0                                           1.0
wave_channel_correctionfactor_4   1.0                                           1.0
wave_channel_correctionfactor_5   1.0                                           1.0
wave_channel_correctionfactor_6   1.0                                           1.0
wave_channel_correctionfactor_7   1.0                                           1.0
wave_channel_correctionfactor_8   1.0                                           1.0
wave_channel_correctionfactor_9   1.0                                           1.0
wave_channel_correctionfactor_10  1.0                                           1.0
wave_channel_correctionfactor_11  1.0                                           1.0
wave_channel_baseline_0           0.0                                           0.0
wave_channel_baseline_1           0.0                                           0.0
wave_channel_baseline_2           0.0                                           0.0
wave_channel_baseline_3           0.0                                           0.0
wave_channel_baseline_4           0.0                                           0.0
wave_channel_baseline_5           0.0                                           0.0
wave_channel_baseline_6           0.0                                           0.0
wave_channel_baseline_7           0.0                                           0.0
wave_channel_baseline_8           0.0                                           0.0
wave_channel_baseline_9           0.0                                           0.0
wave_channel_baseline_10          0.0                                           0.0
wave_channel_baseline_11          0.0                                           0.0
duration (sec)                    10.0                                          10.0
sampling frequency (processed)    None                                          None
key                               2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.2218354498692802781165553...
[3]:
# printing waveform data
table.WaveForms
[3]:
        key                                                Sampling sequence  Lead  Voltage  Signal type
0       2.25.269796857626990821315969488216511468638       0                  I     0.00     WaveForms
1       2.25.269796857626990821315969488216511468638       1                  I     0.00     WaveForms
2       2.25.269796857626990821315969488216511468638       2                  I     29.28    WaveForms
3       2.25.269796857626990821315969488216511468638       3                  I     43.92    WaveForms
4       2.25.269796857626990821315969488216511468638       4                  I     48.80    WaveForms
...     ...                                                ...                ...   ...      ...
119995  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4995               aVR   -195.00  WaveForms
119996  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4996               aVR   -190.00  WaveForms
119997  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4997               aVR   -190.00  WaveForms
119998  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4998               aVR   -185.00  WaveForms
119999  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4999               aVR   -185.00  WaveForms

120000 rows × 5 columns
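
The WaveForms table is in long format: one row per key, sample, and lead. For analyses that expect one column per lead it can be reshaped with plain pandas. The following is a minimal sketch on a small synthetic frame that only mimics the column names shown above; the key `ecg_a` and the voltages are made up for illustration.

```python
import pandas as pd

# Synthetic long-format data with the same column names as table.WaveForms
# (illustrative values, not taken from the example files).
long_df = pd.DataFrame({
    'key': ['ecg_a'] * 6,
    'Sampling sequence': [0, 1, 2, 0, 1, 2],
    'Lead': ['I', 'I', 'I', 'II', 'II', 'II'],
    'Voltage': [0.0, 5.0, 10.0, 1.0, 2.0, 3.0],
    'Signal type': ['WaveForms'] * 6,
})

# One row per (key, sample) and one column per lead.
wide_df = long_df.pivot(
    index=['key', 'Sampling sequence'], columns='Lead', values='Voltage',
)
print(wide_df)
```

Keeping `key` in the index means files with different identifiers remain separated after the reshape.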

Writing ECG data to tables

As an illustration we will write ECG data to individual compressed files: metadata.tsv.gz, waveforms.tsv.gz, and medianbeats.tsv.gz.

[4]:
with tempfile.TemporaryDirectory() as temp_dir:
    os.chdir(temp_dir)
    ecgreader = ECGDICOMReader()
    ecgtable = ECGTable(ecgreader, path_list=dicom_path_list, )
    res = ecgtable(verbose=False).write_ecg(parsed_config=parser_dicom, write_failed=False)
    print('Directory content:')
    for file_name in os.listdir(getattr(res, 'target_path')):
        print('   - {}'.format(file_name))
    # reading the data in - the temp dir will otherwise disappear
    table = pd.read_csv(os.path.join(getattr(res, 'target_path'), 'metadata.tsv.gz'),
                    sep='\t', index_col=0,)
Directory content:
   - waveforms.tsv.gz
   - metadata.tsv.gz
   - medianbeats.tsv.gz
[5]:
table.T
[5]:
                                  2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.221835449869280278116555339248567804126
unique identifier                 2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.2218354498692802781165553...
number of leads                   12                                            12
resolution unit (waveforms)       uV                                            uV
resolution unit (medianbeats)     uV                                            uV
resolution (waveforms)            4.88                                          5.0
resolution (medianbeats)          4.88                                          5.0
sampling frequency (original)     500.0                                         500.0
sampling frequency unit           NaN                                           NaN
sampling number (waveforms)       5000                                          5000
sampling number (medianbeats)     600                                           512
Softwave version                  ['1.02 SP03', 'MUSE_9.0.9.18167']             NaN
Manufacturer                      GE Healthcare                                 GE
Model name                        MV360                                         MAC55
wave_channel_sens_0               4.88                                          5.0
wave_channel_sens_1               4.88                                          5.0
wave_channel_sens_2               4.88                                          5.0
wave_channel_sens_3               4.88                                          5.0
wave_channel_sens_4               4.88                                          5.0
wave_channel_sens_5               4.88                                          5.0
wave_channel_sens_6               4.88                                          5.0
wave_channel_sens_7               4.88                                          5.0
wave_channel_sens_8               4.88                                          5.0
wave_channel_sens_9               4.88                                          5.0
wave_channel_sens_10              4.88                                          5.0
wave_channel_sens_11              4.88                                          5.0
wave_channel_correctionfactor_0   1.0                                           1.0
wave_channel_correctionfactor_1   1.0                                           1.0
wave_channel_correctionfactor_2   1.0                                           1.0
wave_channel_correctionfactor_3   1.0                                           1.0
wave_channel_correctionfactor_4   1.0                                           1.0
wave_channel_correctionfactor_5   1.0                                           1.0
wave_channel_correctionfactor_6   1.0                                           1.0
wave_channel_correctionfactor_7   1.0                                           1.0
wave_channel_correctionfactor_8   1.0                                           1.0
wave_channel_correctionfactor_9   1.0                                           1.0
wave_channel_correctionfactor_10  1.0                                           1.0
wave_channel_correctionfactor_11  1.0                                           1.0
wave_channel_baseline_0           0.0                                           0.0
wave_channel_baseline_1           0.0                                           0.0
wave_channel_baseline_2           0.0                                           0.0
wave_channel_baseline_3           0.0                                           0.0
wave_channel_baseline_4           0.0                                           0.0
wave_channel_baseline_5           0.0                                           0.0
wave_channel_baseline_6           0.0                                           0.0
wave_channel_baseline_7           0.0                                           0.0
wave_channel_baseline_8           0.0                                           0.0
wave_channel_baseline_9           0.0                                           0.0
wave_channel_baseline_10          0.0                                           0.0
wave_channel_baseline_11          0.0                                           0.0
duration (sec)                    10.0                                          10.0
sampling frequency (processed)    NaN                                           NaN
key                               2.25.269796857626990821315969488216511468638  1.3.6.1.4.1.40744.65.2218354498692802781165553...
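
Compared with the in-memory table, the round trip through a TSV file changes two things: missing values come back as NaN instead of None, and the list-valued `Softwave version` entry comes back as a string. If the list itself is needed, one option is to parse it with `ast.literal_eval`. The sketch below uses a tiny in-memory TSV with made-up keys (`ecg_a`, `ecg_b`) rather than the real output file, and `parse_list` is a hypothetical helper, not part of ECGprocess.

```python
import ast
import io

import pandas as pd

# A tiny TSV that mimics how a list-valued cell is stored as text.
tsv = io.StringIO(
    "key\tSoftwave version\n"
    "ecg_a\t['1.02 SP03', 'MUSE_9.0.9.18167']\n"
    "ecg_b\t\n"
)
meta_tsv = pd.read_csv(tsv, sep='\t', index_col=0)

def parse_list(cell):
    """Turn a stringified list back into a list; map missing values to None."""
    if pd.isna(cell):
        return None
    return ast.literal_eval(cell)

# Collect the parsed values in a plain Python list, one entry per record.
versions = [parse_list(cell) for cell in meta_tsv['Softwave version']]
print(versions)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from disk.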

Writing ECG data to numpy files

Next we will write ECG data to numpy files. Because numpy arrays predominantly hold numerical data, non-numerical data are removed automatically. Note that the header of this reduced metadata array is included in the output (header_metadata.txt).

[6]:
with tempfile.TemporaryDirectory() as temp_dir:
    os.chdir(temp_dir)
    ecgreader = ECGDICOMReader()
    ecgtable = ECGTable(ecgreader, path_list=dicom_path_list, )
    res = ecgtable(verbose=False).write_ecg(parsed_config=parser_dicom, write_failed=False,
                                           file_type='numpy')
    print('Directory content:')
    for file_name in os.listdir(getattr(res, 'target_path')):
        print('   - {}'.format(file_name))
    # reading the data in - the temp dir will otherwise disappear
    with np.load(os.path.join(getattr(res, 'target_path'), 'ecg_data.npz')) as data:
        meta = data['MetaData']
        median = data['MedianBeats']
Directory content:
   - ecg_data.npz
   - header_metadata.txt
[7]:
# The metadata
meta
[7]:
array([[1.20e+01, 4.88e+00, 4.88e+00, 5.00e+02, 5.00e+03, 6.00e+02,
        4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00,
        4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00,
        1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
        1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+01],
       [1.20e+01, 5.00e+00, 5.00e+00, 5.00e+02, 5.00e+03, 5.12e+02,
        5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00,
        5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00,
        1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
        1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
        1.00e+01]])
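
How write_ecg drops the non-numerical columns is not shown in this section. A comparable reduction can be sketched with plain pandas using `select_dtypes`, keeping the remaining column names as a header so that each array value can still be traced back to its field. The frame below is a small hand-made subset of the metadata shown earlier.

```python
import numpy as np
import pandas as pd

# A hand-made subset of the metadata table (two records, mixed dtypes).
meta_df = pd.DataFrame({
    'number of leads': [12, 12],
    'resolution unit (waveforms)': ['uV', 'uV'],
    'resolution (waveforms)': [4.88, 5.0],
    'Manufacturer': ['GE Healthcare', 'GE'],
    'duration (sec)': [10.0, 10.0],
})

# Keep only numerical columns and remember their names.
numeric_df = meta_df.select_dtypes(include='number')
meta_array = numeric_df.to_numpy(dtype=float)
header = list(numeric_df.columns)

# Pair each value of the first record with its column name.
first_record = dict(zip(header, meta_array[0]))
print(first_record)
```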
[8]:
# The medianbeats - note that the second entry ends in NaN values, which is consistent with
# its shorter median beats (512 versus 600 samples in the metadata)
median
[8]:
array([[[ -4.88,  -4.88,  -4.88, ...,  43.92,  53.68,   0.  ],
        [ 39.04,  34.16,  34.16, ..., 102.48, 102.48,   0.  ],
        [ 43.92,  39.04,  39.04, ...,  58.56,  48.8 ,   0.  ],
        ...,
        [ 43.92,  43.92,  39.04, ...,  24.4 ,  19.52,   0.  ],
        [ 39.04,  39.04,  39.04, ...,  34.16,  34.16,   0.  ],
        [ 29.28,  29.28,  29.28, ...,  39.04,  34.16,   0.  ]],

       [[-30.  , -30.  , -20.  , ...,    nan,    nan,    nan],
        [ 20.  ,  15.  ,  10.  , ...,    nan,    nan,    nan],
        [ 55.  ,  50.  ,  35.  , ...,    nan,    nan,    nan],
        ...,
        [ 40.  ,  40.  ,  40.  , ...,    nan,    nan,    nan],
        [-40.  , -40.  , -40.  , ...,    nan,    nan,    nan],
        [  0.  ,   0.  , -10.  , ...,    nan,    nan,    nan]]],
      dtype=float32)
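
One reading of the trailing NaN values is that signals of different length are padded to a common length before being stacked into a single array (the metadata lists 600 and 512 median-beat samples for the two files). How ECGTable does this internally is not shown here. The sketch below illustrates the general idea with a hypothetical helper, `pad_with_nan`, and shows how the valid samples can be recovered afterwards.

```python
import numpy as np

def pad_with_nan(signals, length):
    """Stack 1-D arrays into a 2-D float array, right-padding with NaN."""
    out = np.full((len(signals), length), np.nan, dtype=np.float32)
    for row, sig in enumerate(signals):
        out[row, :len(sig)] = sig
    return out

short = np.array([1.0, 2.0, 3.0])
long_ = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
stacked = pad_with_nan([short, long_], length=5)

# Count the valid samples per row and use NaN-aware reductions.
valid_counts = (~np.isnan(stacked)).sum(axis=1)
row_means = np.nanmean(stacked, axis=1)
print(valid_counts, row_means)
```

Using NaN-aware functions such as `np.nanmean` avoids the padded positions silently propagating NaN into summary statistics.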

Performing on-the-fly data engineering and/or QC

The tabular class comes equipped with entry points to perform data engineering tasks on each individual file. The parameters engineer_meta, engineer_wave, and engineer_median take user-defined functions/callables which can alter the data passing through these objects. The wave and median callables furthermore have access to the metadata, which is passed internally as meta_dict in the function kwargs. The meta callable is applied first so that the updated metadata can be used in the engineer_wave or engineer_median functions, for example to confirm or reorder the leads. These entry points allow for extensive pre-processing of both the signal data and metadata: essentially anything one can do on a single file can be done at this stage.

To illustrate the functionality we will create a custom function that ensures all signal data has a resolution of 5 uV. This example performs a single adjustment; in a real application one might want to chain multiple steps together. The engineer_meta function can be used to perform additional validation steps, for example filtering out DICOM files with specific software versions.
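
As a rough idea of what such a metadata validation step could look like, the sketch below rejects records with a given software version. The call signature that ECGTable expects for engineer_meta is not shown in this section, so everything here is an assumption: that the callable receives the metadata as a dict and returns it, that the version is stored under the `Softwave version` key seen in the tables above, and that raising an exception is an acceptable way to flag a file. The function name `reject_software_versions` is hypothetical.

```python
def reject_software_versions(meta_dict, blocked=('MUSE_9.0.9.18167',),
                             version_key='Softwave version', **kwargs):
    """Return meta_dict unchanged, or raise ValueError for a blocked version."""
    versions = meta_dict.get(version_key)
    if versions is None:
        # nothing to check
        return meta_dict
    if isinstance(versions, str):
        versions = [versions]
    hits = [v for v in versions if v in blocked]
    if hits:
        raise ValueError(f'Blocked software version(s): {hits}')
    return meta_dict

# A record without version information passes through unchanged.
ok = reject_software_versions({'Softwave version': None})

# A record with a blocked version is rejected.
try:
    reject_software_versions(
        {'Softwave version': ['1.02 SP03', 'MUSE_9.0.9.18167']})
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```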

[9]:
# User-defined function to scale the ECG signal to 5 uV.
def signal_standardise_res(
    signals:dict[str, np.ndarray], target_resolution:float = 5.,
    verbose:bool=False, **kwargs,
) -> dict[str, np.ndarray]:
    """
    Standardise the signal resolution.

    Parameters
    ----------
    signals : `dict` [`str`, `np.ndarray`]
        A dictionary mapping channel names (strings) to waveform arrays.
    target_resolution : `float`, default 5
        The target resolution.
    verbose : `bool`, default False
        If True, prints additional debug information about the correction
        process.
    **kwargs
        Additional keyword arguments, which must include a dictionary under
        the key `meta_dict`. This dictionary should contain:
        - wave_channel_sens_<i> : float
            The wave channel sensitivity/resolution for channel i, one entry
            per channel index.

    Returns
    -------
    dict [`str`, `np.ndarray`]
        The input signals dictionary with corrected signals.

    Raises
    ------
    KeyError
        If `meta_dict` is not found in **kwargs.
    """
    # key prefix - meta_dict should contain one `wave_channel_sens_<i>` per channel
    sens = 'wave_channel_sens_'
    # the algorithm
    if 'meta_dict' not in kwargs:
        raise KeyError("meta_dict should be included as kwargs")
    meta_dict = kwargs['meta_dict']
    for i, (k, v) in enumerate(signals.items()):
        # skip if None
        if v is None:
            signals[k] = v
            continue
        # confirming this is a np.array
        if not isinstance(v, np.ndarray):
            raise ValueError("signals should be supplied as np.ndarray's")
        if verbose:
            if target_resolution/meta_dict[sens+str(i)] != 1.0:
                print(f'Scaling lead `{k}` by factor '
                      f'`{target_resolution/meta_dict[sens+str(i)]}`.',
                      file=sys.stdout
                      )
        signals[k] = signal_resolution(
            v, resolution_current=meta_dict[sens+str(i)],
            resolution_target=target_resolution,
        )
    return signals
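
Because engineer_wave and engineer_median each take a single callable, several processing steps can be wrapped into one function before being passed on. The sketch below shows one way to do this; `chain_steps`, `remove_offset`, and `clip_extremes` are hypothetical helpers written for this illustration, and they only assume the same call pattern as signal_standardise_res (a signals dict plus keyword arguments, returning a signals dict).

```python
import numpy as np

def chain_steps(*steps):
    """Combine several signal callables into one, applied left to right."""
    def chained(signals, **kwargs):
        for step in steps:
            signals = step(signals, **kwargs)
        return signals
    return chained

# Two toy steps with the same call pattern as signal_standardise_res.
def remove_offset(signals, **kwargs):
    """Subtract the mean of each lead; leave missing leads as None."""
    return {k: None if v is None else v - np.mean(v)
            for k, v in signals.items()}

def clip_extremes(signals, limit=100.0, **kwargs):
    """Clip each lead to the interval [-limit, limit]."""
    return {k: None if v is None else np.clip(v, -limit, limit)
            for k, v in signals.items()}

pipeline = chain_steps(remove_offset, clip_extremes)
result = pipeline({'I': np.array([0.0, 300.0, -300.0]), 'II': None},
                  meta_dict={})
print(result)
```

Under these assumptions the combined callable could be supplied in the same way as a single step, for example `engineer_wave=chain_steps(signal_standardise_res, remove_offset)`.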

[10]:
reader = ECGDICOMReader()
ecgtable = ECGTable(reader, path_list=dicom_path_list,
                    engineer_wave=signal_standardise_res, engineer_median=signal_standardise_res)
table = ecgtable().get_table(parsed_config=parser_dicom, )
# Note the voltage values of the first file are affected - the second file already has a resolution of 5.
table.WaveForms
[10]:
        key                                                Sampling sequence  Lead  Voltage  Signal type
0       2.25.269796857626990821315969488216511468638       0                  I     0.0      WaveForms
1       2.25.269796857626990821315969488216511468638       1                  I     0.0      WaveForms
2       2.25.269796857626990821315969488216511468638       2                  I     30.0     WaveForms
3       2.25.269796857626990821315969488216511468638       3                  I     45.0     WaveForms
4       2.25.269796857626990821315969488216511468638       4                  I     50.0     WaveForms
...     ...                                                ...                ...   ...      ...
119995  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4995               aVR   -195.0   WaveForms
119996  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4996               aVR   -190.0   WaveForms
119997  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4997               aVR   -190.0   WaveForms
119998  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4998               aVR   -185.0   WaveForms
119999  1.3.6.1.4.1.40744.65.2218354498692802781165553...  4999               aVR   -185.0   WaveForms

120000 rows × 5 columns
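
The change in the first file can be checked by hand. The implementation of signal_resolution is not shown here, but multiplying each voltage by the ratio of target to current resolution reproduces the values above: 29.28 uV at a resolution of 4.88 uV corresponds to 6 steps, and 6 steps at 5 uV gives 30 uV. The sketch below applies this to the first five lead I samples from the unscaled table; `rescale` is a hypothetical stand-in, not the library function.

```python
import numpy as np

def rescale(voltage, resolution_current, resolution_target):
    """Rescale voltages so that one step corresponds to resolution_target."""
    return voltage * (resolution_target / resolution_current)

# First five lead I samples of the first file before scaling.
before = np.array([0.00, 0.00, 29.28, 43.92, 48.80])
after = rescale(before, resolution_current=4.88, resolution_target=5.0)
print(np.round(after, 2))
```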