Processing ECG data to tabular format
By mapping ECG data (from DICOM, XML, or an alternative format) to a reader API, we can process multiple files to a tabular format. Here we show how ECGprocess can be used to flexibly combine MetaData, WaveForms, and MedianBeats data and write them, for example, to .tsv, .npz, or tfrecord files.
In the following we illustrate the core functionality of the module. First we import the relevant functions and classes, as well as some example data.
[1]:
import os
import tempfile
import pandas as pd
import numpy as np
from ecgprocess.tabular import (
    ECGTable,
)
from ecgprocess.constants import (
    CoreData as Core,
)
from ecgprocess.process_dicom import (
    ECGDICOMReader,
)
from ecgprocess.process_xml import (
    ECGXMLReader,
)
from ecgprocess.example_data.examples import (
    parsed_config,
    list_dicom_paths,
)
from ecgprocess.utils.general import (
    list_tar,
)
from ecgprocess.utils.ecg_tools import (
    signal_resolution,
)
# #### Relevant paths
dicom_path_list = [
    list_dicom_paths()['example_dicom_1'],
    list_dicom_paths()['example_dicom_2'],
]
# #### parsed config files
parser_dicom = parsed_config()['parsed_dicom1']
Mapping ECG data to pandas.DataFrame
As an illustration, we will map multiple ECG files to pandas.DataFrame objects.
[2]:
reader = ECGDICOMReader()
ecgtable = ECGTable(reader, path_list=dicom_path_list)
table = ecgtable().get_table(parsed_config=parser_dicom,)
# rotating the table to improve presentation
table.MetaData.T
[2]:
| | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.221835449869280278116555339248567804126 |
|---|---|---|
| unique identifier | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... |
| number of leads | 12 | 12 |
| resolution unit (waveforms) | uV | uV |
| resolution unit (medianbeats) | uV | uV |
| resolution (waveforms) | 4.88 | 5.0 |
| resolution (medianbeats) | 4.88 | 5.0 |
| sampling frequency (original) | 500.0 | 500.0 |
| sampling frequency unit | None | None |
| sampling number (waveforms) | 5000 | 5000 |
| sampling number (medianbeats) | 600 | 512 |
| Softwave version | [1.02 SP03, MUSE_9.0.9.18167] | None |
| Manufacturer | GE Healthcare | GE |
| Model name | MV360 | MAC55 |
| wave_channel_sens_0 | 4.88 | 5.0 |
| wave_channel_sens_1 | 4.88 | 5.0 |
| wave_channel_sens_2 | 4.88 | 5.0 |
| wave_channel_sens_3 | 4.88 | 5.0 |
| wave_channel_sens_4 | 4.88 | 5.0 |
| wave_channel_sens_5 | 4.88 | 5.0 |
| wave_channel_sens_6 | 4.88 | 5.0 |
| wave_channel_sens_7 | 4.88 | 5.0 |
| wave_channel_sens_8 | 4.88 | 5.0 |
| wave_channel_sens_9 | 4.88 | 5.0 |
| wave_channel_sens_10 | 4.88 | 5.0 |
| wave_channel_sens_11 | 4.88 | 5.0 |
| wave_channel_correctionfactor_0 | 1.0 | 1.0 |
| wave_channel_correctionfactor_1 | 1.0 | 1.0 |
| wave_channel_correctionfactor_2 | 1.0 | 1.0 |
| wave_channel_correctionfactor_3 | 1.0 | 1.0 |
| wave_channel_correctionfactor_4 | 1.0 | 1.0 |
| wave_channel_correctionfactor_5 | 1.0 | 1.0 |
| wave_channel_correctionfactor_6 | 1.0 | 1.0 |
| wave_channel_correctionfactor_7 | 1.0 | 1.0 |
| wave_channel_correctionfactor_8 | 1.0 | 1.0 |
| wave_channel_correctionfactor_9 | 1.0 | 1.0 |
| wave_channel_correctionfactor_10 | 1.0 | 1.0 |
| wave_channel_correctionfactor_11 | 1.0 | 1.0 |
| wave_channel_baseline_0 | 0.0 | 0.0 |
| wave_channel_baseline_1 | 0.0 | 0.0 |
| wave_channel_baseline_2 | 0.0 | 0.0 |
| wave_channel_baseline_3 | 0.0 | 0.0 |
| wave_channel_baseline_4 | 0.0 | 0.0 |
| wave_channel_baseline_5 | 0.0 | 0.0 |
| wave_channel_baseline_6 | 0.0 | 0.0 |
| wave_channel_baseline_7 | 0.0 | 0.0 |
| wave_channel_baseline_8 | 0.0 | 0.0 |
| wave_channel_baseline_9 | 0.0 | 0.0 |
| wave_channel_baseline_10 | 0.0 | 0.0 |
| wave_channel_baseline_11 | 0.0 | 0.0 |
| duration (sec) | 10.0 | 10.0 |
| sampling frequency (processed) | None | None |
| key | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... |
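The metadata table holds one sensitivity value per channel (`wave_channel_sens_0` to `wave_channel_sens_11`). As a minimal sketch of how such prefixed columns can be inspected with pandas, the example below builds a small synthetic stand-in for `table.MetaData` (the index labels `ecg_a` and `ecg_b` and the reduced set of columns are assumptions for illustration) and checks whether every channel of a file shares the same sensitivity.

```python
import pandas as pd

# Synthetic stand-in for `table.MetaData`: one row per ECG and a subset of
# the per-channel sensitivity columns shown above (labels are illustrative).
meta = pd.DataFrame(
    {
        "resolution (waveforms)": [4.88, 5.0],
        "wave_channel_sens_0": [4.88, 5.0],
        "wave_channel_sens_1": [4.88, 5.0],
        "wave_channel_sens_2": [4.88, 5.0],
    },
    index=["ecg_a", "ecg_b"],
)

# Select every per-channel sensitivity column by its shared prefix.
sens = meta.filter(like="wave_channel_sens_")

# True for a file when all of its channels share a single sensitivity value.
uniform = sens.nunique(axis=1).eq(1)
print(uniform)
```

A check like this can be useful before rescaling signals, because a file with mixed channel sensitivities would need per-channel handling.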
[3]:
# printing waveform data
table.WaveForms
[3]:
| | key | Sampling sequence | Lead | Voltage | Signal type |
|---|---|---|---|---|---|
| 0 | 2.25.269796857626990821315969488216511468638 | 0 | I | 0.00 | WaveForms |
| 1 | 2.25.269796857626990821315969488216511468638 | 1 | I | 0.00 | WaveForms |
| 2 | 2.25.269796857626990821315969488216511468638 | 2 | I | 29.28 | WaveForms |
| 3 | 2.25.269796857626990821315969488216511468638 | 3 | I | 43.92 | WaveForms |
| 4 | 2.25.269796857626990821315969488216511468638 | 4 | I | 48.80 | WaveForms |
| ... | ... | ... | ... | ... | ... |
| 119995 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4995 | aVR | -195.00 | WaveForms |
| 119996 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4996 | aVR | -190.00 | WaveForms |
| 119997 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4997 | aVR | -190.00 | WaveForms |
| 119998 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4998 | aVR | -185.00 | WaveForms |
| 119999 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4999 | aVR | -185.00 | WaveForms |
120000 rows × 5 columns
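The WaveForms table is in long format, with one row per file, lead, and sample, which is why two 12-lead recordings of 5000 samples each yield 120000 rows. For analyses that expect one column per lead, the table can be reshaped with plain pandas. The sketch below uses a small synthetic table with the same column names as the output above; the key `ecg_a`, the two leads, and the voltages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic long-format table with the same columns as `table.WaveForms`.
leads = ["I", "II"]
n_samples = 4
long_table = pd.DataFrame(
    {
        "key": "ecg_a",
        "Sampling sequence": np.tile(np.arange(n_samples), len(leads)),
        "Lead": np.repeat(leads, n_samples),
        "Voltage": np.arange(n_samples * len(leads), dtype=float),
        "Signal type": "WaveForms",
    }
)

# Reshape to wide format: one row per (file, sample), one column per lead.
wide_table = long_table.pivot(
    index=["key", "Sampling sequence"], columns="Lead", values="Voltage"
)
print(wide_table)
```

Passing a list to `index` requires a reasonably recent pandas release; with older versions, `set_index` followed by `unstack` gives the same result.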
Writing ECG data to tables
As an illustration we will write ECG data to individual gzip-compressed files: metadata.tsv.gz, waveforms.tsv.gz, and medianbeats.tsv.gz.
[4]:
with tempfile.TemporaryDirectory() as temp_dir:
    os.chdir(temp_dir)
    ecgreader = ECGDICOMReader()
    ecgtable = ECGTable(ecgreader, path_list=dicom_path_list, )
    res = ecgtable(verbose=False).write_ecg(parsed_config=parser_dicom, write_failed=False)
    print('Directory content:')
    for file_name in os.listdir(getattr(res, 'target_path')):
        print(' - {}'.format(file_name))
    # reading the data in - the temp dir will otherwise disappear
    table = pd.read_csv(os.path.join(getattr(res, 'target_path'), 'metadata.tsv.gz'),
                        sep='\t', index_col=0,)
Directory content:
- waveforms.tsv.gz
- metadata.tsv.gz
- medianbeats.tsv.gz
[5]:
table.T
[5]:
| | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.221835449869280278116555339248567804126 |
|---|---|---|
| unique identifier | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... |
| number of leads | 12 | 12 |
| resolution unit (waveforms) | uV | uV |
| resolution unit (medianbeats) | uV | uV |
| resolution (waveforms) | 4.88 | 5.0 |
| resolution (medianbeats) | 4.88 | 5.0 |
| sampling frequency (original) | 500.0 | 500.0 |
| sampling frequency unit | NaN | NaN |
| sampling number (waveforms) | 5000 | 5000 |
| sampling number (medianbeats) | 600 | 512 |
| Softwave version | ['1.02 SP03', 'MUSE_9.0.9.18167'] | NaN |
| Manufacturer | GE Healthcare | GE |
| Model name | MV360 | MAC55 |
| wave_channel_sens_0 | 4.88 | 5.0 |
| wave_channel_sens_1 | 4.88 | 5.0 |
| wave_channel_sens_2 | 4.88 | 5.0 |
| wave_channel_sens_3 | 4.88 | 5.0 |
| wave_channel_sens_4 | 4.88 | 5.0 |
| wave_channel_sens_5 | 4.88 | 5.0 |
| wave_channel_sens_6 | 4.88 | 5.0 |
| wave_channel_sens_7 | 4.88 | 5.0 |
| wave_channel_sens_8 | 4.88 | 5.0 |
| wave_channel_sens_9 | 4.88 | 5.0 |
| wave_channel_sens_10 | 4.88 | 5.0 |
| wave_channel_sens_11 | 4.88 | 5.0 |
| wave_channel_correctionfactor_0 | 1.0 | 1.0 |
| wave_channel_correctionfactor_1 | 1.0 | 1.0 |
| wave_channel_correctionfactor_2 | 1.0 | 1.0 |
| wave_channel_correctionfactor_3 | 1.0 | 1.0 |
| wave_channel_correctionfactor_4 | 1.0 | 1.0 |
| wave_channel_correctionfactor_5 | 1.0 | 1.0 |
| wave_channel_correctionfactor_6 | 1.0 | 1.0 |
| wave_channel_correctionfactor_7 | 1.0 | 1.0 |
| wave_channel_correctionfactor_8 | 1.0 | 1.0 |
| wave_channel_correctionfactor_9 | 1.0 | 1.0 |
| wave_channel_correctionfactor_10 | 1.0 | 1.0 |
| wave_channel_correctionfactor_11 | 1.0 | 1.0 |
| wave_channel_baseline_0 | 0.0 | 0.0 |
| wave_channel_baseline_1 | 0.0 | 0.0 |
| wave_channel_baseline_2 | 0.0 | 0.0 |
| wave_channel_baseline_3 | 0.0 | 0.0 |
| wave_channel_baseline_4 | 0.0 | 0.0 |
| wave_channel_baseline_5 | 0.0 | 0.0 |
| wave_channel_baseline_6 | 0.0 | 0.0 |
| wave_channel_baseline_7 | 0.0 | 0.0 |
| wave_channel_baseline_8 | 0.0 | 0.0 |
| wave_channel_baseline_9 | 0.0 | 0.0 |
| wave_channel_baseline_10 | 0.0 | 0.0 |
| wave_channel_baseline_11 | 0.0 | 0.0 |
| duration (sec) | 10.0 | 10.0 |
| sampling frequency (processed) | NaN | NaN |
| key | 2.25.269796857626990821315969488216511468638 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... |
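Comparing this table with the in-memory version shows two effects of the round trip through text: `None` is read back as `NaN`, and the list of software versions is read back as its string representation. The sketch below reproduces this with a small synthetic table (the column names `software` and `leads` and the in-memory buffer are assumptions for illustration) and shows one possible way to recover the lists using `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Synthetic metadata with a list-valued and a missing entry, mimicking the
# software version column above.
meta = pd.DataFrame(
    {"software": [["1.02 SP03", "MUSE_9.0.9.18167"], None], "leads": [12, 12]},
    index=["ecg_a", "ecg_b"],
)

# Round trip through tab-separated text (in memory rather than a .tsv.gz file).
buffer = io.StringIO()
meta.to_csv(buffer, sep="\t")
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)

# Lists come back as strings and None as NaN; literal_eval parses the
# strings back into Python lists and leaves missing values as None.
restored["software"] = restored["software"].map(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else None
)
print(restored)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing text read from a file.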
Writing ECG data to numpy files
Next we will write ECG data to numpy files. Because numpy predominantly works with numerical data, non-numerical data is automatically removed. Note that the header of this reduced metadata array is included in the output (header_metadata.txt).
[6]:
with tempfile.TemporaryDirectory() as temp_dir:
    os.chdir(temp_dir)
    ecgreader = ECGDICOMReader()
    ecgtable = ECGTable(ecgreader, path_list=dicom_path_list, )
    res = ecgtable(verbose=False).write_ecg(parsed_config=parser_dicom, write_failed=False,
                                            file_type='numpy')
    print('Directory content:')
    for file_name in os.listdir(getattr(res, 'target_path')):
        print(' - {}'.format(file_name))
    # reading the data in - the temp dir will otherwise disappear
    with np.load(os.path.join(getattr(res, 'target_path'), 'ecg_data.npz')) as data:
        meta = data['MetaData']
        median = data['MedianBeats']
Directory content:
- ecg_data.npz
- header_metadata.txt
[7]:
# The metadata
meta
[7]:
array([[1.20e+01, 4.88e+00, 4.88e+00, 5.00e+02, 5.00e+03, 6.00e+02,
4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00,
4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00, 4.88e+00,
1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
1.00e+01],
[1.20e+01, 5.00e+00, 5.00e+00, 5.00e+02, 5.00e+03, 5.12e+02,
5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00,
5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00, 5.00e+00,
1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
1.00e+01]])
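Each row of this array holds 43 values, which matches the number of numerical fields in the metadata table (identifiers, units, manufacturer, and similar text fields are gone). The following is a minimal sketch of how such a reduction could be done in pandas while keeping the column names as a separate header. It is an illustration on synthetic data, not the ECGprocess implementation, and the variable names are assumptions.

```python
import pandas as pd

# Synthetic metadata mixing text and numerical columns (values illustrative).
meta = pd.DataFrame(
    {
        "unique identifier": ["ecg_a", "ecg_b"],
        "number of leads": [12, 12],
        "resolution unit (waveforms)": ["uV", "uV"],
        "resolution (waveforms)": [4.88, 5.0],
        "duration (sec)": [10.0, 10.0],
    }
)

# Keep only the numerical columns and remember their names as the header.
numeric = meta.select_dtypes(include="number")
header = list(numeric.columns)
array = numeric.to_numpy(dtype=float)

print(header)
print(array)
```

Storing the header next to the array keeps the column meaning recoverable after the text columns have been dropped.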
[8]:
# The medianbeats - note the trailing NaN values in the second entry, which has fewer samples per lead (512 versus 600)
median
[8]:
array([[[ -4.88, -4.88, -4.88, ..., 43.92, 53.68, 0. ],
[ 39.04, 34.16, 34.16, ..., 102.48, 102.48, 0. ],
[ 43.92, 39.04, 39.04, ..., 58.56, 48.8 , 0. ],
...,
[ 43.92, 43.92, 39.04, ..., 24.4 , 19.52, 0. ],
[ 39.04, 39.04, 39.04, ..., 34.16, 34.16, 0. ],
[ 29.28, 29.28, 29.28, ..., 39.04, 34.16, 0. ]],
[[-30. , -30. , -20. , ..., nan, nan, nan],
[ 20. , 15. , 10. , ..., nan, nan, nan],
[ 55. , 50. , 35. , ..., nan, nan, nan],
...,
[ 40. , 40. , 40. , ..., nan, nan, nan],
[-40. , -40. , -40. , ..., nan, nan, nan],
[ 0. , 0. , -10. , ..., nan, nan, nan]]],
dtype=float32)
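The trailing `nan` values suggest that recordings of unequal length are padded so they fit into a single array; this is an inference from the output above rather than a documented guarantee. Under that assumption, the sketch below shows how NaN padding can be built and handled with numpy on synthetic data: counting the valid samples per recording and using NaN-aware reductions.

```python
import numpy as np

# Two synthetic single-lead recordings of unequal length (illustrative values).
beats = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0])]

# Pad to the longest recording with NaN, giving shape (n_files, n_samples).
n_max = max(len(beat) for beat in beats)
padded = np.full((len(beats), n_max), np.nan, dtype=np.float32)
for row, beat in enumerate(beats):
    padded[row, : len(beat)] = beat

# Count the valid (non-NaN) samples per recording.
valid_lengths = (~np.isnan(padded)).sum(axis=1)

# NaN-aware reductions ignore the padding.
means = np.nanmean(padded, axis=1)

print(valid_lengths)
print(means)
```

Plain reductions such as `np.mean` would return `nan` for the padded recording, so the NaN-aware variants are the safer default for arrays like `median`.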
Performing on the fly data engineering and/or QC
The tabular class comes equipped with entry points to perform data engineering tasks on each individual file. The parameters engineer_meta, engineer_wave, and engineer_median take user-defined functions (callables) that can alter the data passing through these objects. The wave and median callables furthermore have access to the metadata, which is passed internally as meta_dict in the function kwargs. The meta callable is applied first, so that the updated metadata can be used in the engineer_wave or engineer_median functions, for example to confirm or reorder the leads. These entry points allow for extensive pre-processing of both the signal data and metadata: essentially anything one can do on a single file can be conducted at this stage.
To illustrate the functionality we will create a custom function that ensures all signal data has a resolution of 5 uV. This example performs a single adjustment; in a real application one might want to chain multiple steps together. The engineer_meta function can be used to perform additional validation steps, for example filtering out DICOM files with specific software versions.
[9]:
import sys

# User-defined function to scale the ECG signal to a resolution of 5 uV.
def signal_standardise_res(
        signals: dict[str, np.ndarray], target_resolution: float = 5.,
        verbose: bool = False, **kwargs,
) -> dict[str, np.ndarray]:
    """
    Standardise the resolution of the signals.

    Parameters
    ----------
    signals : `dict` [`str`, `np.ndarray`]
        A dictionary mapping channel names (strings) to waveform arrays.
    target_resolution : `float`, default 5
        The target resolution.
    verbose : `bool`, default False
        If True, prints additional debug information about the correction
        process.
    **kwargs
        Additional keyword arguments, which must include a dictionary under
        the key `meta_dict`. This dictionary should contain:
        - wave_channel_sens_<i> : float
            The wave channel sensitivity/resolution for channel i.

    Returns
    -------
    dict [`str`, `np.ndarray`]
        The input signals dictionary with corrected signals.

    Raises
    ------
    KeyError
        If `meta_dict` is not found in **kwargs.
    ValueError
        If a signal is neither None nor a `np.ndarray`.
    """
    # constants - keys with this prefix should be in meta_dict
    sens = 'wave_channel_sens_'
    # the algorithm
    if 'meta_dict' not in kwargs:
        raise KeyError("meta_dict should be included as kwargs")
    meta_dict = kwargs['meta_dict']
    for i, (k, v) in enumerate(signals.items()):
        # skip if None - the entry is left unchanged
        if v is None:
            continue
        # confirming this is a np.ndarray
        if not isinstance(v, np.ndarray):
            raise ValueError("signals should be supplied as np.ndarray's")
        if verbose:
            if target_resolution/meta_dict[sens+str(i)] != 1.0:
                print(f'Scaling lead `{k}` by factor '
                      f'`{target_resolution/meta_dict[sens+str(i)]}`.',
                      file=sys.stdout
                      )
        # rescale channel i from its current resolution to the target
        signals[k] = signal_resolution(
            v, resolution_current=meta_dict[sens+str(i)],
            resolution_target=target_resolution,
        )
    return signals
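The arithmetic behind the rescaling can be illustrated without the library. The sketch below uses a hypothetical stand-in, `rescale_resolution`, which is not the ECGprocess `signal_resolution` implementation; it assumes the rescaling divides by the current resolution and multiplies by the target. That assumption is consistent with the tables in this notebook, where the first file's voltages 29.28, 43.92, and 48.80 (resolution 4.88) become 30.0, 45.0, and 50.0 at a resolution of 5.

```python
import numpy as np


def rescale_resolution(signal, resolution_current, resolution_target):
    """Hypothetical stand-in, not the ecgprocess implementation."""
    # Convert voltages to raw counts, then back to voltages at the new step.
    counts = np.asarray(signal, dtype=float) / resolution_current
    return counts * resolution_target


# First voltages of lead I in the first example file (resolution 4.88 uV).
original = np.array([0.0, 29.28, 43.92, 48.80])
rescaled = rescale_resolution(original, 4.88, 5.0)
print(np.round(rescaled, 2))
```

In this view a voltage of 29.28 at a step of 4.88 corresponds to 6 counts, and 6 counts at a step of 5 gives 30.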
[10]:
reader = ECGDICOMReader()
ecgtable = ECGTable(reader, path_list=dicom_path_list,
                    engineer_wave=signal_standardise_res,
                    engineer_median=signal_standardise_res)
table = ecgtable().get_table(parsed_config=parser_dicom, )
# Note that the voltage values of the first file are affected; the second file already has a resolution of 5 uV.
table.WaveForms
[10]:
| | key | Sampling sequence | Lead | Voltage | Signal type |
|---|---|---|---|---|---|
| 0 | 2.25.269796857626990821315969488216511468638 | 0 | I | 0.0 | WaveForms |
| 1 | 2.25.269796857626990821315969488216511468638 | 1 | I | 0.0 | WaveForms |
| 2 | 2.25.269796857626990821315969488216511468638 | 2 | I | 30.0 | WaveForms |
| 3 | 2.25.269796857626990821315969488216511468638 | 3 | I | 45.0 | WaveForms |
| 4 | 2.25.269796857626990821315969488216511468638 | 4 | I | 50.0 | WaveForms |
| ... | ... | ... | ... | ... | ... |
| 119995 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4995 | aVR | -195.0 | WaveForms |
| 119996 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4996 | aVR | -190.0 | WaveForms |
| 119997 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4997 | aVR | -190.0 | WaveForms |
| 119998 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4998 | aVR | -185.0 | WaveForms |
| 119999 | 1.3.6.1.4.1.40744.65.2218354498692802781165553... | 4999 | aVR | -185.0 | WaveForms |
120000 rows × 5 columns
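When several adjustments are needed, they can be combined into a single callable before it is handed to `engineer_wave` or `engineer_median`. The sketch below is one possible way to do this, assuming every step follows the same convention as `signal_standardise_res`: it takes the signals dictionary plus keyword arguments (including `meta_dict`) and returns a signals dictionary. The helper `chain_steps` and the two example steps are hypothetical and not part of ECGprocess; for brevity they do not handle `None` signals.

```python
from typing import Callable

import numpy as np

Step = Callable[..., dict]


def chain_steps(*steps: Step) -> Step:
    """Compose callables that share the (signals, **kwargs) -> signals shape."""
    def chained(signals: dict, **kwargs) -> dict:
        # Apply each step in order, forwarding the same kwargs (e.g. meta_dict).
        for step in steps:
            signals = step(signals, **kwargs)
        return signals
    return chained


def remove_offset(signals: dict, **kwargs) -> dict:
    # Subtract the mean of each lead.
    return {k: v - np.mean(v) for k, v in signals.items()}


def clip_extremes(signals: dict, limit: float = 100.0, **kwargs) -> dict:
    # Clip each lead to the range [-limit, limit].
    return {k: np.clip(v, -limit, limit) for k, v in signals.items()}


pipeline = chain_steps(remove_offset, clip_extremes)
result = pipeline({"I": np.array([0.0, 10.0, 500.0])}, meta_dict={})
print(result["I"])
```

Because the composed callable has the same shape as its parts, it could in principle be supplied wherever a single engineering function is accepted, although that usage is not tested here.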