Journal of Advanced Research
Original article
Advanced DNA fingerprint genotyping based on a model developed from
real chip electrophoresis data
Helena Skutkova a,*, Martin Vitek a, Matej Bezdicek b, Eva Brhelova b, Martina Lengerova b
a Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
b Department of Internal Medicine, Hematology and Oncology, Masaryk University and University Hospital Brno, Cernopolni 212/9, 662 63 Brno, Czech Republic
Highlights
• Mapping chip electrophoresis distortion based on real data.
• Determining the transformation function for the adaptive correction of band size deviation.
• Improving the ability to distinguish closely related DNA fingerprints.
• Using hierarchical clustering to adjust the global band position.
• Genotyping all DNA fingerprints from multiple runs at once.
Article info
Article history:
Received 19 October 2018
Revised 6 January 2019
Accepted 10 January 2019
Available online 25 January 2019
Keywords:
DNA fingerprinting
Automated chip capillary electrophoresis
Band matching
Gel sample distortion
Pattern recognition
Abstract
Large-scale comparative studies of DNA fingerprints prefer automated chip capillary electrophoresis over conventional gel planar electrophoresis due to the higher precision of the digitalization process. However, the determination of band sizes is still limited by the device resolution and sizing accuracy. Band matching, therefore, remains the key step in DNA fingerprint analysis. Most current methods evaluate only the pairwise similarity of the samples, using heuristically determined constant thresholds to evaluate the maximum allowed band size deviation; unfortunately, that approach significantly reduces the ability to distinguish between closely related samples. This study presents a new approach based on global multiple alignments of bands of all samples, with an adaptive threshold derived from the detailed migration analysis of a large number of real samples. The proposed approach allows the accurate automated analysis of DNA fingerprint similarities for extensive epidemiological studies of bacterial strains, thereby helping to prevent the spread of dangerous microbial infections.
© 2019 The Authors. Published by Elsevier B.V. on behalf of Cairo University. This is an open access article under the CC BY-NC-ND license.
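
The contrast the abstract draws between a constant matching threshold and a global grouping of bands can be illustrated with a small sketch. This is not the authors' implementation: the function names, the band sizes, the threshold values, and the logarithmic transform are all assumptions made for illustration. The logarithm is only a stand-in for an empirical transformation function; it is chosen because it turns a constant relative deviation into a constant absolute one. The sketch pools band sizes from all samples, groups them by single-linkage clustering in one dimension (the sorted values are split wherever the gap between neighbours exceeds one fixed threshold), and takes the median of each group as the common band position.

```python
import math
from statistics import median


def cluster_bands(sizes_bp, threshold, transform=math.log10):
    """Group pooled band sizes by 1D single-linkage clustering.

    Assumption: `transform` is a placeholder for an empirical transformation
    function; log10 is used here only as an illustrative stand-in.
    """
    ordered = sorted(sizes_bp)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        gap = transform(current) - transform(previous)
        if gap > threshold:
            clusters.append([current])    # gap too large: start a new band group
        else:
            clusters[-1].append(current)  # close enough: same band group
    return clusters


def corrected_positions(clusters):
    """Use the median size of each group as its common band position."""
    return [median(group) for group in clusters]


def identity(size_bp):
    """No transformation: distances stay in base pairs."""
    return size_bp


# Hypothetical band sizes (bp) pooled from several samples: two distinct small
# bands (about 250 and 300 bp), one widely scattered large band (about 3,000 bp)
# and one further band at 3,600 bp.
pooled = [248, 252, 255, 300, 305, 2950, 3000, 3080, 3600]

# Constant thresholds in bp (hypothetical values): one is too wide for the
# small bands, the other too narrow for the large band.
fixed_wide = cluster_bands(pooled, threshold=100, transform=identity)
fixed_narrow = cluster_bands(pooled, threshold=40, transform=identity)

# One fixed threshold after the transform (hypothetical 10 % relative tolerance).
adaptive = cluster_bands(pooled, threshold=math.log10(1.10))

print(len(fixed_wide), len(fixed_narrow), len(adaptive))
print(corrected_positions(adaptive))
```

With these illustrative values, no constant threshold in base pairs recovers the four intended bands. A threshold wide enough to hold the scattered 3,000 bp band together also merges the 250 and 300 bp bands, and a threshold narrow enough to separate those two splits the 3,000 bp band. After the transform, a single fixed threshold yields the four groups, with median positions of 252, 302.5, 3000 and 3600 bp.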
Abbreviations: DBSCAN, density-based spatial clustering of applications with noise; DTW, dynamic time warping; ESBL, extended spectrum beta-lactamases; KLPN, Klebsiella pneumoniae; MALDI-TOF, matrix-assisted laser desorption ionization time of flight; rep-PCR, repetitive element palindromic polymerase chain reaction; RMSE, root mean squared error; R-square, ratio of the sum of squares; SD, standard deviation; SLINK, single linkage; SSE, sum of squares due to error; UPGMA, unweighted pair group method with arithmetic mean.
Peer review under responsibility of Cairo University.
* Corresponding author: H. Skutkova.
2090-1232/© 2019 The Authors. Published by Elsevier B.V. on behalf of Cairo University. This is an open access article under the CC BY-NC-ND license.
H. Skutkova et al. / Journal of Advanced Research 18 (2019) 9–18

Introduction

DNA fingerprinting methods are commonly used for typing bacterial strains, and electrophoretic separation methods are used for visualizing and evaluating the amplification results. Although standard planar electrophoresis (on an agarose gel) is still more commonly used than its automated equivalents, the popularity of modern automated chip electrophoresis is increasing, especially in the case of extensive comparative studies [1–4]. The main advantages are the elimination of the gel image digitalization process, the absence of sample distortion caused by the non-homogeneity of the electromagnetic field (smile effect), the simple adaptation of sample ranges from multiple electrophoretic runs, and the increased speed of the electrophoretic runs. Thus, the size of the DNA fragments can be obtained directly by using objective software analysis, in contrast to subjective estimates of the size from a low-quality image by a human operator. However, even automated chip electrophoresis has limited accuracy. For example, the Agilent 2100 Bioanalyzer System provides catalogue values of ±10% or ±15% sizing accuracy, depending on the kits and reagents used. The sizing resolution is also limited and dependent on the sizing range; for the DNA 7500 Kit from Agilent, the resolution is 5% in the 100–1,000 bp range and 15% in the 1,000–7,500 bp range. Thus, the resulting fragment size values are not completely accurate, and their deviation is not constant over the measured range. Although the deviation is smaller than that obtained in the subjective estimation of size from standard planar electrophoresis gel images, its existence and inconsistency still complicate subsequent comparative analyses, such as phylogeny reconstruction. The basis of these methods is the evaluation of the similarity between two sample lines (fingerprint patterns), depending on the presence/absence of bands of the same size. Because of the inaccuracy in measurements, it is difficult to assess whether two bands are the same or correspond to DNA fragments of different lengths. This problem has not been adequately addressed, as evidenced by the scarcity of information in the literature.

The first reason is that planar electrophoresis is more commonly used than chip electrophoresis because the former is less expensive. Thus, DNA fingerprint gel images are still being analysed using tools, such as PyElph [5], GelClust [6], and GelJ [7], that focus primarily on image preprocessing tasks [8,9]. The similarity of two bands is evaluated trivially. Most often, the bands are identified as the same size if their deviation does not exceed the permitted constant threshold. The identification of bands of the same size or their alignment is generally performed using pairwise alignment. A more advanced solution can be found in the software GELect [10], where a density-based clustering method (DBSCAN) is used to identify band cluster centroids from all samples; however, it still uses a heuristically set constant threshold. Moreover, another decision parameter, the minimum number of samples containing bands, causes incorrect classification of unique samples. Another way to adapt band positions in gel images obtained from classic planar electrophoresis is the use of the dynamic time warping (DTW) method, which adaptively re-samples 1D signal representations of particular lines [11]. This method does not use a constant threshold for band position correction but requires a complete signal representation from raw data.

The second reason for the insufficient examination of band alignment in chip electrophoresis is that the processing of chip electrophoresis DNA fingerprinting data is almost exclusively realized through complex and expensive software platforms, such as BioNumerics (Fingerprint Data module or DiversiLab genotyping application distributed by Applied Maths NV, BioMérieux, France). These tools are proprietary, and the principles of the methods used are not publicly available. According to the technical documentation from the company's website, the Fingerprint Data module uses a combination of nonlinear shift with fixed edges and global shift with linear stretch/compression for band position correction. Although the procedure is not described in detail, the shift correction is based on finding the highest correlation between samples. Since correlation describes the degree of linear dependence, this approach presumes a linear relationship between the deviation and band size. However, it can be assumed that the character of the dependence is not linear, because the sample mobility on the gel is not linearly dependent on band size.

In this study, a new method for the global alignment of band positions using an adaptive threshold is presented. For this purpose, a large number of DNA weight markers were measured to confirm that the dependence between band size deviation (shift) and band size (band position) is neither constant nor linear. Based on these measurements, an empirical model of band size deviation was derived, which serves as a transformation function that adapts band size deviation to an approximately constant value across the measured range. It enables the use of hierarchical cluster analysis with one fixed threshold to identify bands of the same size in all samples without a pre-defined number of clusters or of objects in the clusters. The identification accuracy of the same bands was also verified on DNA weight markers, where the correct band size values are known. Finally, the designed method was tested in a study of the repetitive element palindromic polymerase chain reaction (rep-PCR) genotyping of 60 bacterial strains and compared with the standard professional tool, the Fingerprint Data module in BioNumerics.

Material and methods

Problem description

The method for the global detection of same-size bands in all gel samples consists of two key steps. The first step is the removal of the nonlinear dependence of band size deviations on the band size range. Samples with known DNA fragment sizes were used to describe the true accuracy of band size determination. DNA weight markers (ladders) appeared to be appropriate for that purpose. However, during the first measurement of one ladder type (12 samples of GeneRuler 1 kb DNA Ladder) in one run, considerable variation was observed in the sizes measured for the same band (Fig. 1). A regular user may not be aware of this variance, because it is not highly noticeable in an artificial gel image with a logarithmic scale (Fig. 1a) as produced by the software supplied with the chip electrophoresis device (2100 Bioanalyzer Expert Software distributed by Agilent Technologies, Inc., Santa Clara, California, USA). An illustration of the band positions in a graph with a linear band size axis (Fig. 1b) shows the variability of the same-size bands more clearly. Detailed images of the four different band size levels (Fig. 1c, d) and their statistical evaluation (Fig. 1e) prove that the variance in band size is not constant across the whole sample range and varies even between individual samples. The measurements were performed with different ladder types (different size ranges) and with variable distributions of samples across several runs to reveal the maximum degree of band size variability.

The second step of the proposed method is the global identification of same-size bands on the whole gel at once, instead of by individual local pairwise sample comparison. This step also allows us to obtain a corrected gel image (graphic representation of band sizes), where the “correct” band position is determined as the median size of the bands identified as the same. This process of positional adaptation of the same-size bands in multiple samples is comparable to multiple sequence alignment [12,13], known for its application to symbolic representations of DNA or protein sequences or to genomic signals [14]. It is a necessary step preceding the subsequent phylogenetic analysis of biological sequences [15–17]. Therefore, global multiple alignment of band positions is a suitable step preceding the comparative analysis of gel samples, such as the genotyping of bacterial rep-PCR profiles.

All data used in this article were obtained by chip capillary electrophoresis using the 2100 Bioanalyzer platform. All reactions