Computer Methods and Programs in Biomedicine 108 (2012) 1247–1254
Journal homepage: www.intl.elsevierhealth.com/journals/cmpb
WIMP: Web server tool for missing data imputation
D. Urda (a), J.L. Subirats (a), P.J. García-Laencina (b), L. Franco (a), J.L. Sancho-Gómez (c), J.M. Jerez (a)
(a) Departamento de Lenguajes y Ciencias de la Computación, ETSI Informática, University of Málaga, Spain
(b) Centro Universitario de la Defensa de San Javier, MDE-UPCT, Spain
(c) Departamento de Tecnologías de la Información y las Comunicaciones, Universidad Politécnica de Cartagena, Spain
Article info
Article history:
Received 11 November 2011
Received in revised form 30 July 2012
Accepted 13 August 2012
Keywords: Imputation; Missing data; Machine learning; Web application
Abstract
The imputation of unknown or missing data is a crucial task in the analysis of biomedical datasets. There are several situations where it is necessary to classify or identify instances given incomplete vectors, and the existence of missing values can considerably degrade the performance of the algorithms used for classification or recognition. The task of learning accurately from incomplete data raises a number of issues, some of which have not been completely solved in machine learning applications. In this sense, effective missing value estimation methods are required. Different methods for missing data imputation exist, but most of the time selecting the appropriate technique involves testing several methods, comparing them and choosing the right one. Furthermore, applying these methods is in most cases not straightforward, as they involve several technical details, and in particular cases, such as when dealing with microarray datasets, applying them requires huge computational resources. As far as we know, there is no public software application that provides the computing capabilities required for carrying out the task of data imputation. This paper presents a new public tool for missing data imputation that is attached to a computer cluster in order to execute computationally demanding tasks. The software WIMP (Web IMPutation) is a publicly available website where registered users can create, execute, analyze and store their simulations related to missing data imputation.
© 2012 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
In the fields of pattern recognition and machine learning, it is generally accepted that two important factors affecting prediction accuracy, or the percentage of correct classifications, are the quality of the dataset and the appropriate selection of a learning algorithm. Regarding data quality, it is very common to find incomplete datasets containing unknown or missing data.
http://www.icb.uma.es/wimp
Corresponding author. Tel.: +34 952132847.
E-mail addresses: durda@lcc.uma.es (D. Urda), jlsubirats@lcc.uma.es (J.L. Subirats), pedroj.garcia@cud.upct.es (P.J. García-Laencina), lfranco@lcc.uma.es (L. Franco), josel.sancho@upct.es (J.L. Sancho-Gómez), jja@lcc.uma.es (J.M. Jerez).
URLs: http://www.cud.upct.es/ (P.J. García-Laencina), http://www.lcc.uma.es (J.M. Jerez).
Several strategies can be applied to deal with incomplete datasets. A simple and common strategy is to ignore missing values, which reduces the size of the useful dataset. Other valid approaches use supervised learning or statistical analysis to impute the missing data so that all the samples available in the dataset can be used [1,9,16,18,25,27,23,22,21]. In fact, most biomedical studies have focused on developing missing value estimation methods for incomplete biomedical or microarray datasets [35,38,42,20,2,12,19,11,30,5,39,3,36,41,7,10].
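The two families of strategies mentioned above can be contrasted with a minimal sketch. It is not taken from WIMP: the toy data, the function names `drop_incomplete` and `impute_column_means`, and the choice of column-mean imputation as the simplest statistical method are all assumptions made for illustration.

```python
from statistics import mean

# Toy dataset (made-up numbers): rows are samples, columns are features,
# and None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, None],
    [10.0, 11.0, 12.0],
]


def drop_incomplete(rows):
    """Strategy 1: ignore every sample that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column_means(rows):
    """Strategy 2: replace each missing value with the mean of the observed
    values in the same column, so that no sample is discarded.
    Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


complete_cases = drop_incomplete(data)
imputed = impute_column_means(data)

print(len(complete_cases), "of", len(data), "samples survive deletion")
print(imputed)
```

In this toy case deletion keeps only two of the four samples, whereas imputation keeps all four at the cost of introducing estimated values.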
0169-2607/$ see front matter © 2012 Elsevier Ireland Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.cmpb.2012.08.006


Specifically, in the area of bioinformatics, several research groups are working on using the information contained in microarray datasets to classify either the diagnosis or the prognosis of certain diseases according to gene expression signatures obtained from patients' samples. Unfortunately, a common problem is the quality of microarray datasets, especially in studies using microarray gene expression data. Most of these datasets can be found and downloaded from public websites (Gene Expression Omnibus¹ or Stanford Microarray Database²) and, in most cases, they contain incomplete samples due to unknown or missing data. Incomplete microarray data can be caused by administrative errors, defective techniques or occasional technology failures.
A common approach to predicting the prognosis of a certain disease is the use of machine learning algorithms and, as mentioned before, the performance of these algorithms strongly depends on the quality of the data. In addition to this drawback, researchers face another disadvantage when working with microarray datasets: the small number of available instances, typically around 40–80 samples. Thus, simply ignoring samples with missing values in microarray datasets is not recommended, both because of the small number of samples available and because this simple technique can introduce substantial biases in the study, especially when missing data is not distributed randomly.
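The bias caused by non-random missingness can be seen in a small worked example. This is only a sketch under stated assumptions: the expression values are made up, and the missingness mechanism (a measurement fails whenever the value exceeds `THRESHOLD`) is a hypothetical one chosen to make the effect obvious.

```python
from statistics import mean

# Hypothetical expression values of one gene across 10 samples (made-up numbers).
true_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Assumed non-random mechanism: the measurement is lost whenever the
# true value exceeds this threshold, so high values are never observed.
THRESHOLD = 7.0
observed = [v if v <= THRESHOLD else None for v in true_values]

# Ignoring the incomplete samples keeps only the low values.
kept = [v for v in observed if v is not None]

true_mean = mean(true_values)
complete_case_mean = mean(kept)

print("samples kept:", len(kept), "of", len(true_values))
print("true mean:", true_mean)
print("mean after ignoring missing samples:", complete_case_mean)
```

Here the deletion strategy both shrinks an already small dataset and shifts the estimated mean downwards, because the discarded samples are systematically different from the retained ones.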
Therefore, the difficulties of dealing with microarray datasets make the imputation of unknown or missing data an important task to consider when applying a machine learning algorithm. Quinlan [24] shows that missing values in either the training data or the test data affect the prediction accuracy of learned classifiers. Moreover, Luque et al. [17] show the benefits of having a good quality dataset over having a good learning algorithm. Therefore, the imputation of unknown or missing data arises in the area of bioinformatics as a problem that must be dealt with, especially for this kind of dataset, as described in [37,13,14,33,26,4,32,15,31,28].
We next describe some of the standard problems that arise when imputing unknown or missing data in a research study: (i) good knowledge must be acquired of the different imputation methods in order to choose those that best fit the type of dataset used in the study; (ii) a valid implementation of these methods should be available; (iii) applying the selected imputation methods may need heavy computational resources, a requirement that may be difficult to satisfy for small research groups.
Our proposal in this paper is to help overcome the points described above by providing a public website tool, named WIMP (Web IMPutation), that includes several imputation methods and offers the scientific community the possibility of applying them to the datasets involved in their studies. Furthermore, WIMP incorporates on its backend a powerful computational cluster, enabling users to execute the selected imputation methods on a previously loaded dataset in a reasonable amount of time.
¹ http://www.ncbi.nlm.nih.gov/geo/
² http://smd.stanford.edu/
Of course, widely used statistical software such as SPSS³ or SAS⁴ also incorporates some imputation methods. Nevertheless, SPSS and SAS are commercial software with licenses that not every clinician or researcher can afford and, above all, these tools only provide imputation methods based on statistical models [40,34,8], not on machine learning techniques.
The present paper is structured as follows. Section 2 describes the system architecture of WIMP, including descriptions of the database used, the computational cluster that resides on the backend of WIMP, and the imputation methods that are currently available. Section 3 presents a case study where missing data imputation is needed and shows how it can be solved using the resources available in WIMP. Finally, in Section 4 we provide some conclusions and further improvements that could be made in relation to the present work.
2. WIMP application
The WIMP application software has been developed using the latest .NET technology [29]. WIMP is a public website application that enables registered users to log in and impute missing data using several available imputation methods. The main goal of WIMP is to let potential users focus their efforts on carrying out their experiments instead of dedicating resources to searching for, understanding and applying the different imputation methods. Also, in order to speed up the simulations involved in the imputation process, a computer cluster is available on the backend of the WIMP application (thanks to the Computational Intelligence in Biomedicine research group of the University of Málaga), so the different imputation methods can be executed relatively fast.
Basically, whenever a user plans and launches a new simulation through the website environment, it is actually executed on the computational cluster in the background as a separate process. Finally, once the process ends, the user is informed via email and obtains the results of the imputation method applied to the previously loaded dataset.
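The submit-and-notify workflow just described can be sketched as follows. This is not WIMP's code, which is built on .NET and a Linux cluster: here a local thread pool stands in for the cluster, a list stands in for outgoing email, and the names `run_imputation`, `notify` and `launch_simulation` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the outgoing mail queue; a real system would send an email.
sent_notifications = []


def run_imputation(dataset):
    """Hypothetical job: fill each missing value (None) with the mean
    of the observed values. Assumes at least one value is observed."""
    observed = [v for v in dataset if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in dataset]


def notify(user_email, future):
    """Runs when the job ends; records the message that would be emailed."""
    sent_notifications.append((user_email, future.result()))


def launch_simulation(executor, user_email, dataset):
    """Submit the job and return immediately, so the caller is not blocked."""
    future = executor.submit(run_imputation, dataset)
    future.add_done_callback(lambda f: notify(user_email, f))
    return future


# The thread pool is an assumed stand-in for the computational cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    job = launch_simulation(pool, "user@example.org", [1.0, None, 3.0])

# Leaving the with-block waits for all submitted jobs to finish.
print(sent_notifications)
```

The point of the design is that launching a simulation returns at once, and the result reaches the user through a completion callback rather than through a request that stays open while the job runs.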
The complete system consists of:
- A website application developed with the latest .NET technology, providing users with a friendly and usable environment in which to interact with the system.
- An SQL Server 2008 database containing all the necessary information concerning users, projects, simulations, files, etc.
- A computational cluster composed of 27 quad-core PC nodes with 4 GB of RAM each, all of them running the Linux operating system.
- A web service implementation that enables communication between either the website application or the computational cluster and the system database, granting them access to the information stored in it.
³ http://www-01.ibm.com/software/es/analytics/spss/
⁴ http://www.sas.com/