Preventing the data acquisition process from taking place
$$p(K \mid X_{\text{obs}}, X_{\text{miss}}, \xi) = p(K \mid X_{\text{obs}}, \xi). \tag{4.4}$$
Examples of data MAR occur in the following cases: while answering questions in a survey, project managers may skip those related to small projects more often than those related to larger projects, because they may remember fewer details about smaller projects [154]; a sensor occasionally fails due to power outages, preventing the data acquisition process from taking place [68]. In both cases, the cause of the missingness is not directly tied to the variables containing the MVs but rather to other external influences. In the first case the MAR assumption can apply, because the predictor “project size” explains the likelihood that a value is missing [154]. Similarly, the power outages in the second case explain why the sensor data is missing.
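The survey example can be made concrete with a small simulation. The following is a minimal sketch, assuming NumPy is available; the variable names (`project_size`, `effort`), the threshold, and the missingness probabilities are illustrative assumptions, not values taken from [154].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical fully observed predictor: project size (arbitrary units).
project_size = rng.uniform(1.0, 100.0, size=n)
# Hypothetical variable that will contain MVs, generated independently of size.
effort = rng.normal(50.0, 10.0, size=n)

# MAR: the probability of a missing value depends only on the observed
# predictor (small projects are skipped more often), never on effort itself.
small = project_size < 50.0
p_missing = np.where(small, 0.40, 0.10)
is_missing = rng.random(n) < p_missing

# Mark the MVs with NaN in the variable that is only partially observed.
effort_obs = effort.copy()
effort_obs[is_missing] = np.nan

rate_small = is_missing[small].mean()
rate_large = is_missing[~small].mean()
print(f"missing rate, small projects: {rate_small:.2f}")
print(f"missing rate, large projects: {rate_large:.2f}")
```

In this sketch the missingness is fully explained by the observed predictor, which is the situation the MAR assumption describes.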
If the data is neither MAR nor MCAR, it must be considered NMAR. This is the case when the pattern of data missingness depends on the missing variables themselves. A typical example of data NMAR is a personal survey involving private questions, whose nature will most likely leave them unanswered. In this scenario, unless the survey can reliably measure variables that are strongly related to those containing MVs, the MAR and MCAR assumptions are violated and we must consider that the data is NMAR [154].
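To illustrate why NMAR is problematic, the sketch below simulates a sensitive numeric answer whose chance of being withheld grows with its own value. It is an illustration under stated assumptions: the logistic missingness model and its slope are hypothetical choices, not part of the cited works.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical sensitive answer, standardized to zero mean and unit variance.
answer = rng.normal(0.0, 1.0, size=n)

# NMAR: the probability of a missing value depends on the (unobserved)
# value itself; larger answers are withheld more often.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * answer))
is_missing = rng.random(n) < p_missing

true_mean = answer.mean()
observed_mean = answer[~is_missing].mean()
print(f"mean of all answers:      {true_mean:.2f}")
print(f"mean of observed answers: {observed_mean:.2f}")
```

Because no observed variable explains the missingness, the mean computed from the observed answers is systematically lower than the mean of all answers, and nothing in the observed data reveals the size of this bias.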
When data is NMAR, valuable information is lost and there is no general method for handling MVs properly. Otherwise (i.e., when data is MAR or MCAR), the missing data mechanism is termed ignorable and its cause can simply be ignored, which simplifies the methods for handling MVs [68].
Fig. 4.1 Overview of the types of techniques for handling MVs in ML
Since many algorithms cannot directly handle MVs, a common practice is to rely on data preprocessing techniques. Usually, this is accomplished by using imputation or simply by removing instances (case deletion) and/or features containing MVs [68, 228, 142, 154, 116, 10, 105]. A review of the methods and techniques to deal with this problem, including a comparison of some well-known approaches, can be found in Laencina et al. [68].
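The two preprocessing strategies mentioned above can be sketched in a few lines. This is a minimal example, assuming NumPy and a toy matrix in which `np.nan` marks an MV; mean imputation is used here only as one simple imputation choice, not as the method recommended by the cited works.

```python
import numpy as np

# Toy data matrix: rows are instances, columns are features; NaN marks an MV.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

missing = np.isnan(X)

# Case deletion: keep only the instances that contain no MV.
X_deleted = X[~missing.any(axis=1)]

# Mean imputation: replace each MV with the mean of the observed values
# of the same feature.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(missing, col_means, X)

print(X_deleted)
print(X_imputed)
```

Case deletion discards half of this toy dataset, whereas imputation keeps every instance at the cost of introducing estimated values.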