missing_value.md 20 KB


title: "缺省值处理"

output: html_document

knitr::opts_chunk$set(echo = TRUE)

基本概念

R中的缺失值(简称缺值)用NA(not available)表示。NA不是字符串也不是数指,是个代表缺值的符号。

x1 <- c(1, 4, 3, NA, 7)
x2 <- c("a", "B", NA, "NA")

x1中第四个元素为缺失值。x2中,第三个值为缺失值,第四个是字符串“NA”。 用is.na查看缺失值,is.na函数返回逻辑类型的数据,这里是vector。

is.na(x1)
is.na(x2)

R中缺失值与其他软件缺失值的区别

  • NA 不能与数值比较:

有些语言中,缺失值被赋予极大或极小值,超出研究范围中可能的限度,被任务是缺失值。如MODIS-EVI的-0.3被定义为缺失值。在这些环境下,“缺失值”其实还是个值,可以比较。而R中的NA只是个符号,不是个数,不能比较。下面是SAS代码,缺值与0比较是有效的。

data test; 
  input x y; 
  datalines;
2 .
3 4
5 1
6 0
;

data test; set test; 
  lowy = (y < 0);
  missy = (y = .);
run;

proc print data = test; run;

Obs    x    y    lowy    missy

 1     2    .      1       1
 2     3    4      0       0
 3     5    1      0       0
 4     6    0      0       0
 

We can try the equivalent in R.

x1 < 0
#x1 == NA

  • NA用于所有类型的缺失数据

其他语言或软件中,不同类型的缺失值表示方法可能不同,如用空字符串作为string类型的缺失值,用“."作为数值的缺值。在R中NA,string和numeric类型的缺值都用NA表示。

  • 非缺失值不能被解析为缺失

有些语言和软件允许具体数值被定义为”系统缺失值“,在进行数据处理的时候被作为缺失值处理,比如定义-9999作为缺失值。R中不可以这样做,如果从这些数据源中获取数据,必须对这些缺失值进行显式转换,转换为NA。 下面的代码把x1中数值为7的元素转换为NA

is.na(x1) <- which(x1 == 7)
x1

注意,R是elementwise的,不要忘了R是继承了lisp的思想,以及这种列表的概念给我们带来的方便和激动。

is.na既可以发现缺值,也能生成缺值。

R中的缺值选项(NA options in R)

na.action是R环境重要的配置选项之一,它控制着R在作数据处理时遇到NA时所采取默认行为。当函数中指定na.action参数的值时,将覆盖系统默认行为选项。使用options()函数显示当前默认的选项。主要有三种:

  • na.omit和na.exclude

返回删除了缺失值的对象。在以下预测和残差分析时两者会表现出差别。

  • na.pass

不做任何改变,返回原对象。

•na.fail

如果对象存在缺值,抛出异常,不返回对象,在无缺失值是返回原对象。

To see the na.action currently in in options, use getOption("na.action"). We can create a data frame with missing values and see how it is treated with each of the above.

g <- as.data.frame(matrix(c(1:5, NA), ncol = 2))

na.omit(g)

na.exclude(g)

na.fail(g)

na.pass(g)

数据分析中处理缺值

In some R functions, one of the arguments the user can provide is the na.action. For example, if you look at the help for the lm command, you can see that na.action is one of the listed arguments. By default, it will use the na.action specified in the R options. If you wish to use a different na.action for the regression, you can indicate the action in the lm command.

Two common options with lm are the default, na.omit and na.exclude which does not use the missing values, but maintains their position for the residuals and fitted values.

# use the famous anscombe data and set a few to NA
anscombe <- within(anscombe, {
    y1[1:3] <- NA
})
anscombe 
model.omit <- lm(y2 ~ y1, data = anscombe, na.action = na.omit)
model.exclude <- lm(y2 ~ y1, data = anscombe,
    na.action = na.exclude)

compare effects on residuals

resid(model.omit)
resid(model.exclude)

compare effects on fitted (predicted) values

fitted(model.omit)
fitted(model.exclude)

Using na.exclude pads the residuals and fitted values with NAs where there were missing values. Other functions do not use the na.action, but instead have a different argument (with some default) for how they will handle missing values. For example, the mean command will, by default, return NA if there are any NAs in the passed object.

mean(x1)

If you wish to calculate the mean of the non-missing values in the passed object, you can indicate this in the na.rm argument (which is, by default, set to FALSE).

mean(x1, na.rm = TRUE)

Two common commands used in data management and exploration are summary and table. The summary command (when used with numeric vectors) returns the number of NAs in a vector, but the table command ignores NAs by default.

summary(x1)
table(x1)

To see NA among the table output, you can indicate "ifany" or "always" in the useNA argument. The first will show NA in the output only if there is some missing data in the object. The second will include NA in the output regardless.

table(x1, useNA = "ifany")

table(1:3, useNA = "always")

Sorting data containing missing values in R is again different from other packages because NA cannot be compared to other values. By default, sort removes any NA values and can therefore change the length of a vector.

(x1s <- sort(x1))

length(x1s)

The user can specify if NA should be last or first in a sorted order by indicating TRUE or FALSE for the na.last argument.

sort(x1, na.last = TRUE)

No matter the goal of your R code, it is wise to both investigate missing values in your data and use the help files for all functions you use. You should be either aware of and comfortable with the default treatments of missing values or specifying the treatment of missing values you want for your analysis.

多重查补插补

安装: install.packages("mice")

步骤详细介绍:

函数mice()首先从一个包含缺失数据的数据框开始,然后返回一个包含多个(默认为5个)完整数据集的对象。

每个完整数据集都是通过对原始数据框中的缺失数据进行插补而生成的。 由于插补有随机的成分,因此每个完整数据集都略有不同。

然后, with()函数可依次对每个完整数据集应用统计模型(如线性模型或广义线性模型) ,

最后, pool()函数将这些单独的分析结果整合为一组结果。最终模型的标准误和p值都将准确地反映出由于缺失值和多重插补而产生的不确定性。

#多重插补法处理缺失,结果转存
library(lattice) #调入函数包
library(MASS)
library(nnet)
library(mice) #前三个包是mice的基础
imp=mice(inputfile,m=4) #4重插补,即生成4个无缺失数据集
fit=with(imp,lm(sales~date,data=inputfile))#选择插补模型
pooled=pool(fit)
summary(pooled)
result4=complete(imp,action=3)#选择第三个插补数据集作为结果

nhanes是个小店的数据集,含有缺失数据

nhanes
  • 第一步,数据填补
    数据填补是关键一步,对每一个缺失数据填补m(m>1)次。每次填补将产生一个完全数据集,以此类推,共产生m个完全数据集。
imp <- mice(nhanes)
imp
str(imp)

查询各变量实际插入的值

imp$imp$bmi

first completed data matrix

complete(imp)

imputation on mixed data with a different method per column


mice(nhanes2, meth=c('sample','pmm','logreg','norm'))

极大似然估计

极大似然估计在处理缺失值数据时又称作全息极大似然估计(Full Information Maximum Likelihood, FIML),意指使用所有观测变量的全部信息。FIML同ML分析完整数据过程一样,只是在计算单个对数似然值时使用全部完整信息而不考虑缺失值(公示见,Enders, 2006, 2010)。因此,ML处理缺失值并非使用替代值将缺失填补,而是使用已知信息采用迭代的方式估计参数。FIML在MCAR和MAR下产生无偏和有效的参数估计值。当在非正态分布时,FIML需要使用同完整数据时的参数校正统计量(S-Bχ2等,见本章),Bootstrapping也是有效的策略之一。

FIML法包含辅助变量的分析使用Graham (2003)提出的饱和相关模型(Saturated Correlates),即将辅助变量纳入模型中,同时允许辅助变量间、辅助变量与外生观测指标以及内生观测指标的测量误差相关。假设第5章PTSD例子的数据存在缺失值,同时假定性别和年龄为辅助变量,表9-5给了使用FIML估计DSM三因子结构的Mplus程序。

多重插补法(Multiple Imputation, MI)

该方法由Rubin(1987)最早提出,假设在数据随机缺失情况下,用两个或更多能反映数据本身概率分布的值来填补缺失值的方法。一个完整的MI包含三步:数据填补(Imputation Phase),计算(Analysis Phase)和汇总(Pooling Phase)。

  • 第二步,对每一个完全数据集采用标准的完全数据分析方法进行分析。

  • 第三步,将所每次分析得到的结果进行综合,得到最终的统计推断。

根据数据缺失机制、模式以及变量类型,可分别采用不同的方法进行填补:

回归、 预测均数匹配( predictive mean matching, PMM)、 趋势得分( propensity score, PS )、 Logistic回归、 判别分析 马尔可夫链蒙特卡罗( Markov Chain Monte Carlo, MCMC)

与FIML不同,MI采用填补缺失值的方法。MI要求数据缺失为MAR,如果采用ML估计同样要求数据分布符合多元正态分布假设,但研究发现违反正态性假设对MI参数精确性影响不大(Demirtas, Freels, & Yucel, 2008; Schafer, 1997)。另外一个影响估计精确性的因素是缺失率,Demirtas等(2008)的研究发现缺失率高达25%仍能得到精确的参数估计结果。

构建带有缺值的stfdf

# times.rds中已经插入缺失的俩时间记录“2013-07-28”和“2016-03-06”
times<-readRDS("data/imputated/times.rds")
length(times)
str(times)
times<-as.POSIXct(times)

# 从lake.stfdf中导入sities
lake.stfdf<-readRDS("data/lake.stfdf.rds")
sites<-lake.stfdf@sp
str(sites)

#从lake.imp.stfdf中导入数据,这些数据中已经在缺失的两个时间点插入NA
lake.imp.df<-readRDS("data/imputated/lake.imp.df.rds")
#用地点、时间排序
values<-lake.imp.df[order(lake.imp.df$time, lake.imp.df$sp.ID),]
values<-values$EVI
#构建lake.imp.stfdf并保存
IDs=paste("ID", 1:length(values), sep="_")
data<-data.frame(values=values,id=IDs)
lake.imp.stfdf<-STFDF(sites,times, data=data)
stplot(lake.stfdf[,"2007"])
saveRDS(lake.imp.stfdf, "data/imputated/lake.imp.stfdf.rds")

用原始数据插入缺值

###先构造一个一年的数据框,2013年7月28日数据缺失,用改点的全年数据进行插值填充

lake.imp.stfdf<-readRDS("data/imputated/lake.imp.stfdf.rds")
lake.imp.2013.stfdf<-lake.imp.stfdf[,"2013"]
lake.imp.2013.df<-as.data.frame(lake.imp.2013.stfdf)
is.na(lake.imp.2013.df)
str(lake.imp.2013.df)
#删除不需要的变量
lake.imp.2013.df<-lake.imp.2013.df[c(-1, -2,-5,-6,-8)]
#reshap成宽格式
lake.imp.2013.df$time<-as.factor(lake.imp.2013.df$time)
lake.imp.2013.df.wide<-reshape(lake.imp.2013.df, direction = "wide", v.names = "values", timevar = "sp.ID", idvar = "time")
#看看
head(lake.imp.2013.df.wide)
plot(lake.imp.2013.df.wide$values.taihu_101, type="b")


library(mice)
imp<-mice(lake.imp.2013.df.wide)


平均值后再插入缺失值

lake.df<-readRDS("data/lake.df.rds")
lake.df.mean<-tapply(lake.df$EVI, lake.df$time, mean)
lake.df.mean['2013-07-28'] = NA
lake.df.mean['2015-03-06'] = NA
str(lake.df.mean)
names(lake.df.mean)
order(names(lake.df.mean))
lake.df.mean<-lake.df.mean[order(names(lake.df.mean))]

# mice accept only matrix or data.frame as input data, and more than 1 column
lake.df.mean.df<-data.frame(t=names(lake.df.mean), m=lake.df.mean)
lake.df.mean.df
str(lake.df.mean.df)
mi<-mice(lake.df.mean.df)
str(mi)
# get the result
lake.df.mean.df<-complete(mi)
# coerced to numeric vector
m<-lake.df.mean.df$m
names(m)<-lake.df.mean.df$t
str(m)
attr(m, 'names')
lake.df.mean<-m
# we have completed the data, remove the tmp m
rm(m)

原始数据插入缺失值


lake.df<-readRDS("data/lake.df.rds")
head(lake.df)
#为插入缺失的2013-07-28的值,先拷贝它前面的2013-07-12的数据结构
m<-lake.df[lake.df$time=="2013-07-12",]
#编辑时间和EVI值
m$time<-as.POSIXct("2013-07-28")
m$EVI<-NA
#查看一下
head(m)
#插入数据框
m1<-lake.df[lake.df$time<"2013-07-28",]
m3<-lake.df[lake.df$time>"2013-07-28",]
lake.df<-rbind(m1,m,m3)
#插入2015-03-06的数据
m$time<-as.POSIXct("2015-03-06")
m1<-lake.df[lake.df$time<"2015-03-06",]
m3<-lake.df[lake.df$time>"2015-03-06",]
lake.df<-rbind(m1, m, m3)
# 查看一下
str(lake.df)
head(lake.df[lake.df$time=="2013-07-28",])
head(lake.df[lake.df$time=="2015-03-06",])
#删除多余的变量
lake.df$id=NULL
lake.df$timeIndex=NULL
lake.df$endTime=NULL
#更新year和yday变量
lake.df$year<-as.POSIXlt(lake.df$time)$year + 1900
lake.df$yday<-as.POSIXlt(lake.df$time)$yday + 1

#查找出现缺失值的记录index,缺失的值为NA还没有插补

lake.imp.stfdf<-readRDS("data/imputated/lake.imp.na.stfdf.rds")#数据来源于lake.df,添加了缺失的位置
sites<-lake.imp.stfdf@sp
times<-lake.imp.stfdf@time

#删除id列,有了这一列,缺值估计不通过,不知道什么原因
lake.imp.stfdf@data<-lake.imp.stfdf@data[-2]
#注意,如果不加"$EVI",得到的是data.frame的一维索引
na.index<-which(is.na(lake.imp.stfdf@data$EVI))
na.index
#查看具体记录
lake.imp.stfdf@data[na.index,]
# EVI基本不会受到前后1年数据的影响,为减少运算量,提高插值准确性,以每年为单位进行插值估计
#2002-09-30存在确实值
lake.imp.2002.stfdf<-na.spline(lake.imp.stfdf[,"2002"])
#查看有没有插补成功,结果为空list表示插补成功
which(is.na(lake.imp.2002.stfdf@data))
stplot(lake.imp.2002.stfdf)

#插补2013-07-28的缺失值
lake.imp.2013.stfdf<-na.spline(lake.imp.stfdf[,"2013"])
stplot(lake.imp.2013.stfdf)
# plot 2013-07-28 only
spplot(lake.imp.2013.stfdf[,"2013-07-28"])

#插补2015-03-06的缺失值
lake.imp.2015.stfdf<-na.spline(lake.imp.stfdf[,"2015"])
stplot(lake.imp.2015.stfdf)

#重构df,注意,要确保目标data.frame和要插入的data.frame有相同的结构(变量和名称)
lake.imp.df<-as.data.frame(lake.imp.stfdf)
lake.imp.df<-lake.imp.df[c(-5,-6)]
names(lake.imp.df)[5]<-"EVI"
lake.imp.df$year<-as.POSIXlt(lake.imp.df$time)$year+1900
lake.imp.df$yday<-as.POSIXlt(lake.imp.df$time)$yday+1

lake.imp.2002.df<-as.data.frame(lake.imp.2002.stfdf)
lake.imp.2002.df<-lake.imp.2002.df[c(-5,-6)]
names(lake.imp.2002.df)[5]<-"EVI"
lake.imp.2002.df$year<-as.POSIXlt(lake.imp.2002.df$time)$year+1900
lake.imp.2002.df$yday<-as.POSIXlt(lake.imp.2002.df$time)$yday+1
lake.imp.df[lake.imp.df$time=="2002-09-30",]<-lake.imp.2002.df[lake.imp.2002.df$time=="2002-09-30",]
head(lake.imp.df[lake.imp.df$time=="2002-09-30",])


lake.imp.2013.df<-as.data.frame(lake.imp.2013.stfdf)
lake.imp.2013.df<-lake.imp.2013.df[c(-5,-6)]
names(lake.imp.2013.df)[5]<-"EVI"
lake.imp.2013.df$year<-as.POSIXlt(lake.imp.2013.df$time)$year+1900
lake.imp.2013.df$yday<-as.POSIXlt(lake.imp.2013.df$time)$yday+1
lake.imp.df[lake.imp.df$time=="2013-07-28",]<-lake.imp.2013.df[lake.imp.2013.df$time=="2013-07-28",]
head(lake.imp.df[lake.imp.df$time=="2013-07-28",])

lake.imp.2015.df<-as.data.frame(lake.imp.2015.stfdf)
lake.imp.2015.df<-lake.imp.2015.df[c(-5,-6)]
names(lake.imp.2015.df)[5]<-"EVI"
lake.imp.2015.df$year<-as.POSIXlt(lake.imp.2015.df$time)$year+1900
lake.imp.2015.df$yday<-as.POSIXlt(lake.imp.2015.df$time)$yday+1
lake.imp.df[lake.imp.df$time=="2015-03-06",]<-lake.imp.2015.df[lake.imp.2015.df$time=="2015-03-06",]
head(lake.imp.df[lake.imp.df$time=="2015-03-06",])
#检查替换结果
na.index<-which(is.na(lake.imp.df$EVI))
na.index
lake.imp.df[na.index,]

#construct a sTFDF object
lake.imp.df<-lake.imp.df[c(-1,-2,-4)]
lake.imp.stfdf<-STFDF(sites, times, data = lake.imp.df)
saveRDS(lake.imp.stfdf, "data/imputated/lake.imp.stfdf.rds")

load imputated data from lake.imp.df.rds

lake.imp.df<-readRDS("data/imputated/lake.imp.df.rds")
# mean by time
lake.imp.df.mean<-tapply(lake.imp.df$EVI, lake.imp.df$time, mean)

spacetime提供的缺值估计函数

spacetime包提供了内置的缺值估计函数 na.locf、na.approx和na.spline用于估计SPFDF中存在的缺值。 其中na.locf以最后一个非NA值填充确实值,

library(sp)
pts = SpatialPoints(cbind(c(0,1),c(0,1)))
# Sys.setenv(TZ="GMT") should be "CST""
tm = seq(as.POSIXct("2012-11-25"), as.POSIXct("2012-11-30"), "1 day")
df = data.frame(a = c(NA,NA,2,3,NA,NA,NA,2,NA,NA,4,NA), b = c(NA,2,3,4,5,1,2,NA,NA,NA,NA,3))
x = STFDF(pts, tm, df)
as(x, "xts")
as(na.locf(x), "xts")
as(na.locf(x, fromLast = TRUE), "xts")
as(na.locf(na.locf(x), fromLast = TRUE), "xts")
# drops first record:
as(na.approx(x[,,1]), "xts")
# keep it:
cbind(as(na.approx(x[,,1], na.rm=FALSE), "xts"),
as(na.approx(x[,,2]), "xts"))
cbind(as(na.spline(x[,,1]), "xts"),
as(na.spline(x[,,2]), "xts"))
#disaggregate:
xout = seq(start(x), end(x), "6 hours")
as(na.approx(x[,,1], xout = xout), "xts")
as(na.spline(x[,,1], xout = xout), "xts")
as(na.spline(x[,,2], xout = xout), "xts")

# larger/real data:
data(air)
rural = STFDF(stations, dates, data.frame(PM10 = as.vector(air)))
# fill NA's with last non-NA
r = na.locf(rural)
# sample (NOT aggregate) to monthly:
m = seq(start(rural), end(rural), "1 month")
stplot(na.approx(rural[1:20,"2003:2005"], xout = m), mode = 'ts')

zoo的缺失值处理

xts继承zoo,故可以使用zoo的缺失值处理方法进行处理

library(zoo)
#去除含有缺失值的记录
na.omit(x) 
#设为0
x[is.na(x)] = 0
#平均值替代
x[is.na(x)] = mean(x,na.rm=TRUE)
#以中位数替代
x[is.na(x)] = median(x,na.rm=TRUE)
#对缺失值进行线性插值
na.spline(x)   
#对缺失值进行样条插值
na.spline(x)
#末次观测值结转法
na.locf(x)
#去掉最后一个缺失值
na.trim(x,sides="left")
#对timeSreies数据
#去掉首末位置的缺失值
na.omit(x,"ir")
#用替换首末位置的缺失值
na.omit(x, "iz")
#对首末位置的缺失值进行插值
na.omit(x,"ie")
#可以选择插值方法,before末次观测值法,after下次观测结转法
na.omit(x,method="ie",interp=c("before", "linear", "after"))
#返回x中最长的连续无缺失值的序列片段,如果有两个等长的序列片段,则返回第一个。
as.contiguous(x)


my_plot.decomposed.ts = function(x, title="", ...) {
  xx <- x$x
  if (is.null(xx)) 
    xx <- with(x, if (type == "additive") 
      random + trend + seasonal
      else random * trend * seasonal)
  plot(cbind(observed = xx, trend = x$trend, seasonal = x$seasonal, random = x$random), 
       main=title, ...)
}

my_plot.decomposed.ts(lake.imp.na.mean.ts.decom, "My Title")


library(tidyverse) # Includes the packages ggplot2 and tidyr, which we use below

# Get the time values for the time series
Time = attributes(lake.imp.na.mean.ts.decom$x)[[1]]
Time = seq(Time[1],Time[2], length.out=(Time[2]-Time[1])*Time[3])

# Convert td to data frame
dat = cbind(Time, with(lake.imp.na.mean.ts.decom, data.frame(Observed=x, Trend=trend, Seasonal=seasonal, Random=random)))

ggplot(gather(dat, component, value, -Time), aes(Time, value)) +
  facet_grid(component ~ ., scales="free_y") +
  geom_line() +
  theme_bw() +
  labs(y=expression(CO[2]~(ppm)), x="Year") +
  ggtitle(expression(Decomposed~CO[2]~Time~Series)) +
  theme(plot.title=element_text(hjust=0.5))

par(oma=c(4.5,4.5,4.5,4.5)) par(xpd=TRUE) par(xpd=NA) plot(xx$times, xx$observed, type="l", ylab="观测值 Observed", xlab='', xaxt='n') plot(xx$times, xx$trend, type="l", ylab="趋势项 Trend", xlab='', xaxt='n') plot(xx$times, xx$seasonal, type="l", ylab="季节项 Seasonal", xlab='', xaxt='n') plot(xx$times, xx$random, type="l", ylab="误差项 Random", xlab="观测时间")