2015-04-19

How to merge time-frame data, leaving NA for the non-overlapping parts?

I have two data sets (df1 and df2), both containing date-time values. I want to produce the "objective output" shown below: merge the two data sets by c("id1", "id2") and leave NA wherever a click_timing value does not fall inside a start/end interval, or an interval contains no click.

df1

id1  id2  click_timing
1    11   2015-02-03 01:00:00
1    11   2015-02-03 02:00:00
1    12   2015-02-03 03:00:00
1    12   2015-02-03 04:00:00
1    13   2015-02-03 05:10:00
2    34   2015-02-03 03:00:00
2    34   2015-02-03 04:00:00
2    36   2015-02-03 01:00:00
...

df2

id1  id2  start                end
1    11   2015-02-03 00:20:00  2015-02-03 00:40:00
1    11   2015-02-03 00:50:00  2015-02-03 01:20:00
1    13   2015-02-03 01:10:00  2015-02-03 01:40:00
1    13   2015-02-03 04:50:00  2015-02-03 05:30:00
2    34   2015-02-03 03:50:00  2015-02-03 04:10:00
...

objective output

id1  id2  click_timing         start                end
1    11   NA                   2015-02-03 00:20:00  2015-02-03 00:40:00
1    11   2015-02-03 01:00:00  2015-02-03 00:50:00  2015-02-03 01:20:00
1    11   2015-02-03 02:00:00  NA                   NA
1    12   2015-02-03 03:00:00  NA                   NA
1    12   2015-02-03 04:00:00  NA                   NA
1    13   NA                   2015-02-03 01:10:00  2015-02-03 01:40:00
1    13   2015-02-03 05:10:00  2015-02-03 04:50:00  2015-02-03 05:30:00
2    34   2015-02-03 03:00:00  NA                   NA
2    34   2015-02-03 04:00:00  2015-02-03 03:50:00  2015-02-03 04:10:00
2    36   2015-02-03 01:00:00  NA                   NA
...

I tried merge(df1, df2, by = c("id1", "id2")), switching between all.x = T and all.y = T. I don't know exactly why it doesn't work, but I want to leave NA for unmatched values.

Answer

Tough problem! I think you have to compute the intersection between each individual click_timing value and every time period (start and end) by looping manually through all click_timing values, and then use the resulting index matches as an additional join field:

df1 <- data.frame(
  id1 = c(1, 1, 1, 1, 1, 2, 2, 2),
  id2 = c(11, 11, 12, 12, 13, 34, 34, 36),
  click_timing = as.POSIXct(c('2015-02-03 01:00:00', '2015-02-03 02:00:00', '2015-02-03 03:00:00', '2015-02-03 04:00:00',
                              '2015-02-03 05:10:00', '2015-02-03 03:00:00', '2015-02-03 04:00:00', '2015-02-03 01:00:00'))
)
df2 <- data.frame(
  id1 = c(1, 1, 1, 1, 2),
  id2 = c(11, 11, 13, 13, 34),
  start = as.POSIXct(c('2015-02-03 00:20:00', '2015-02-03 00:50:00', '2015-02-03 01:10:00', '2015-02-03 04:50:00', '2015-02-03 03:50:00')),
  end = as.POSIXct(c('2015-02-03 00:40:00', '2015-02-03 01:20:00', '2015-02-03 01:40:00', '2015-02-03 05:30:00', '2015-02-03 04:10:00'))
)
# For each click, find the first df2 row with the same ids whose interval contains it (NA if none).
m <- sapply(seq_len(nrow(df1)), function(i) {
  which(df1$id1[i] == df2$id1 & df1$id2[i] == df2$id2 &
        df1$click_timing[i] >= df2$start & df1$click_timing[i] <= df2$end)[1]
})
# Full outer join on the ids plus the match index, then drop the helper column m.
merge(cbind(df1, m = m), cbind(df2, m = seq_len(nrow(df2))),
      by = c('id1', 'id2', 'm'), all = TRUE)[-3]
##    id1 id2        click_timing               start                 end
## 1    1  11                <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
## 2    1  11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
## 3    1  11 2015-02-03 02:00:00                <NA>                <NA>
## 4    1  12 2015-02-03 04:00:00                <NA>                <NA>
## 5    1  12 2015-02-03 03:00:00                <NA>                <NA>
## 6    1  13                <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
## 7    1  13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
## 8    2  34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
## 9    2  34 2015-02-03 03:00:00                <NA>                <NA>
## 10   2  36 2015-02-03 01:00:00                <NA>                <NA>

If there is ever a case where a single click_timing value intersects multiple start/end pairs, this solution selects the match that occurs first (that is, the one with the lowest row index in df2) over the others.
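The first-match rule can be seen in isolation with a small sketch. The data frame iv and the value click below are hypothetical, made up only to force two overlapping intervals for the same id pair; they are not part of the question's data.

```r
# Hypothetical data: two overlapping intervals for the same (id1, id2) pair.
iv <- data.frame(
  id1 = c(1, 1),
  id2 = c(11, 11),
  start = as.POSIXct(c("2015-02-03 00:50:00", "2015-02-03 00:55:00"), tz = "UTC"),
  end = as.POSIXct(c("2015-02-03 01:20:00", "2015-02-03 01:30:00"), tz = "UTC")
)
click <- as.POSIXct("2015-02-03 01:00:00", tz = "UTC")

# Row indices of every interval that contains the click.
hits <- which(iv$id1 == 1 & iv$id2 == 11 & click >= iv$start & click <= iv$end)

# Subsetting with [1] keeps only the lowest row index, as in the sapply() call above.
first_hit <- hits[1]
```

Here hits holds both row indices, while first_hit keeps only row 1. If every overlapping interval should be kept instead, the [1] would have to go and m would become a list rather than a vector, which the merge step as written does not handle.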

Answer

Recreating the initial data frames and doing some minor preparation:

library(data.table) 
library(lubridate) 

df1<- fread("id1,id2,click_timing 
1,11,2015-02-03 01:00:00 
1,11,2015-02-03 02:00:00 
1,12,2015-02-03 03:00:00 
1,12,2015-02-03 04:00:00 
1,13,2015-02-03 05:10:00 
2,34,2015-02-03 03:00:00 
2,34,2015-02-03 04:00:00 
2,36,2015-02-03 01:00:00") 

# adding a redundant click_timing2 column to use as the end range for further foverlaps() function 
df1[, click_timing2:= click_timing] 
df1[,c("click_timing", "click_timing2"):= list(parse_date_time(click_timing, "%Y-%m-%d %T"), parse_date_time(click_timing2, "%Y-%m-%d %T"))] 


df2<- fread("id1,id2,start,end 
1,11,2015-02-03 00:20:00,2015-02-03 00:40:00 
1,11,2015-02-03 00:50:00,2015-02-03 01:20:00 
1,13,2015-02-03 01:10:00,2015-02-03 01:40:00 
1,13,2015-02-03 04:50:00,2015-02-03 05:30:00 
2,34,2015-02-03 03:50:00,2015-02-03 04:10:00") 

df2[,c("start","end") := list(parse_date_time(start, "%Y-%m-%d %T"), parse_date_time(end, "%Y-%m-%d %T"))] 
setkey(df2, id1, id2, start, end) 

Solution:

df3 <- foverlaps(df1, df2, by.x = c("id1", "id2", "click_timing", "click_timing2"),
                 by.y = c("id1", "id2", "start", "end"), type = "within")
objective_output <- merge(df3, df2, by = c("id1", "id2", "start", "end"), all = TRUE)
# deleting redundant click_timing2 column 
objective_output[,click_timing2:= NULL] 
# reordering columns 
setcolorder(objective_output, c(1,2,5,3,4)) 
#setting key using all columns and thus reordering all rows 
setkey(objective_output) 
objective_output 
#    id1 id2        click_timing               start                 end
# 1:   1  11 2015-02-03 02:00:00                <NA>                <NA>
# 2:   1  11                <NA> 2015-02-03 00:20:00 2015-02-03 00:40:00
# 3:   1  11 2015-02-03 01:00:00 2015-02-03 00:50:00 2015-02-03 01:20:00
# 4:   1  12 2015-02-03 03:00:00                <NA>                <NA>
# 5:   1  12 2015-02-03 04:00:00                <NA>                <NA>
# 6:   1  13                <NA> 2015-02-03 01:10:00 2015-02-03 01:40:00
# 7:   1  13 2015-02-03 05:10:00 2015-02-03 04:50:00 2015-02-03 05:30:00
# 8:   2  34 2015-02-03 03:00:00                <NA>                <NA>
# 9:   2  34 2015-02-03 04:00:00 2015-02-03 03:50:00 2015-02-03 04:10:00
#10:   2  36 2015-02-03 01:00:00                <NA>                <NA>
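
To see why the click column is duplicated before calling foverlaps(), here is a minimal sketch on toy integer data. The tables clicks and ivs and their column names are hypothetical, chosen only for illustration. foverlaps() expects an interval on both sides, so a point in time is passed as a zero-length interval [t1, t2] with t1 equal to t2; with type = "within" it matches only when that point lies inside [start, end].

```r
library(data.table)

# Hypothetical toy data: two click times and two intervals for the same id.
clicks <- data.table(id = c(1L, 1L), t1 = c(5L, 20L))
clicks[, t2 := t1]  # zero-length interval: the start and end of each click coincide

ivs <- data.table(id = c(1L, 1L), start = c(0L, 30L), end = c(10L, 40L))
setkey(ivs, id, start, end)  # the last two key columns define the interval

# by.y defaults to the key of ivs; unmatched clicks are kept with NA in start and end.
res <- foverlaps(clicks, ivs, by.x = c("id", "t1", "t2"), type = "within")
res
```

The click at 5 falls inside [0, 10] and is matched; the click at 20 falls in no interval, so it is kept with NA in start and end. Intervals that contain no click (here [30, 40]) do not appear in res, which is why the solution above adds a full outer merge with df2 afterwards.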