Hi all!

While I’m a writing this blog, I’m sitting in the train. Soaked… again. It has been raining for quite some weeks now and the weather report doesn’t give me much hope for the next weeks. I certainly don’t want to go outside to train for the half-marathon. To make myself a little less depressed, I think it is time for a new blog.

In this blog, I wanted to see whether I am the only one, or whether the kids in the MyMovez project sample also are less physically active on rainy days.

But before we start, first let is get in the mood by playing the sound on this website.

### The data

The data of the project is publicly available, clickettyclickclick. In addition, the KNMI (Royal Netherlands Meteorological Institute) shares the day-to-day weather data on their website. For a conference poster, I downloaded the data from 27-01-2016 until 04-07-2018. This data is already merged to the physical activity data, so you don’t have to do this yourself.

The first step is to load the downloaded MyMovez data. More specifically, we unzip the file and only need the physical activity data that is stored in the folder “5. Fitbit data”. Let’s see what type of data we have here…

```
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.2
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## number_of_dates
## 1 224
## Date
## 1 2016-01-29
## 2 2016-01-28
## 3 2016-01-30
## 4 2016-04-02
## 5 2016-04-01
## 6 2016-03-31
## Date
## 219 2018-04-16
## 220 2018-05-23
## 221 2018-03-21
## 222 20-5-2018
## 223 19-5-2018
## 224 18-5-2018
```

Apparently, there is something wrong with three dates: 18-5-2018 until 20-5-2018 are stored in a different format. Let’s re-code them

```
FB<-FB %>%
mutate(Date=recode(Date,
'18-5-2018' = "2018-05-18",
'19-5-2018' = "2018-05-19",
'20-5-2018' = "2018-05-20"))
FB%>%
distinct(Date) %>%
summarize(number_of_dates = n()) # 221 dates
## number_of_dates
## 1 221
```

The factor ‘Date’ is correct now, so we can make it a date variable. For this we will use the lubridate package.

```
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
FB <- FB %>%
mutate(LDate = lubridate::date(Date))
FB %>%
select(LDate) %>%
summarize(begin = min(LDate),
end = max(LDate))
## begin end
## 1 2016-01-27 2018-07-04
```

Indeed the first day of data is 27th of January 2016 and the last date is the 4th of July in 2018.

### Manipulating the data

In this blog, I want to make two new variables. One that informs us what year it is, the other the day number in that year. For example, 4th of July in 2018 is in the year 2018 (duh) and the 4th of July is the …

```
lubridate::yday("2018-07-04")
## [1] 185
```

… 185th day of the year of course. I knew that. this function is not on the cheatsheet, but believe me, it works.

We can use the same function for all the dates:

```
FB <- FB %>%
mutate(Year = lubridate::year(LDate), # this is the year
YDay = lubridate::yday(LDate)) %>% # this is the number of day in that year
mutate(YearF = as.factor(Year)) # make the year a factor
```

Let’s see how many data-points per day we have.

```
library(ggplot2)
ggplot(data = FB, aes(x = YDay, fill=as.factor(Wave))) +
geom_histogram(alpha=.6, bins = 150) +
facet_grid(rows = vars(YearF)) +
ggtitle ("Number of observations per day number for the three years") +
xlab("Day number of the year") + ylab ("Number of observations") +
labs(fill = "Wave") +
theme(plot.title = element_text(hjust = 0.5))
```

### Weather conditions

Next, let’s see what the weather was like on those days. For that we first will take only the unique dates and there corresponding weather conditions.

```
FB %>%
distinct(LDate, Wave, YearF, YDay, Temp_mean, Hours_sun, Prec_hours, Humid) %>%
ggplot() +
geom_col(aes(x = YDay, y = Prec_hours), fill = "darkblue", alpha = .6) +
geom_point(aes(x = YDay, y = Hours_sun), color = "darkgoldenrod1", alpha = .6) +
geom_smooth(aes (x = YDay, y = Temp_mean, colour = as.factor(Wave)), alpha = .6, span = 1, method = 'lm', se = FALSE) +
facet_grid(rows = vars(YearF)) +
ggtitle ("Weather conditions per day number per year") +
xlab("Day of the year") + ylab ("") +
labs(colour = "Wave") +
theme(plot.title = element_text(hjust = 0.5))
```

The blue bars represent the hours of precipitation (a fancy word for rain). The yellow dots represent the sun, at least the number of hours that you could see the sun shine. The colored lines are the regression lines of the mean temperature per wave. We are now in March (around day number 70), so it is nice to see that the average temperature will go up in the end.

### Physical Activity

Now let’s add the physical activity data. More specifically, lets add the average number of steps per day. In order to do this, we need to calculate the mean and standard deviations of steps per day. In addition, I divided the number of steps by 1000 so it is number of K steps per day. To keep the graphic a clear, I have excluded the average temperature and created a smooth line of the mean physical activity.

```
FB %>%
filter(Imputed == "observed") %>% # only use the observed data
group_by(Wave, YearF, YDay, Hours_sun, Prec_hours) %>%
summarize(StepsK_mean = mean(Steps)/1000,
StepsK_sd = sd(Steps)/1000) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = YDay, y = Prec_hours), fill = "darkblue", alpha = .6) +
geom_point(aes(x = YDay, y = Hours_sun), color = "darkgoldenrod1", alpha = .6) +
geom_smooth(aes (x = YDay, y = StepsK_mean, colour = as.factor(Wave)), alpha = .6, span = 1, method = 'loess', se = FALSE) +
facet_grid(rows = vars(YearF)) +
ggtitle ("Physical activity and weather conditions over the three year") +
xlab("Day of the year") + ylab ("") +
labs(colour = "Wave") +
theme(plot.title = element_text(hjust = 0.5))
```

As you can observe, the weather (at least hours of sun and rain) seem to predict psychical activity in Wave 3 and perhaps also in wave 4. Fortunately we have the data and we can test whether the weather has an effect on adolescents physical activity.

Because we have multiple measures per participant and the participants are nested in classrooms, we will use a mixed effects model to control for the clustering of the data. But first, we want to make sure that the data is normally distributed. Therefore we use the Shapiro-Wilk’s test and print out the histogram of the mean minutes of MVPA. The null hypothesis of the Shapiro-Wilk’s test is that the data is normally distributed. As you can see, the plot looks sort of normally distributed, but this is not the case according to a significant Shapiro-Wilk’s test.

```
FB2<-FB %>%
filter(Imputed == "observed" & Steps >999) %>%
group_by(Child) %>%
summarize(StepsK_mean = mean(Steps)/1000) %>%
ungroup()
hist(FB2$StepsK_mean)
```

```
shapiro.test(FB2$StepsK_mean)
##
## Shapiro-Wilk normality test
##
## data: FB2$StepsK_mean
## W = 0.97592, p-value = 1.2e-14
```

### Mixed effects model with lme4

We should fix that, but we are not going to do this now. For now I will focus on the mixed effects model.

```
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## Steps ~ Temp_mean + Hours_sun + Prec_hours + Humid + as.factor(Sex) +
## Age + (1 | School) + (1 | Class) + (1 | Child) + (1 | Day)
## Data: FB
##
## REML criterion at convergence: 522636.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5764 -0.6527 -0.0689 0.5600 7.1169
##
## Random effects:
## Groups Name Variance Std.Dev.
## Child (Intercept) 5404700 2324.8
## Class (Intercept) 135685 368.4
## School (Intercept) 233903 483.6
## Day (Intercept) 917769 958.0
## Residual 22881153 4783.4
## Number of obs: 26297, groups: Child, 1481; Class, 132; School, 28; Day, 7
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 14643.146 732.255 51.359 19.997 < 2e-16 ***
## Temp_mean 58.812 5.335 24206.740 11.025 < 2e-16 ***
## Hours_sun 60.850 10.804 24599.970 5.632 1.80e-08 ***
## Prec_hours -125.907 12.576 22055.426 -10.012 < 2e-16 ***
## Humid 12.300 4.150 24063.573 2.964 0.00304 **
## as.factor(Sex)1 -1101.764 138.300 1446.305 -7.966 3.28e-15 ***
## Age -569.221 40.206 1281.668 -14.158 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Tmp_mn Hrs_sn Prc_hr Humid a.(S)1
## Temp_mean -0.039
## Hours_sun -0.330 -0.230
## Prec_hours -0.069 0.003 0.242
## Humid -0.506 -0.037 0.587 -0.166
## as.fctr(S)1 -0.111 0.000 0.003 0.004 0.003
## Age -0.659 -0.009 -0.005 0.146 0.046 0.008
```

If we look at the summary output we see under the Random Effects that the residual variance on the different levels are: 5403200; 135956; 233531; 920732. This means that the conditional explained variance of these four levels is:

```
(5403200+135956+233531+920732)/(5403200+135956+233531+920732+22881407)
## [1] 0.2263215
```

And when we look at the fixed effects, we see that there is a negative effect of sex (girls are less active than boys) and age (the older the participants are, the less active they are). If you only want to print out the fixed effects, you can use:

```
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 14643.14574 732.254601 51.35903 19.997342 6.812609e-26
## Temp_mean 58.81247 5.334634 24206.73977 11.024651 3.392223e-28
## Hours_sun 60.85028 10.803518 24599.97027 5.632451 1.796056e-08
## Prec_hours -125.90697 12.576115 22055.42573 -10.011595 1.521894e-23
## Humid 12.30023 4.150462 24063.57293 2.963580 3.043802e-03
## as.factor(Sex)1 -1101.76444 138.299835 1446.30516 -7.966491 3.280158e-15
## Age -569.22068 40.205623 1281.66756 -14.157738 2.190786e-42
```

Moreover, we see positive effects of the temperature, the hours of sunshine and the humidity. Also a negative effect of the hours of rain per day. Let’s make four graphs based on these results.

### Visualize the separate effects.

First up, the effect of the mean temperature. The alpha of the points in the graph is set to the number of observations that is used to calculate the average score. So the higher the alpha (the more visible the dot), the more data is used. I have added a grey geom_smooth with a grey line and a red line. The red line is the regression coefficient the grey follows the data. This indicates that this effect is not a perfect linear function and seems to weaken after 20 degrees. Which makes sense, because then it will become too hot to be physically active. Second, I have made a similar plot for the effect of the hours of sunshine, that provides a similar pattern.

Here we observe the same pattern. This does make sense as I can image than these two variables are correlated. However, when we check in the lme4 output, we see that there was a negative correlation between those two (when controlled for the other variables). Let’s investigate this.

Are the two variables normally distributed?

```
FB3<-FB %>%
distinct(LDate, Temp_mean, Hours_sun)
hist(FB3$Temp_mean)
```

```
shapiro.test(FB3$Temp_mean)
##
## Shapiro-Wilk normality test
##
## data: FB3$Temp_mean
## W = 0.97645, p-value = 0.0009455
hist(FB3$Hours_sun)
```

```
shapiro.test(FB3$Hours_sun)
##
## Shapiro-Wilk normality test
##
## data: FB3$Hours_sun
## W = 0.94282, p-value = 0.0000001244
```

Nope! So let’s use a rank-based correlation:

```
cor.test(FB3$Temp_mean,
FB3$Hours_sun,
method = "spearman")
## Warning in cor.test.default(FB3$Temp_mean, FB3$Hours_sun, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: FB3$Temp_mean and FB3$Hours_sun
## S = 1394893, p-value = 0.0007707
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2246031
ggplot(data = FB3) +
geom_point(aes(x = Hours_sun, y = Temp_mean), color= "orange", alpha = .4) +
geom_smooth(aes(x = Hours_sun, y = Temp_mean), color= "red", alpha = .6, span = 1, method = 'lm', se = TRUE)
```

And what is the effect of the humidity on physical activity?

In this plot you can observe that the relationship is probably not linear. Both very low and very high humid days predict less physical activity. The regression line is almost flat.

And the last plot, the plot that I anticipated, the effect of the rain on physical activity:

So there you have it. The more it rains, the less physically active the kids in the MyMovez sample are. And does this means that i have a good excuse the skip practice for tonight?

Later!