Photo by Zach Reiner @ unsplash.com

The effect of the weather on adolescents' physical activity.

Hi all!

While I’m a writing this blog, I’m sitting in the train. Soaked… again. It has been raining for quite some weeks now and the weather report doesn’t give me much hope for the next weeks. I certainly don’t want to go outside to train for the half-marathon. To make myself a little less depressed, I think it is time for a new blog.

In this blog, I wanted to see whether I am the only one, or whether the kids in the MyMovez project sample also are less physically active on rainy days.

But before we start, first let is get in the mood by playing the sound on this website.

The data

The data of the project is publicly available, clickettyclickclick. In addition, the KNMI (Royal Netherlands Meteorological Institute) shares the day-to-day weather data on their website. For a conference poster, I downloaded the data from 27-01-2016 until 04-07-2018. This data is already merged to the physical activity data, so you don’t have to do this yourself.

The first step is to load the downloaded MyMovez data. More specifically, we unzip the file and only need the physical activity data that is stored in the folder “5. Fitbit data”. Let’s see what type of data we have here…

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.2
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
##   number_of_dates
## 1             224
##         Date
## 1 2016-01-29
## 2 2016-01-28
## 3 2016-01-30
## 4 2016-04-02
## 5 2016-04-01
## 6 2016-03-31
##           Date
## 219 2018-04-16
## 220 2018-05-23
## 221 2018-03-21
## 222  20-5-2018
## 223  19-5-2018
## 224  18-5-2018

Apparently, there is something wrong with three dates: 18-5-2018 until 20-5-2018 are stored in a different format. Let’s re-code them

FB<-FB %>%
  mutate(Date=recode(Date, 
                     '18-5-2018' = "2018-05-18",
                     '19-5-2018' = "2018-05-19",
                     '20-5-2018' = "2018-05-20"))

FB%>%
  distinct(Date) %>%
  summarize(number_of_dates = n()) # 221 dates
##   number_of_dates
## 1             221

The factor ‘Date’ is correct now, so we can make it a date variable. For this we will use the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

FB <- FB %>%
  mutate(LDate = lubridate::date(Date))

FB %>% 
  select(LDate) %>%
  summarize(begin = min(LDate),
            end = max(LDate))
##        begin        end
## 1 2016-01-27 2018-07-04

Indeed the first day of data is 27th of January 2016 and the last date is the 4th of July in 2018.

Manipulating the data

In this blog, I want to make two new variables. One that informs us what year it is, the other the day number in that year. For example, 4th of July in 2018 is in the year 2018 (duh) and the 4th of July is the …

lubridate::yday("2018-07-04")
## [1] 185

… 185th day of the year of course. I knew that. this function is not on the cheatsheet, but believe me, it works.

We can use the same function for all the dates:

FB <- FB %>%
  mutate(Year = lubridate::year(LDate),      # this is the year
         YDay = lubridate::yday(LDate)) %>%  # this is the number of day in that year
  mutate(YearF = as.factor(Year))            # make the year a factor

Let’s see how many data-points per day we have.

library(ggplot2)

ggplot(data = FB, aes(x = YDay, fill=as.factor(Wave))) + 
  geom_histogram(alpha=.6, bins = 150) + 
  facet_grid(rows = vars(YearF)) +
  ggtitle ("Number of observations per day number for the three years") +
  xlab("Day number of the year") +  ylab ("Number of observations") +
  labs(fill = "Wave") +
  theme(plot.title = element_text(hjust = 0.5))

Weather conditions

Next, let’s see what the weather was like on those days. For that we first will take only the unique dates and there corresponding weather conditions.

FB %>%
  distinct(LDate, Wave, YearF, YDay, Temp_mean, Hours_sun, Prec_hours, Humid) %>%
  ggplot() +
  geom_col(aes(x = YDay, y = Prec_hours), fill = "darkblue", alpha = .6) +
  geom_point(aes(x = YDay, y = Hours_sun), color = "darkgoldenrod1", alpha = .6) +
  geom_smooth(aes (x = YDay, y = Temp_mean, colour = as.factor(Wave)), alpha = .6, span = 1, method = 'lm', se = FALSE) +
  facet_grid(rows = vars(YearF)) +
  ggtitle ("Weather conditions per day number per year") +
  xlab("Day of the year") +  ylab ("") +
  labs(colour = "Wave") +
  theme(plot.title = element_text(hjust = 0.5))

The blue bars represent the hours of precipitation (a fancy word for rain). The yellow dots represent the sun, at least the number of hours that you could see the sun shine. The colored lines are the regression lines of the mean temperature per wave. We are now in March (around day number 70), so it is nice to see that the average temperature will go up in the end.

Physical Activity

Now let’s add the physical activity data. More specifically, lets add the average number of steps per day. In order to do this, we need to calculate the mean and standard deviations of steps per day. In addition, I divided the number of steps by 1000 so it is number of K steps per day. To keep the graphic a clear, I have excluded the average temperature and created a smooth line of the mean physical activity.


FB %>%
  filter(Imputed == "observed") %>% # only use the observed data
  group_by(Wave, YearF, YDay, Hours_sun, Prec_hours) %>%
  summarize(StepsK_mean = mean(Steps)/1000,
            StepsK_sd = sd(Steps)/1000) %>%
  ungroup() %>%
    ggplot() +
  geom_col(aes(x = YDay, y = Prec_hours), fill = "darkblue", alpha = .6) +
  geom_point(aes(x = YDay, y = Hours_sun), color = "darkgoldenrod1", alpha = .6) +
  geom_smooth(aes (x = YDay, y = StepsK_mean, colour = as.factor(Wave)), alpha = .6, span = 1, method = 'loess', se = FALSE) +
  facet_grid(rows = vars(YearF)) +
  ggtitle ("Physical activity and weather conditions over the three year") +
  xlab("Day of the year") +  ylab ("") +
  labs(colour = "Wave") +
  theme(plot.title = element_text(hjust = 0.5))

As you can observe, the weather (at least hours of sun and rain) seem to predict psychical activity in Wave 3 and perhaps also in wave 4. Fortunately we have the data and we can test whether the weather has an effect on adolescents physical activity.

Because we have multiple measures per participant and the participants are nested in classrooms, we will use a mixed effects model to control for the clustering of the data. But first, we want to make sure that the data is normally distributed. Therefore we use the Shapiro-Wilk’s test and print out the histogram of the mean minutes of MVPA. The null hypothesis of the Shapiro-Wilk’s test is that the data is normally distributed. As you can see, the plot looks sort of normally distributed, but this is not the case according to a significant Shapiro-Wilk’s test.

FB2<-FB %>%
  filter(Imputed == "observed" & Steps >999) %>%
  group_by(Child) %>%
  summarize(StepsK_mean = mean(Steps)/1000) %>%
  ungroup()

hist(FB2$StepsK_mean)


shapiro.test(FB2$StepsK_mean)
## 
##  Shapiro-Wilk normality test
## 
## data:  FB2$StepsK_mean
## W = 0.97592, p-value = 1.2e-14

Mixed effects model with lme4

We should fix that, but we are not going to do this now. For now I will focus on the mixed effects model.

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## Steps ~ Temp_mean + Hours_sun + Prec_hours + Humid + as.factor(Sex) +  
##     Age + (1 | School) + (1 | Class) + (1 | Child) + (1 | Day)
##    Data: FB
## 
## REML criterion at convergence: 522636.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5764 -0.6527 -0.0689  0.5600  7.1169 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Child    (Intercept)  5404700 2324.8  
##  Class    (Intercept)   135685  368.4  
##  School   (Intercept)   233903  483.6  
##  Day      (Intercept)   917769  958.0  
##  Residual             22881153 4783.4  
## Number of obs: 26297, groups:  Child, 1481; Class, 132; School, 28; Day, 7
## 
## Fixed effects:
##                  Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)     14643.146    732.255    51.359  19.997  < 2e-16 ***
## Temp_mean          58.812      5.335 24206.740  11.025  < 2e-16 ***
## Hours_sun          60.850     10.804 24599.970   5.632 1.80e-08 ***
## Prec_hours       -125.907     12.576 22055.426 -10.012  < 2e-16 ***
## Humid              12.300      4.150 24063.573   2.964  0.00304 ** 
## as.factor(Sex)1 -1101.764    138.300  1446.305  -7.966 3.28e-15 ***
## Age              -569.221     40.206  1281.668 -14.158  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Tmp_mn Hrs_sn Prc_hr Humid  a.(S)1
## Temp_mean   -0.039                                   
## Hours_sun   -0.330 -0.230                            
## Prec_hours  -0.069  0.003  0.242                     
## Humid       -0.506 -0.037  0.587 -0.166              
## as.fctr(S)1 -0.111  0.000  0.003  0.004  0.003       
## Age         -0.659 -0.009 -0.005  0.146  0.046  0.008

If we look at the summary output we see under the Random Effects that the residual variance on the different levels are: 5403200; 135956; 233531; 920732. This means that the conditional explained variance of these four levels is:

(5403200+135956+233531+920732)/(5403200+135956+233531+920732+22881407)
## [1] 0.2263215

And when we look at the fixed effects, we see that there is a negative effect of sex (girls are less active than boys) and age (the older the participants are, the less active they are). If you only want to print out the fixed effects, you can use:

##                    Estimate Std. Error          df    t value     Pr(>|t|)
## (Intercept)     14643.14574 732.254601    51.35903  19.997342 6.812609e-26
## Temp_mean          58.81247   5.334634 24206.73977  11.024651 3.392223e-28
## Hours_sun          60.85028  10.803518 24599.97027   5.632451 1.796056e-08
## Prec_hours       -125.90697  12.576115 22055.42573 -10.011595 1.521894e-23
## Humid              12.30023   4.150462 24063.57293   2.963580 3.043802e-03
## as.factor(Sex)1 -1101.76444 138.299835  1446.30516  -7.966491 3.280158e-15
## Age              -569.22068  40.205623  1281.66756 -14.157738 2.190786e-42

Moreover, we see positive effects of the temperature, the hours of sunshine and the humidity. Also a negative effect of the hours of rain per day. Let’s make four graphs based on these results.

Visualize the separate effects.

First up, the effect of the mean temperature. The alpha of the points in the graph is set to the number of observations that is used to calculate the average score. So the higher the alpha (the more visible the dot), the more data is used. I have added a grey geom_smooth with a grey line and a red line. The red line is the regression coefficient the grey follows the data. This indicates that this effect is not a perfect linear function and seems to weaken after 20 degrees. Which makes sense, because then it will become too hot to be physically active. Second, I have made a similar plot for the effect of the hours of sunshine, that provides a similar pattern.

Here we observe the same pattern. This does make sense as I can image than these two variables are correlated. However, when we check in the lme4 output, we see that there was a negative correlation between those two (when controlled for the other variables). Let’s investigate this.

Are the two variables normally distributed?

FB3<-FB %>%
  distinct(LDate, Temp_mean, Hours_sun)

hist(FB3$Temp_mean)

shapiro.test(FB3$Temp_mean)
## 
##  Shapiro-Wilk normality test
## 
## data:  FB3$Temp_mean
## W = 0.97645, p-value = 0.0009455

hist(FB3$Hours_sun)

shapiro.test(FB3$Hours_sun)
## 
##  Shapiro-Wilk normality test
## 
## data:  FB3$Hours_sun
## W = 0.94282, p-value = 0.0000001244

Nope! So let’s use a rank-based correlation:


cor.test(FB3$Temp_mean, 
         FB3$Hours_sun,
         method = "spearman")
## Warning in cor.test.default(FB3$Temp_mean, FB3$Hours_sun, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  FB3$Temp_mean and FB3$Hours_sun
## S = 1394893, p-value = 0.0007707
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2246031

ggplot(data = FB3) +
  geom_point(aes(x = Hours_sun, y = Temp_mean), color= "orange",  alpha = .4) +
  geom_smooth(aes(x = Hours_sun, y = Temp_mean), color= "red", alpha = .6, span = 1, method = 'lm', se = TRUE)

And what is the effect of the humidity on physical activity?

In this plot you can observe that the relationship is probably not linear. Both very low and very high humid days predict less physical activity. The regression line is almost flat.

And the last plot, the plot that I anticipated, the effect of the rain on physical activity:

So there you have it. The more it rains, the less physically active the kids in the MyMovez sample are. And does this means that i have a good excuse the skip practice for tonight?

Later!

Related