
Multilevel Modeling in R (2.6)

A Brief Introduction to R, the multilevel package and the nlme package

Paul Bliese (paul.bliese@moore.sc.edu) August 3, 2016


Copyright © 2016, Paul Bliese. Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies. For other permissions, please contact Paul Bliese at paul.bliese@moore.sc.edu. Chapters 1 and 2 of this document borrow heavily from An Introduction to R (see the copyright notice below).

An Introduction to R

Notes on R: A Programming Environment for Data Analysis and Graphics Version 1.1.1 (2000 August 15)

R Development Core Team.

Copyright © 1990, 1992 W. Venables
Copyright © 1997, R. Gentleman & R. Ihaka
Copyright © 1997, 1998 M. Mächler
Copyright © 1999, 2000 R Development Core Team

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Development Core Team.


Table of Contents

1 Introduction
2 An Introduction to R
2.1 Overview
2.1.1 Related software and documentation
2.1.2 R and statistics
2.1.3 Obtaining R and the multilevel package
2.1.4 Data permanency and removing objects
2.1.5 Running R for Different Projects
2.1.6 Recall and correction of previous commands
2.1.7 Getting help with functions and features
2.1.8 R commands, case sensitivity, etc.
2.2 Simple manipulations; numbers and vectors
2.2.1 Vectors and assignment
2.2.2 Missing values
2.3 Dataframes
2.3.1 Introduction to dataframes
2.3.2 Making dataframes
2.3.3 Managing the search path
2.4 Reading data from files
2.4.1 Reading Spreadsheet (EXCEL) data
2.4.2 The extremely useful "clipboard" option
2.4.3 The foreign package and SPSS files
2.4.4 Using file.choose to bring up a GUI to read data
2.4.5 Checking your dataframes with str, summary, and head
2.4.6 Loading data from packages
2.4.7 Exporting data to spreadsheets using write() and write.table()
2.5 More on using matrix brackets on dataframes
2.6 Identifying Statistical models in R
2.6.1 Examples
2.6.2 Linear models
2.6.3 Generic functions for extracting model information
2.7 Graphical procedures
2.7.1 The plot() function
2.7.2 Displaying multivariate data
2.7.3 Advanced Graphics and the lattice package
3 Multilevel Analyses
3.1 Attaching the multilevel and nlme packages
3.2 Multilevel data manipulation functions
3.2.1 The merge Function
3.2.2 The aggregate function
3.3 Within-Group Agreement and Reliability
3.3.1 Agreement: rwg, rwg(j), and r*wg(j)
3.3.2 The awg Index
3.3.3 Significance testing of rwg and rwg(j) using rwg.sim and rwg.j.sim
3.3.4 Average Deviation (AD) Agreement using ad.m
3.3.5 Significance testing of AD using ad.m.sim
3.3.6 Agreement: Random Group Resampling
3.3.7 Reliability: ICC(1) and ICC(2)
3.3.8 Estimate multiple ICC values: mult.icc
3.3.9 Comparing ICC values with a two-stage bootstrap: boot.icc
3.3.10 Visualizing an ICC(1) with graph.ran.mean
3.3.11 Simulating ICC(1) values with sim.icc
3.4 Regression and Contextual OLS Models
3.4.1 Contextual Effect Example
3.5 Correlation Decomposition and the Covariance Theorem
3.5.1 The waba and cordif functions
3.5.2 Random Group Resampling of Covariance Theorem (rgr.waba)
4 Mixed-Effects Models for Multilevel Data
4.1 Steps in multilevel modeling
4.1.1 Step 1: Examine the ICC for the Outcome
4.1.2 Step 2: Explain Level 1 and 2 Intercept Variance
4.1.3 Step 3: Examine and Predict Slope Variance
4.2 Plotting an interaction with interaction.plot
4.3 Some Notes on Centering
4.4 Estimating Group-Mean Reliability (ICC2) with GmeanRel
5 Growth Modeling Repeated Measures Data
5.1 Methodological challenges
5.2 Data Structure and the make.univ Function
5.3 Growth Modeling Illustration
5.3.1 Step 1: Examine the DV
5.3.2 Step 2: Model Time
5.3.3 Step 3: Model Slope Variability
5.3.4 Step 4: Modeling Error Structures
5.3.5 Step 5: Predicting Intercept Variation
5.3.6 Step 6: Predicting Slope Variation
5.4 Discontinuous Growth Models
5.5 Testing Emergence by Examining Error Structure
5.6 Empirical Bayes estimates
6 A brief introduction to lme4
6.1 Dichotomous outcomes
6.2 Crossed and partially crossed models
6.3 Predicting values in lme4
7 Miscellaneous Functions
7.1 Scale reliability: cronbach and item.total
7.2 Random Group Resampling for OLS Regression Models
7.3 Estimating bias in nested regression models: simbias
7.4 Detecting mediation effects: sobel
8 Conclusion


1 Introduction

This is an introduction to how R can be used to perform a wide variety of multilevel analyses. Multilevel analyses are applied to data that have some form of a nested structure. For instance, individuals may be nested within workgroups, or repeated measures may be nested within individuals. Nested structures in data are often accompanied by some form of non-independence. For instance, in work settings, individuals in the same workgroup often display similar performance and provide similar responses to questions about aspects of the work environment. Likewise, in repeated measures data, individuals typically display a high degree of similarity in responses over time. Non-independence may be considered either a nuisance variable or something to be substantively understood, but the prevalence of nested data requires that analysts have a variety of tools to deal with it.

The term “multilevel analysis” is frequently used to describe a set of analyses also referred to as random coefficient models, random effects models, and mixed-effects models (see Bryk & Raudenbush, 1992; Clark & Linzer, 2014; Kreft & De Leeuw, 1998; Pinheiro & Bates, 2000; Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). Mixed-effects models (the term primarily used in this document) are not without limitations (e.g., Clark & Linzer, 2014), but are generally well-suited for dealing with non-independence. Nonetheless, prior to the widespread use of mixed-effects models, analysts used a variety of techniques to examine data with nested structures. Furthermore, in certain areas such as organizational research, mixed-effects models are often augmented by tools designed to quantify within-group agreement and group-mean reliability. Therefore, my goal in writing this document is to show how R can be used to address a wide range of inter-related topics in multilevel analysis, including:

• Within-group agreement and reliability
• Contextual and fixed-effects OLS models
• Covariance theorem decomposition
• Random Group Resampling
• Mixed Effects Models for nested group data
• Variants of Mixed Effects Models for Repeated Measures Data (Growth Modeling, Discontinuous Growth Modeling)

The wide variety of topics requires covering several “packages” written for R. The first of these packages is the R “stats” package. The “stats” package is automatically loaded and provides common statistics functions to estimate ANOVA (aov) and regression models (lm) used in contextual OLS and fixed-effects models.

In addition to the stats package, the manuscript relies heavily on the multilevel package. The multilevel package provides (a) tools to estimate a variety of within-group agreement and reliability measures, (b) data manipulation functions to facilitate multilevel and longitudinal analyses, and (c) a number of datasets to illustrate concepts.

Finally, the text makes considerable use of the non-linear and linear mixed-effects (nlme) model package (Pinheiro & Bates, 2000). The nlme package provides functions to estimate a variety of mixed-effects models for both data nested in groups and for repeated measures data collected over time (growth models). Functions in nlme have remarkable flexibility, allowing one to estimate a variety of alternative statistical models. This document also provides a very brief description of the lme4 package. The lme4 package was developed by Doug Bates and extends one’s ability to estimate mixed-effects models in several important ways, two of which are (a) when the dependent variable is dichotomous and (b) when data are partially or fully crossed instead of being fully nested.

This document begins with a brief introduction to R. The material in the introduction is in many cases lifted word-for-word from the document entitled “An Introduction to R” (see the copyright notice on page 2). This brief introduction is intended to give readers a feel for R, and readers familiar with R should feel free to skip this material. Following the introduction to R, the manuscript focuses on using R to conduct multilevel analyses.

2 An Introduction to R

2.1 Overview

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:

• Effective data handling and storage facilities
• A suite of operators for calculations on arrays, in particular matrices
• A large, integrated collection of tools for data analysis
• Graphical facilities for data analysis
• A well-developed and effective programming language

2.1.1 Related software and documentation

R shares many similarities with the S language developed at AT&T by Rick Becker, John Chambers and Allan Wilks. A number of the books and manuals about S bear some relevance to R.

The basic reference is The New S Language: A Programming Environment for Data Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks. The features of the 1991 release of S (S version 3) are covered in Statistical Models in S edited by John M. Chambers and Trevor J. Hastie. Both of these texts are highly useful for R users.

2.1.2 R and statistics

The developers of R think of it as an environment within which many classical and modern statistical techniques have been implemented. Some of these are built into the base R environment, but many are supplied as packages. There are a number of packages supplied with R (called "standard" packages) and many more are available through the CRAN family of Internet sites (via http://cran.r-project.org).

There is an important difference in philosophy between R and the other main statistical systems. In R a statistical analysis is normally done as a series of steps with intermediate results stored in objects. Thus, whereas SAS and SPSS provide detailed output files from any specific analysis, R provides minimal output and stores the results in a fit object for subsequent calls by functions such as summary.

2.1.3 Obtaining R and the multilevel package

The CRAN websites and mirrors (http://cran.r-project.org) provide binary files for installing R in Windows (and other) computing environments. The base program and a number of default packages can be downloaded and installed using a single executable file (*.exe).

The base program is augmented by numerous packages. As of the writing of this manuscript, the nlme package is included with the base distribution; however, the multilevel package needs to be obtained using the "packages" GUI option in R. Other packages such as the foreign package (for importing SPSS and other types of data) and the lattice package (for graphics) are also included as part of the base distribution.

2.1.4 Data permanency and removing objects

In R, one works in an area called the “workspace.” The workspace is a working environment where objects are created and manipulated. Objects commonly kept in the workspace are (a) entire data sets (i.e. dataframes) and (b) the output of statistical analyses. It is also relatively common to keep programs (i.e., functions) that do special project-related tasks within the workspace.

The R commands

> objects()

or

> ls()

display the names of the objects in the workspace. As given above, the objects() command lists the objects in search position 1 corresponding to the workspace (or technically the “.GlobalEnv”). The open and closed parentheses containing no content are a shortcut for (1). It will later become apparent that it is often useful to list objects in other search positions.

Within the workspace, one removes objects using the rm function:

> rm(x, y, ink, temp, foo)

It is important to keep in mind that there are functionally two states to the objects listed in the workspace. The first is permanently stored in the “.RData” file in the working directory and represents a previous save of the workspace. The second object state is anything created during the current session. These latter objects reside entirely in memory unless explicitly saved to the workspace “.RData” file. In other words, if you fail to save the workspace after adding or modifying objects you create in the current session, they will NOT be there next time you start R and load the specific workspace.

There are two ways to save current objects, both of which use the save.image function. First, one can use the “Save Workspace” option from the File menu to specify where to save the workspace. This option is GUI based, and allows the user to use a mouse to specify a location. The other option is to call the save.image function directly from the command line, as in:

> save.image("F:\\TEMP\\Project 1.RData")

In this case, the save.image function writes the objects in memory to the “Project 1.RData” file in the TEMP subdirectory on the F: Drive. If calling save.image directly, it is advisable to end the file name with ".RData" so that R recognizes the file as an R workspace.

2.1.5 Running R for Different Projects

As you develop proficiency with R, you will inevitably end up using it for multiple projects. It will become necessary, therefore, to keep separate workspaces. Each workspace will likely contain one or more related datasets, model results and programs written for specific project-related tasks.

For instance, I use R to analyze data files for manuscripts that are being written, revised and (theoretically) eventually published. Because of the length of the review process, it may often be several months before I return to a specific project. Consequently, I have found it helpful to store the R workspace and analysis script in the same location as the manuscript so the data and statistical models supporting the manuscript are immediately at hand. To save workspaces, follow these steps:

1. Keep your initial workspace empty – no objects

2. Import the raw data (more on this later) and perform the analyses.

3. From the File menu, select “Save Workspace” and save the workspace in a project folder with a name of your choosing (but with an extension of .RData).

4. Use script files (command lines) and save scripts in the project folder as well.

By keeping separate workspaces and script files, data and code are available for subsequent analyses and there will be no need to import the data more than once.

2.1.6 Recall and correction of previous commands

Under Windows, R provides a mechanism for recalling and re-executing previous commands. The vertical arrow keys on the keyboard can be used to scroll forward and backward through a command history. Once a command is located in this way, the cursor can be moved within the command using the horizontal arrow keys, and characters can be removed with the DEL key or added with the other keys.

2.1.7 Getting help with functions and features

R has a built-in help facility. To get more information on any specific named function, for example solve, the command is

> help(solve)

For a feature specified by special characters, the argument must be enclosed in double or single quotes, making it a "character string":

> help("[[")

Either form of quote mark may be used to escape the other, as in the string "It's important". Our convention is to use double quote marks for preference.


Searches of help files can be conducted using the help.search function. For instance, to find functions related to regression one would type:

> help.search("regression")

2.1.8 R commands, case sensitivity, etc.

Technically R is an expression language with a very simple syntax. It is case sensitive, so “A” and “a” are different symbols and would refer to different variables.

Elementary commands consist of either expressions or assignments. If an expression is given as a command, it is evaluated, printed, and the value is lost. An assignment also evaluates an expression and passes the value to a variable but the result is not automatically printed.

Commands are separated either by a semi-colon (‘;’) or by a new line. Elementary commands can be grouped together into one compound expression by braces (‘{’ .. ‘}’). Comments can be put almost anywhere: starting with a hashmark (‘#’), everything to the end of the line is a comment.

If a command is not complete at the end of a line, R will give a different prompt, by default + on second and subsequent lines, and continue to read input until the command is syntactically complete. In providing examples, this document will generally omit the continuation prompt and indicate continuation by simple indenting.

2.2 Simple manipulations; numbers and vectors

2.2.1 Vectors and assignment

R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. To set up a vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the R command

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

This is an assignment statement using the function c() which in this context can take an arbitrary number of vector arguments and whose value is a vector obtained by concatenating its arguments end to end.

A number occurring by itself in an expression is taken as a vector of length one. Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less than”) and ‘-’ (“minus”) occurring strictly side-by-side, and it ‘points’ to the object receiving the value of the expression. In current versions of R, assignments can also be made using the = sign.

> x=c(10.4, 5.6, 3.1, 6.4, 21.7)

Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using

> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

If an expression is used as a complete command, the value is printed and lost. So now if we were to issue the command

> 1/x

the reciprocals of the five values would be printed at the screen (and the value of x, of course, unchanged). The further assignment

> y <- c(x, 0, x)

would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.

2.2.2 Missing values

In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general, any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.

Many of the functions in R have options for handling missing values such as na.action=na.omit or na.rm=T (both of which remove or omit the missing values and run the analyses on the non-missing data). Details on how to handle missing values are found in the help files associated with specific functions.
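As a minimal illustration of both points:

> z <- c(10.4, NA, 3.1)
> mean(z) #any operation involving an NA yields NA
[1] NA
> mean(z, na.rm=T) #remove the missing value before averaging
[1] 6.75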

Most of the functions in the multilevel package (that we will discuss in detail later) require data that have no missing values. To create such data, one can make use of the na.exclude function. The object returned from na.exclude is a new dataframe that has listwise deletion of missing values. So

> TDATA<-na.exclude(DATA)

will produce a dataframe TDATA that contains no missing values. The TDATA dataframe can then be used in subsequent analyses. Practically speaking, it rarely makes sense to use na.exclude on an entire dataframe; rather, one typically selects a subset of variables upon which to apply na.exclude such as na.exclude(DATA[,c("var1","var2")]). We discuss dataframes and how to select parts of a dataframe in more detail in the next section.

2.3 Dataframes

2.3.1 Introduction to dataframes

A dataframe is an object that stores data. Dataframes have multiple columns representing different variables and multiple rows representing different observations. The columns can be numeric vectors or non-numeric vectors; however, each column must have the same number of observations. Thus, for all practical purposes, one can consider dataframes to be spreadsheets with the limitation that each column must have the same number of observations.

Dataframes may be displayed in matrix form, and their rows and columns may be extracted using matrix indexing conventions. This means, for example, that one can access specific rows and columns of a dataframe using brackets [rows, columns]. For example, to access rows 1-3 and all columns of a dataframe object named TDAT:

> TDAT[1:3,]

To access rows 1-3 and columns 1, 5 and 8:

> TDAT[1:3,c(1,5,8)]

We will consider matrix bracket manipulations in more detail with a specific example in section 2.5.

2.3.2 Making dataframes

Dataframes can be created using the data.frame function. The following example makes a dataframe object called accountants.

> accountants<-data.frame(home=c("MD","CA","TX"),
    income=c(45000,55000,60000),car=c("honda","acura","toyota"))
> accountants
  home income    car
1   MD  45000  honda
2   CA  55000  acura
3   TX  60000 toyota

The $ operator can be used to access specific components of dataframes. For instance, accountants$car returns the car column within the dataframe accountants. In practice, one will generally make dataframes from existing files using data importing functions such as read.table, read.csv or read.spss. These functions read data sets from external files and create dataframes. We discuss these types of functions in section 2.4.

2.3.3 Managing the search path

The function search shows the current search path and so is a useful way to keep track of what has been attached. Initially, it gives the global environment in search position 1 followed by various packages that are automatically loaded (actual results may vary depending upon the specific version of R).

> search()
[1] ".GlobalEnv"        "package:methods"   "package:stats"
[4] "package:graphics"  "package:utils"     "Autoloads"
[7] "package:base"

where .GlobalEnv is the workspace. Basically, the search path means that if you type in an object such as car the program will look for something named car first in the workspace, then in the package methods, then in the package stats, etc. Because car does not exist in any of these places, the following error message will be returned:

> car

Error: Object "car" not found


> attach(accountants)
> search()
[1] ".GlobalEnv"        "accountants"       "package:methods"
[4] "package:stats"     "package:graphics"  "package:utils"
[7] "Autoloads"         "package:base"

In this case, typing car at the command prompt returns:

> car
[1] honda  acura  toyota
Levels: acura honda toyota

It is often useful to see what objects exist within various components of the search path. The function objects() with the search position of interest in the parentheses can be used to examine the contents of any object in the search path. For instance, to see the contents of search position 2 one types:

> objects(2)

[1] "car" "home" "income"

Finally, we detach the dataframe and confirm it has been removed from the search path.

> detach("accountants")
> search()
[1] ".GlobalEnv"        "package:methods"   "package:stats"
[4] "package:graphics"  "package:utils"     "Autoloads"
[7] "package:base"

While I have used attach() and detach() to illustrate search(), I strongly recommend that users do not attach dataframes and instead rely on the $ operator and the data option available within most functions. In my experience, it is easy to attach a dataframe, forget it is attached, and then inadvertently apply a series of analyses to the wrong dataframe.
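For instance, the mean income in the accountants dataframe can be computed without attaching anything:

> mean(accountants$income)
[1] 53333.33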

2.4 Reading data from files

In R sessions, large data objects will almost always be read from external files and stored as dataframes. There are several options available to read external files.

If variables are stored in spreadsheets such as EXCEL, entire dataframes can be read directly using the function read.table() and variants such as read.csv() and read.delim(). The help file for read.table() discusses the details of the variants of read.table().

If variables are stored in other statistical packages such as SPSS or SAS, then the foreign package provides useful functions for importing the data. This document will illustrate importing spreadsheet data and SPSS data.

2.4.1 Reading Spreadsheet (EXCEL) data

External spreadsheets normally have this form:

• The first line of the file has a name for each variable.
• Each additional line of the file has values for each variable.

So the first few lines of spreadsheet data might look as follows.

UNIT  PLATOON COH01 COH02 COH03 COH04 COH05
1044B     1ST     4     5     5     5     5
1044B     1ST     3    NA     5     5     5
1044B     1ST     2     3     3     3     3
1044B     2ND     3     4     3     4     4
1044B     2ND     4     4     3     4     4
1044B     2ND     3     3     2     2     1
1044C     1ST     3     3     3     3     3
1044C     1ST     3     1     4     3     4
1044C     2ND     3     3     3     3     3
1044C     2ND     2     2     2     3     2
1044C     2ND     1     1     1     3     3

One of the most reliable ways to import any type of data into R is to use EXCEL to process the data file into a comma delimited (*.csv) format. Note that most statistical packages (SAS, SPSS) can save data as an EXCEL file. Users who export data from SPSS to EXCEL may encounter the marker "#NULL!" for missing values. This value should be changed to NA, as in the second entry under COH02 in the example above, to avoid problems in R.

Once the comma delimited file is created using the “Save As” feature in EXCEL, one can import it into R using either the read.table() or the read.csv() function. For instance, if the file above is saved as “cohesion.csv” in the root directory of C: (C:\), the function read.table() can be used to read the dataframe directly:

> cohesion<-read.table("c:\\cohesion.csv", header=TRUE, sep=",")

Alternatively, one can use read.csv():

> cohesion<-read.csv("c:\\cohesion.csv")

Note that subdirectories are designated using the double slash instead of a single slash; also recall that R is case sensitive. Finally, note the default for read.csv is header=TRUE, so that option can be omitted.

A final alternative discussed in more detail in section 2.4.4 is to use file.choose() to avoid having to specify the path as in:

>cohesion<-read.csv(file.choose())

Using file.choose() opens the graphical user interface (GUI) so one can select the file using a mouse or other device. This option is particularly useful when data are stored in complex network file structures.

Typing in the name of the cohesion object displays all of the data:

> cohesion
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1  1044B     1ST     4     5     5     5     5
2  1044B     1ST     3    NA     5     5     5
3  1044B     1ST     2     3     3     3     3
4  1044B     2ND     3     4     3     4     4
5  1044B     2ND     4     4     3     4     4
6  1044B     2ND     3     3     2     2     1
7  1044C     1ST     3     3     3     3     3
8  1044C     1ST     3     1     4     3     4
9  1044C     2ND     3     3     3     3     3
10 1044C     2ND     2     2     2     3     2
11 1044C     2ND     1     1     1     3     3

2.4.2 The extremely useful "clipboard" option

In R, users can directly read and write data to a Windows clipboard to export and import data into EXCEL and other programs without saving intermediate files.

For instance, to read cohesion into R directly from EXCEL, one would:

1. Open the cohesion.xls file in EXCEL
2. Select and copy the relevant cells in Windows (Ctrl-C)
3. Issue the R command (it is important to issue the command from the console and NOT a script file; if you issue the command from a script file, the command itself goes into the clipboard):

> cohesion<-read.table(file="clipboard",sep="\t",header=T)

The file "clipboard" instructs read.table to read the file from the Windows clipboard, and the separator option of "\t" indicates that elements are separated by tabs. In general, blank cells in EXCEL are interpreted as missing values; however, if columns are imported as factors instead of numeric vectors, it is often because of how missing values are coded in EXCEL, so you may need to convert missing cells to NA in some cases (or alternatively convert NA entries into blank cases).

Because the "clipboard" option also works with write.table, (see section 2.4.7) it is also a useful way to export the results of data analyses to EXCEL or other programs. For instance, if we create a correlation matrix from the cohesion data set, we can export this correlation table directly to EXCEL.

> CORMAT<-cor(cohesion[,3:7],use="pairwise.complete.obs")
> CORMAT
          COH01     COH02     COH03     COH04     COH05
COH01 1.0000000 0.7329843 0.6730782 0.4788431 0.4485426
COH02 0.7329843 1.0000000 0.5414305 0.6608190 0.3955316
COH03 0.6730782 0.5414305 1.0000000 0.7491526 0.7901837
COH04 0.4788431 0.6608190 0.7491526 1.0000000 0.9036961
COH05 0.4485426 0.3955316 0.7901837 0.9036961 1.0000000
> write.table(CORMAT,file="clipboard",sep="\t",col.names=NA)

Going to EXCEL and issuing the Windows "paste" command (or Ctrl-V) will insert the matrix into the EXCEL worksheet. Note the somewhat counter-intuitive use of col.names=NA in this example. This command does not mean omit the column names (achieved using col.names=F); instead, the command puts an extra blank in the first row of the column names to line up the column names with the correct columns. Alternatively, one can use the option row.names=F to omit the row numbers.

In certain cases, written objects may be too large for the default memory limit of the Windows clipboard. For instance, if one writes the full bh1996 dataset from the multilevel package to the clipboard with the intent of pasting it into EXCEL, the following error (truncated) is returned:

> library(multilevel)
> data(bh1996) #Bring data from the library to the workspace
> write.table(bh1996,file="clipboard",sep="\t",col.names=NA)
Warning message:
In write.table(x, file, nrow(x),... as.integer(quote), :
  clipboard buffer is full and output lost

To increase the size of the clipboard to 1.5MB (or any other arbitrary size), the "clipboard" option can be modified as follows: "clipboard-1500". Note that the options surrounding the use of the clipboard are specific to various operating systems and may change with different versions of R, so it is worth periodically referring to the help files.

2.4.3 The foreign package and SPSS files

Included in current versions of R is the foreign package. This package contains functions to import SPSS, SAS, Stata and Minitab files.

> library(foreign)
> search()
 [1] ".GlobalEnv"         "package:foreign"    "package:multilevel"
 [4] "package:methods"    "package:stats"      "package:graphics"
 [7] "package:grDevices"  "package:utils"      "package:datasets"
[10] "Autoloads"          "package:base"
> objects(2)
 [1] "data.restore"  "lookup.xport"  "read.dbf"      "read.dta"
 [5] "read.epiinfo"  "read.mtp"      "read.octave"   "read.S"
 [9] "read.spss"     "read.ssd"      "read.systat"   "read.xport"
[13] "write.dbf"     "write.dta"     "write.foreign"

For example, if the data in cohesion is stored in an SPSS sav file in a TEMP directory, then one could issue the following command to read in the data (text following the # mark is a comment):

> help(read.spss) #look at the documentation on read.spss
> cohesion2<-read.spss("c:\\temp\\cohesion.sav")

> cohesion2 #look at the cohesion object

$UNIT
 [1] "1044B" "1044B" "1044B" "1044B" "1044B" "1044B" "1044C" "1044C" "1044C"
[10] "1044C" "1044C"

$PLATOON
 [1] "1ST" "1ST" "1ST" "2ND" "2ND" "2ND" "1ST" "1ST" "2ND" "2ND" "2ND"

$COH01
 [1] 4 3 2 3 4 3 3 3 3 2 1

$COH02
 [1] 5 NA 3 4 4 3 3 1 3 2 1

$COH03
 [1] 5 5 3 3 3 2 3 4 3 2 1

$COH04
 [1] 5 5 3 4 4 2 3 3 3 3 3

$COH05
 [1] 5 5 3 4 4 1 3 4 3 2 3

attr(,"label.table")
attr(,"label.table")$UNIT
NULL
attr(,"label.table")$PLATOON
NULL
attr(,"label.table")$COH01
NULL
attr(,"label.table")$COH02
NULL
attr(,"label.table")$COH03
NULL
attr(,"label.table")$COH04
NULL
attr(,"label.table")$COH05
NULL

The cohesion2 object is stored as a list rather than a dataframe. With the default options, the read.spss function imports the file as a list and reads information about data labels. In almost every case, users will want to convert the list object into a dataframe for manipulation in R. This can be done using the data.frame command.

> cohesion2<-data.frame(cohesion2)
> cohesion2
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1  1044B     1ST     4     5     5     5     5
2  1044B     1ST     3    NA     5     5     5
3  1044B     1ST     2     3     3     3     3
4  1044B     2ND     3     4     3     4     4
5  1044B     2ND     4     4     3     4     4
6  1044B     2ND     3     3     2     2     1
7  1044C     1ST     3     3     3     3     3
8  1044C     1ST     3     1     4     3     4
9  1044C     2ND     3     3     3     3     3
10 1044C     2ND     2     2     2     3     2
11 1044C     2ND     1     1     1     3     3

Alternatively, users can change the default options in read.spss to read the data directly into a dataframe. Note the use of use.value.labels=F and to.data.frame=T below:

> cohesion2<-read.spss("c:\\temp\\cohesion.sav", use.value.labels=F, to.data.frame=T)

> cohesion2
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1  1044B     1ST     4     5     5     5     5
2  1044B     1ST     3    NA     5     5     5
3  1044B     1ST     2     3     3     3     3
4  1044B     2ND     3     4     3     4     4
5  1044B     2ND     4     4     3     4     4
6  1044B     2ND     3     3     2     2     1
7  1044C     1ST     3     3     3     3     3
8  1044C     1ST     3     1     4     3     4
9  1044C     2ND     3     3     3     3     3
10 1044C     2ND     2     2     2     3     2
11 1044C     2ND     1     1     1     3     3

The cohesion dataframe (made using the EXCEL and csv files) and cohesion2 (imported from SPSS) are now identical.

2.4.4 Using file.choose to bring up a GUI to read data

One limitation with using command lines to specify where files are located is that in complex directory structures it can be hard to specify the correct location of the data. For instance, if data are embedded several layers deep in subdirectories on a network drive, it may be difficult to specify the path. In these cases, the file.choose function is a useful way to identify the file. The file.choose function opens a Graphical User Interface (GUI) dialogue box allowing one to select files using the mouse, and it can be embedded within any function where one has to specifically identify a file. So, for instance, one can use file.choose with read.spss:

> cohesion2<-read.spss(file.choose(),use.value.labels=F, to.data.frame=T)

Notice how file.choose() replaces "c:\\temp\\cohesion.sav" used in the final example in section 2.4.3. With the use of file.choose, a GUI dialogue box opens allowing one to select a specific SPSS sav file.

2.4.5 Checking your dataframes with str , summary, and head

With small data sets it is easy to verify that the data has been read in correctly. Often, however, one will be working with large data sets that are difficult to visual verify.

Consequently, functions such as str (structure), summary and head provide easy ways to examine dataframes.

> str(cohesion)
'data.frame':   11 obs. of  7 variables:
 $ UNIT   : Factor w/ 2 levels "1044B","1044C": 1 1 1 1 1 1 2 2 2 2 ...
 $ PLATOON: Factor w/ 2 levels "1ST","2ND": 1 1 1 2 2 2 1 1 2 2 ...
 $ COH01  : int 4 3 2 3 4 3 3 3 3 2 ...
 $ COH02  : int 5 NA 3 4 4 3 3 1 3 2 ...
 $ COH03  : int 5 5 3 3 3 2 3 4 3 2 ...
 $ COH04  : int 5 5 3 4 4 2 3 3 3 3 ...
 $ COH05  : int 5 5 3 4 4 1 3 4 3 2 ...

> summary(cohesion)
   UNIT   PLATOON     COH01           COH02          COH03
 1044B:6   1ST:5   Min.   :1.000   Min.   :1.00   Min.   :1.000
 1044C:5   2ND:6   1st Qu.:2.500   1st Qu.:2.25   1st Qu.:2.500
                   Median :3.000   Median :3.00   Median :3.000
                   Mean   :2.818   Mean   :2.90   Mean   :3.091
                   3rd Qu.:3.000   3rd Qu.:3.75   3rd Qu.:3.500
                   Max.   :4.000   Max.   :5.00   Max.   :5.000
                                   NA's   :1.00
     COH04           COH05
 Min.   :2.000   Min.   :1.000
 1st Qu.:3.000   1st Qu.:3.000
 Median :3.000   Median :3.000
 Mean   :3.455   Mean   :3.364
 3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :5.000   Max.   :5.000

> head(cohesion) #list the first six rows of data in a dataframe
   UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1 1044B     1ST     4     5     5     5     5
2 1044B     1ST     3    NA     5     5     5
3 1044B     1ST     2     3     3     3     3
4 1044B     2ND     3     4     3     4     4
5 1044B     2ND     4     4     3     4     4
6 1044B     2ND     3     3     2     2     1

2.4.6 Loading data from packages

One of the useful attributes of R is that the data used in the examples are almost always available to the user. These data are associated with specific packages. For instance, the multilevel package uses a variety of data files to illustrate specific functions. To gain access to these data, one uses the data command:

>data(package="multilevel")

This command lists the data sets associated with the multilevel package, and the command

> data(bh1996, package="multilevel")

copies the bh1996 data set to the workspace, making it possible to work with the bh1996 dataframe.

If a package has been attached by library, its datasets are automatically included in the search, so that

> library(multilevel)

attaches the multilevel package;

> data()

lists all of the available data sets in the multilevel package and in other packages; and

> data(bh1996)

copies the data from the package to the workspace without requiring explicit specification of the package.

2.4.7 Exporting data to spreadsheets using write() and write.table()

As noted previously, there are likely to be occasions when it is useful to export data from R to spreadsheets. There are two functions that are useful for exporting data -- the write function and the write.table function. The write function is useful when one wants to export a vector, while the write.table function is useful for exporting dataframes or matrices. Both are illustrated below.

Let us assume that we were interested in calculating the average hours worked for the 99 companies in the bh1996 data set, and then exporting these 99 group means to a spreadsheet. To calculate the vector of 99 group means and write them out to a file we can issue the following commands:

> HRSMEANS<-tapply(bh1996$HRS,bh1996$GRP,mean)

> write(HRSMEANS,file="c:\\temp\\ghours.txt",ncolumns=1)

The tapply command subdivides HRS by GRP and then performs the function mean on the HRS data for each group. This command is similar to the aggregate function that will be discussed in more detail in section 3.2.2. The write function takes the 99 group means stored in the object HRSMEANS and writes them to a file called ghours.txt in the "c:\temp" subdirectory. It is important to use the ncolumns=1 option or else the write function will default to five columns. The ghours.txt file can be read into any spreadsheet as a vector of 99 values.

The write.table function is similar to the write function, except that one must specify the character value that will be used to separate columns. Common choices include tabs (designated as \t) and commas. Of these two common choices, commas are likely to be most useful in exporting dataframes or matrices to spreadsheets because programs like Microsoft EXCEL automatically read in comma delimited or csv files. Below I export the entire bh1996 dataframe to a comma delimited file that can be read directly into Microsoft EXCEL.

> write.table(bh1996,file="c:\\temp\\bhdat.csv",sep=",",row.names=F)

Notice the use of the sep="," option and also the row.names=F option. The row.names=F option stops the program from writing an additional column of row names typically stored as a vector from 1 to the number of rows. Omitting this column is important because it ensures that the column names match up with the correct columns. Recall from section 2.4.2 that one can use the file="clipboard" option to write directly to the Windows clipboard.

2.5 More on using matrix brackets on dataframes

At this point, it may be useful to reconsider the utility of using matrix brackets to access various parts of cohesion (see also section 2.3.1). While this may initially appear cumbersome, mastering the use of matrix brackets provides considerable control over one's dataframe.

Recall that one accesses various parts of the dataframe via [rows, columns]. So, for instance, we can access rows 1, 5, and 8 and columns 3 and 4 of the cohesion dataframe as follows:

> cohesion[c(1,5,8),3:4]
  COH01 COH02
1     4     5
5     4     4
8     3     1


> cohesion[c(1,5,8),c("COH01","COH02")]
  COH01 COH02
1     4     5
5     4     4
8     3     1

It is often useful to pick specific rows that meet some criteria. So, for example, we might want to pick rows that are from the 1ST PLATOON:

> cohesion[cohesion$PLATOON=="1ST",]
   UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1 1044B     1ST     4     5     5     5     5
2 1044B     1ST     3    NA     5     5     5
3 1044B     1ST     2     3     3     3     3
7 1044C     1ST     3     3     3     3     3
8 1044C     1ST     3     1     4     3     4

Upon inspection, we might want to further refine our choice and exclude missing values. We do this by adding another condition using the AND operator "&":

> cohesion[cohesion$PLATOON=="1ST"&is.na(cohesion$COH02)==F,]
   UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1 1044B     1ST     4     5     5     5     5
3 1044B     1ST     2     3     3     3     3
7 1044C     1ST     3     3     3     3     3
8 1044C     1ST     3     1     4     3     4

Using matrix brackets, one can easily and quickly specify particular portions of a dataframe that are of interest.
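For instance, row conditions and named columns can be combined in a single expression:

> cohesion[cohesion$UNIT=="1044C",c("UNIT","COH01","COH05")]
    UNIT COH01 COH05
7  1044C     3     3
8  1044C     3     4
9  1044C     3     3
10 1044C     2     2
11 1044C     1     3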

2.6 Identifying Statistical models in R

This section presumes the reader has some familiarity with statistical methodology, in particular with regression analysis and the analysis of variance. Almost all statistical models, from ANOVA to regression to mixed-effects models, are specified in a common format. The format is DV ~ IV1+IV2+IV3. In a regression model, this dictates that the dependent variable (DV) will be regressed on three independent variables. By using + between the IVs, the model is requesting only main effects. If the IVs were separated by the * sign, it would designate both main effects and interactions (all two- and three-way interactions in this case).
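For instance, using the cohesion data purely for illustration, the two calls below differ only in whether the interaction is requested:

> lm(COH05 ~ COH01 + COH02, data=cohesion) #main effects only
> lm(COH05 ~ COH01 * COH02, data=cohesion) #main effects plus the COH01:COH02 interaction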

2.6.1 Examples

A few examples may be useful in illustrating some other aspects of model specification. Suppose y, x, x0, x1 and x2 are numeric variables, and A, B, and C are factors or categorical variables. The following formulae on the left side below specify statistical models as described on the right.

y ~ x
y ~ 1 + x    Both imply the same simple linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one.

y ~ A        Single classification analysis of variance model of y, with classes determined by A. Basically a one-way analysis of variance.

y ~ A + x    Single classification analysis of covariance model of y, with classes determined by A, and with covariate x. Basically an analysis of covariance.

2.6.2 Linear models

The basic function for fitting ordinary multiple regression models is lm(), and a streamlined version of the call is as follows:

> fitted.model <- lm(formula, data = data.frame)

For example

> fm2 <- lm(y ~ x1 + x2, data = production)

would fit a multiple regression model regressing y on x1 and x2 (with an implicit intercept term). The important option data = production specifies where the variables are to be found.

2.6.3 Generic functions for extracting model information

The object created by lm() is a fitted model object; technically a list of results of class "lm". Information about the fitted model can then be displayed, extracted, plotted and so on by using generic functions that orient themselves to objects of class "lm". These include:

add1    coef      effects   kappa    predict   residuals
alias   deviance  family    labels   print     step
anova   drop1     formula   plot     proj      summary

A brief description of the most commonly used ones is given below.

coefficients(object)

Extract the regression coefficients. Short form: coef(object).

plot(object)

Produce four plots, showing residuals, fitted values and some diagnostics.

predict(object, newdata=data.frame)

The dataframe supplied must have variables specified with the same labels as the original. The value is a vector or matrix of predicted values corresponding to the determining variable values in data.frame.

print(object)

Print a concise version of the object. Most often used implicitly.

residuals(object)

Extract the (matrix of) residuals, weighted as appropriate. Short form: resid(object).

summary(object)

Print a comprehensive summary of the results of the regression analysis. The summary function is widely used to extract more information from objects whether the objects are dataframes or products of statistical functions.
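As a brief illustration, here is a sketch applying a few of these generics to a toy model fit to the cohesion data (any small dataframe would serve):

> fm <- lm(COH05 ~ COH01, data=cohesion)
> coef(fm)
(Intercept)       COH01
   1.619048    0.619048
> predict(fm, newdata=data.frame(COH01=c(2,4)))
       1        2
2.857143 4.095238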


2.7 Graphical procedures

Graphical facilities are an important and extremely versatile component of the R environment. It is possible to use the facilities to display a wide variety of statistical graphs and also to build entirely new types of graphs. The graphics facilities can be used in both interactive and batch modes, but in most cases, interactive use is more productive. Interactive use is also easy because at startup time R initiates a graphics device driver that opens a special graphics window for the display of interactive graphics. Although this is done automatically, it is useful to know that the command used is windows() under Windows. Once the device driver is running, R plotting commands can be used to produce a variety of graphical displays and to create entirely new kinds of display.

2.7.1 The plot() function

One of the most frequently used plotting functions in R is the plot() function. This is a generic function: the type of plot produced is dependent on the type or class of the first argument.

plot(x, y)
If x and y are vectors, plot(x, y) produces a scatterplot of y against x.

plot(df)
plot(~ a+b+c, data=df)
plot(y ~ a+b+c, data=df)
where df is a dataframe. The first example produces scatter plots of all of the variables in a dataframe. The second produces scatter plots for just the three named variables (a, b and c). The third example plots y against a, b and c.
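For instance, with the cohesion data:

> plot(cohesion$COH01, cohesion$COH05) #scatterplot of COH05 against COH01
> plot(COH05 ~ COH01, data=cohesion) #the same plot using a formula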

2.7.2 Displaying multivariate data

R provides two very useful functions for representing multivariate data. If X is a numeric matrix or dataframe, the command

> pairs(X)

produces a pairwise scatterplot matrix of the variables defined by the columns of X; that is, every column of X is plotted against every other column of X, and the resulting n(n - 1) plots are arranged in a matrix with plot scales constant over the rows and columns of the matrix.
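For instance, applied to the five cohesion items:

> pairs(cohesion[,3:7]) #scatterplot matrix of the five cohesion variables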

When three or four variables are involved a coplot may be more enlightening. If a and b are numeric vectors and c is a numeric vector or factor object (all of the same length), then

> coplot(a ~ b | c)

produces a number of scatterplots of a against b for given values of c. If c is a factor, this simply means that a is plotted against b for every level of c. When c is numeric, it is divided into a number of conditioning intervals and for each interval a is plotted against b for values of c within the interval. The number and position of intervals can be controlled with the given.values= argument to coplot() -- the function co.intervals() is useful for selecting intervals. You can also use two given variables with a command like

> coplot(a ~ b | c + d)

which produces scatterplots of a against b for every joint conditioning interval of c and d. The coplot() and pairs() functions both take an argument panel= which can be used to customize the type of plot that appears in each panel. The default is points() to produce a scatterplot, but by supplying some other low-level graphics function of two vectors x and y as the value of panel=, you can produce any type of plot you wish. An example panel function useful for coplots is panel.smooth().

2.7.3 Advanced Graphics and the lattice package

An advanced graphics package called lattice is included with the base program. The lattice package is an implementation of trellis graphics designed specifically for R that provides presentation quality graphics. Below is an example involving creating a histogram of 1000 random numbers.

> library(lattice)

> histogram(rnorm(1000),nint=30,xlab="1000 Random Numbers",
    col="sky blue")

[Figure: histogram of 1000 random numbers; x-axis "1000 Random Numbers", y-axis "Percent of Total"]

Another example taken from Bliese and Halverson (2002) provides an even better demonstration of the graphics capabilities of R and the lattice package. This example illustrates a two-way interaction on a three dimensional surface.

> library(multilevel)
> data(lq2002)
> TDAT<-lq2002[!duplicated(lq2002$COMPID),]
> tmod<-lm(GHOSTILE~GLEAD*GTSIG,data=TDAT)
> TTM<-seq(min(TDAT$GLEAD),max(TDAT$GLEAD),length=25)
> TTV<-seq(min(TDAT$GTSIG),max(TDAT$GTSIG),length=25)
> TDAT2<-list(GLEAD=TTM,GTSIG=TTV)
> grid<-expand.grid(TDAT2)
> fit<-predict(tmod,grid)


> wireframe(fit~GLEAD*GTSIG, data=grid, col="steelblue4",
    screen=list(z=-30, x=-60),
    xlab=list("Leadership \n Climate", cex=1.5),
    ylab=list("Task \n Significance", cex=1.5),
    zlab=list("Hostility", cex=1.5),
    scales=list(arrows=F),
    shade=T, colorkey=F) #or use drape=T instead of shade=T

3 Multilevel Analyses

The remainder of this document illustrates how R can be used in multilevel modeling, beginning with several R functions particularly useful for preparing data for subsequent analyses. Following data preparation, the manuscript covers:

• Within-group agreement and reliability
• Contextual and fixed-effects OLS models
• Covariance theorem decomposition
• Mixed Effects Models for nested group data
• Variants of Mixed Effects Models for Repeated Measures Data (Growth Modeling, Discontinuous Growth Modeling)


The discussion of within-group agreement and the covariance theorem decomposition also includes sections on Random Group Resampling (or RGR). RGR is a resampling technique that is useful in contrasting actual group results to pseudo-group results (see Bliese & Halverson, 2002; Bliese, Halverson & Rothberg, 2000).

3.1 Attaching the multilevel and nlme packages

Many of the features in the following sections assume that the multilevel and nlme packages are accessible in R. Recall that the multilevel package is not distributed with the base installation and needs to be retrieved using the "packages" GUI option in R. Also recall that once retrieved, the package is attached in R using the library command:

> library(multilevel)

By default, the nlme and MASS packages are loaded when the multilevel package is loaded as several of the functions in the multilevel package depend on nlme and MASS.

3.2 Multilevel data manipulation functions

3.2.1 The merge Function

One of the key data manipulation tasks that must be accomplished prior to estimating several of the multilevel models (specifically contextual models and mixed-effects models) is that group-level variables must be “assigned down” to the individual. To make a dataframe containing both individual and group-level variables, one typically begins with two separate dataframes. One dataframe contains individual-level data, and the other dataframe contains group-level data. By combining these two dataframes using a group identifying variable common to both, one is able to create a single data set containing both individual and group data. In R, combining dataframes is accomplished using the merge function.

For instance, consider the cohesion data introduced when showing how to read data from external files. The cohesion data is included as a multilevel data set, so we can use the data function to bring it from the multilevel package to the working environment without having to use read.csv or read.table (see section 2.4.1).

> data(package="multilevel")

Data sets in package ‘multilevel’:

bh1996      Data from Bliese and Halverson (1996)
bhr2000     Data from Bliese, Halverson and Rothberg (2000)
chen2005    Data from Chen (2005)
cohesion    Five cohesion ratings from 11 individuals nested in 4 platoons in 2 larger units
klein2000   Data from Klein, Bliese, Kozlowski et al. (2000)
lq2002      Data used in special issue of Leadership Quarterly, Vol. 13, 2002
sherifdat   Sherif (1935) group data from 3 person teams
tankdat     Tank data from Bliese and Lang (in press)
univbct     Data from Bliese and Ployhart (2002)


> data(cohesion)
> cohesion
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05
1  1044B     1ST     4     5     5     5     5
2  1044B     1ST     3    NA     5     5     5
3  1044B     1ST     2     3     3     3     3
4  1044B     2ND     3     4     3     4     4
5  1044B     2ND     4     4     3     4     4
6  1044B     2ND     3     3     2     2     1
7  1044C     1ST     3     3     3     3     3
8  1044C     1ST     3     1     4     3     4
9  1044C     2ND     3     3     3     3     3
10 1044C     2ND     2     2     2     3     2
11 1044C     2ND     1     1     1     3     3

Now assume that we have another dataframe with platoon sizes. We can create this dataframe as follows:

> group.size<-data.frame(UNIT=c("1044B","1044B","1044C","1044C"),
    PLATOON=c("1ST","2ND","1ST","2ND"),PSIZE=c(3,3,2,3))
> group.size #look at the group.size dataframe
   UNIT PLATOON PSIZE
1 1044B     1ST     3
2 1044B     2ND     3
3 1044C     1ST     2
4 1044C     2ND     3

To create a single file (new.cohesion) that contains both individual and platoon information, use the merge command.

> new.cohesion<-merge(cohesion,group.size,by=c("UNIT","PLATOON"))
> new.cohesion
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05 PSIZE
1  1044B     1ST     4     5     5     5     5     3
2  1044B     1ST     3    NA     5     5     5     3
3  1044B     1ST     2     3     3     3     3     3
4  1044B     2ND     3     4     3     4     4     3
5  1044B     2ND     4     4     3     4     4     3
6  1044B     2ND     3     3     2     2     1     3
7  1044C     1ST     3     3     3     3     3     2
8  1044C     1ST     3     1     4     3     4     2
9  1044C     2ND     3     3     3     3     3     3
10 1044C     2ND     2     2     2     3     2     3
11 1044C     2ND     1     1     1     3     3     3

Notice that every individual now has a value for PSIZE – a value that reflects the number of individuals in the platoon.

In situations where there is a single unique group identifier, the by option can be simplified to include just one variable. For instance, if the group-level data had reflected values for each UNIT instead of PLATOON nested in unit, the by option would simply read by="UNIT". In the case of PLATOON, however, there are numerous platoons with the same name (1ST, 2ND), so unique platoons need to be identified within the nesting of the larger UNIT.
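As a sketch, assuming a hypothetical unit-level dataframe unit.size with one row per UNIT:

> unit.size<-data.frame(UNIT=c("1044B","1044C"),USIZE=c(6,5)) #hypothetical unit-level data
> merge(cohesion,unit.size,by="UNIT") #USIZE is assigned down to all members of each UNIT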


3.2.2 The aggregate function

In many cases in multilevel analyses, one will be interested in creating a group-level variable from individual responses. For example, one might be interested in calculating the group mean and reassigning it back to the individual. In these cases, the aggregate function in combination with the merge function is highly useful. In our cohesion example, for instance, we want to assign platoon means for the variables COH01 and COH02 back to the individuals. The first step in this process is to create a group-level file using the aggregate function. The aggregate function has three key arguments. The first argument is a vector or matrix of variables that one wants to convert to group-level variables. Second is the grouping variable(s) included as a list, and third is the function (mean, var, length, etc.) executed on the variables. To calculate the means of COH01 and COH02 (columns 3 and 4 of the cohesion dataframe), issue the command:

> TEMP<-aggregate(cohesion[,3:4],list(cohesion$UNIT,cohesion$PLATOON),mean)
> TEMP
  Group.1 Group.2    COH01    COH02
1   1044B     1ST 3.000000       NA
2   1044C     1ST 3.000000 2.000000
3   1044B     2ND 3.333333 3.666667
4   1044C     2ND 2.000000 2.000000

Notice that COH02 has an NA value for the mean. The NA value occurs because there was a missing value in the individual-level file. If we decide to base the group mean on the non-missing individual values from group members, we can add the parameter na.rm=T to designate that NA values should be removed prior to calculating the group mean.

> TEMP<-aggregate(cohesion[,3:4],list(cohesion$UNIT,cohesion$PLATOON),
    mean,na.rm=T)
> TEMP
  Group.1 Group.2    COH01    COH02
1   1044B     1ST 3.000000 4.000000
2   1044C     1ST 3.000000 2.000000
3   1044B     2ND 3.333333 3.666667
4   1044C     2ND 2.000000 2.000000

To merge the TEMP dataframe with the new.cohesion dataframe, we can change the names of the group identifiers in the TEMP frame to match the group identifiers in the new.cohesion dataframe. We also want to change the names of COH01 and COH02 to reflect the fact that they are group means. We will use “G.” to designate group mean.

> names(TEMP)<-c("UNIT","PLATOON","G.COH01","G.COH02")

Finally, we merge TEMP up with new.cohesion to get the complete data set.

> final.cohesion<-merge(new.cohesion,TEMP,by=c("UNIT","PLATOON"))
> final.cohesion
    UNIT PLATOON COH01 COH02 COH03 COH04 COH05 PSIZE  G.COH01  G.COH02
1  1044B     1ST     4     5     5     5     5     3 3.000000 4.000000
2  1044B     1ST     3    NA     5     5     5     3 3.000000 4.000000
3  1044B     1ST     2     3     3     3     3     3 3.000000 4.000000
4  1044B     2ND     3     4     3     4     4     3 3.333333 3.666667

(28)

5 1044B 2ND 4 4 3 4 4 3 3.333333 3.666667 6 1044B 2ND 3 3 2 2 1 3 3.333333 3.666667 7 1044C 1ST 3 3 3 3 3 2 3.000000 2.000000 8 1044C 1ST 3 1 4 3 4 2 3.000000 2.000000 9 1044C 2ND 3 3 3 3 3 3 2.000000 2.000000 10 1044C 2ND 2 2 2 3 2 3 2.000000 2.000000 11 1044C 2ND 1 1 1 3 3 3 2.000000 2.000000 The aggregate and merge functions provide tools necessary to manipulate data and prepare it for subsequent multilevel analyses (excluding growth modeling considered later). Again, note that this illustration uses a relatively complex situation where there are two levels of nesting (Platoon within Unit). In cases where there is only one grouping variable (for example, UNIT) the commands for aggregate and merge contain the name of a single grouping variable. For instance,

> TEMP<-aggregate(cohesion[,3:4],list(cohesion$UNIT),mean,na.rm=T)
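The corresponding rename and merge steps for the single-identifier case would then be (a sketch following the same “G.” naming convention used above):

> names(TEMP)<-c("UNIT","G.COH01","G.COH02")  #rename to match cohesion and flag group means
> merge(cohesion,TEMP,by="UNIT")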

3.3 Within-Group Agreement and Reliability

The data used in this section are taken from Bliese, Halverson & Rothberg (2000). The examples are based upon the bhr2000 data set from the multilevel package. Thus, the first step is to make the bhr2000 data set available for analysis and examine the properties of the dataframe.

> help(bhr2000)

> data(bhr2000)  #imports the data into the working environment
> names(bhr2000)
 [1] "GRP"   "AF06"  "AF07"  "AP12"  "AP17"  "AP33"  "AP34"  "AS14"  "AS15"  "AS16"  "AS17"  "AS28"  "HRS"   "RELIG"
> nrow(bhr2000)
[1] 5400

The names function identifies 14 variables. The first, GRP, is the group identifier. The variables in columns 2 through 12 are individual responses on 11 items that make up a leadership scale. HRS represents individuals’ reports of work hours, and RELIG represents individuals’ reports of the degree to which religion is a useful coping mechanism. The nrow command indicates that there are 5,400 observations. To find out how many groups there are, we can use the length command in conjunction with the unique command:

> length(unique(bhr2000$GRP))
[1] 99
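Relatedly (this check is not part of the original text), the table function tallies the number of responses per group, which is often worth inspecting before computing agreement indices:

> table(bhr2000$GRP)  #number of observations in each of the 99 groups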

There are several functions in the multilevel library that are useful for calculating and interpreting agreement indices: rwg, rwg.j, rwg.sim, rwg.j.sim, rwg.j.lindell, awg, ad.m, ad.m.sim and rgr.agree. The rwg function calculates the James, Demaree & Wolf (1984) rwg for single-item measures; the rwg.j function calculates the James et al. (1984) rwg(j) for multi-item scales. The rwg.j.lindell function calculates r*wg(j) (Lindell & Brandt, 1997, 1999). The awg function calculates the awg agreement index proposed by Brown and Hauenstein (2005). The ad.m function calculates average deviation (AD) values for the mean or median (Burke, Finkelstein & Dusig, 1999). A series of functions with “sim” in the name (rwg.sim, rwg.j.sim and ad.m.sim) allow one to simulate agreement values from a random uniform distribution to test for the statistical significance of agreement. The simulation functions are based on work by Dunlap, Burke and Smith-Crowe (2003); Cohen, Doveh and Eich (2001); and Cohen, Doveh and Nahum-Shani (2009). Finally, the rgr.agree function performs a Random Group Resampling (RGR) agreement test (see Bliese et al., 2000).

In addition to the agreement measures, there are two multilevel reliability measures, ICC(1) and ICC(2), that can be estimated from ANOVA models using the ICC1 and ICC2 functions. As Bliese (2000) and others (e.g., Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975) have noted, reliability measures such as the ICC(1) and ICC(2) are fundamentally different from agreement measures; nonetheless, they often provide complementary information to agreement measures, so this section illustrates the use of each of these functions using the dataframe bhr2000.
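As a brief preview (a sketch, since these functions are illustrated in detail later; the choice of HRS as the outcome here is purely illustrative), both functions take a fitted one-way ANOVA model:

> mod<-aov(HRS~as.factor(GRP),data=bhr2000)  #one-way ANOVA with group as a factor
> ICC1(mod)  #proportion of variance explained by group membership
> ICC2(mod)  #reliability of the group means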

3.3.1 Agreement: rwg, rwg(j), and r*wg(j)

Both the rwg and rwg.j functions are based upon the formulations described in James et al. (1984). Both functions require the user to specify three pieces of information: the variable of interest (x), the grouping variable (grpid), and the estimate of the expected random variance (ranvar). The default estimate of ranvar is 2, which is the expected random variance based upon the rectangular distribution for a 5-point item (i.e., σ²EU) calculated using the formula ranvar=(A^2-1)/12, where A represents the number of response options associated with the scale anchors. See help(rwg), James et al. (1984), or Bliese et al. (2000) for details on selecting appropriate ranvar values.
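As a quick arithmetic check (this calculation is not part of the original text), the default value can be verified directly in R:

> A<-5        #number of response options on a 5-point item
> (A^2-1)/12  #expected random variance under the rectangular (uniform) distribution
[1] 2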

To use the rwg function to calculate agreement on the coping-using-religion item (RELIG in the bhr2000 dataframe), one would issue the following commands.

> RWG.RELIG<-rwg(bhr2000$RELIG,bhr2000$GRP,ranvar=2)
> RWG.RELIG[1:10,]  #examine first 10 rows of data
   grpid        rwg gsize
1      1 0.11046172    59
2      2 0.26363636    45
3      3 0.21818983    83
4      4 0.31923077    26
5      5 0.22064137    82
6      6 0.41875000    16
7      7 0.05882353    18
8      8 0.38333333    21
9      9 0.14838710    31
10    10 0.13865546    35

This returns a dataframe with three columns. The first column contains the group names (grpid), the second column contains the 99 rwg values – one for each group – and the third column contains the group size. To calculate the mean rwg value, use the summary command:

> summary(RWG.RELIG)
     grpid         rwg             gsize
 1      : 1   Min.   :0.0000   Min.   :  8.00
 10     : 1   1st Qu.:0.1046   1st Qu.: 29.50
 11     : 1   Median :0.1899   Median : 45.00
 12     : 1   Mean   :0.1864   Mean   : 54.55
 13     : 1   3rd Qu.:0.2630   3rd Qu.: 72.50
 14     : 1   Max.   :0.4328   Max.   :188.00
 (Other):93

The summary command informs us that the average rwg value is .186 and the range is from 0 to 0.433. By convention, values at or above 0.70 are considered good agreement, so there appears to be low agreement among individuals with regard to coping using religion. The summary command also provides information about the group sizes.
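A simple supplementary check (not in the original text) is to count how many of the 99 groups reach the conventional .70 cutoff; given the maximum of 0.433, none do:

> sum(RWG.RELIG[,2]>=.70)  #number of groups at or above the .70 convention
[1] 0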

Other useful options include sorting the values or examining them in a histogram. Recall that the notation [,2] selects all rows and the second column of the RWG.RELIG object – the column with the rwg results.

> sort(RWG.RELIG[,2])
> hist(RWG.RELIG[,2])

To calculate rwg for work hours, the expected random variance (EV) needs to be changed from its default value of 2. Work hours was asked using an 11-point item, so EV based on the rectangular distribution is 10.00 (σ²EU=(11^2-1)/12) – see the rwg help file for details.

> RWG.HRS<-rwg(bhr2000$HRS,bhr2000$GRP,ranvar=10.00)
> mean(RWG.HRS[,2])
[1] 0.7353417

There is apparently much higher agreement about work hours than there was about whether group members used religion as a coping mechanism in this sample. By convention, this mean value would indicate agreement because rwg (and rwg(j)) values above .70 are considered to provide evidence of agreement.

The use of the rwg.j function is nearly identical to the use of the rwg function except that the first argument to rwg.j is a matrix instead of a vector. In the matrix, each column represents one item in the multi-item scale, and each row represents an individual response. For instance, columns 2-12 in bhr2000 represent 11 items comprising a leadership scale. The items were assessed using 5-point response options (Strongly Disagree to Strongly Agree), so the expected random variance is 2.

> RWGJ.LEAD<-rwg.j(bhr2000[,2:12],bhr2000$GRP,ranvar=2)
> summary(RWGJ.LEAD)
     grpid        rwg.j            gsize
 1      : 1   Min.   :0.7859   Min.   :  8.00
 10     : 1   1st Qu.:0.8708   1st Qu.: 29.50
 11     : 1   Median :0.8925   Median : 45.00
 12     : 1   Mean   :0.8876   Mean   : 54.55
 13     : 1   3rd Qu.:0.9088   3rd Qu.: 72.50
 14     : 1   Max.   :0.9440   Max.   :188.00
 (Other):93

Note that Lindell and colleagues (Lindell & Brandt, 1997, 1999, 2000; Lindell, Brandt & Whitney, 1999) have raised concerns about the mathematical underpinnings of the rwg(j) formula. Specifically, they note that this formula is based upon the Spearman-Brown reliability estimator. Generalizability theory provides a basis to believe that reliability should increase as the number of measurements increases, so the Spearman-Brown formula is defensible for measures of reliability. There may be no theoretical grounds, however, to believe that generalizability theory applies to measures of agreement. That is, there may be no reason to believe that agreement should increase simply as a function of the number of items in a scale.
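Although the section breaks off here, the r*wg(j) values themselves can be obtained with rwg.j.lindell, which takes the same arguments as rwg.j (a sketch; the output, assumed to follow the same three-column structure as the other agreement functions, is not shown):

> RWGJ.LIND<-rwg.j.lindell(bhr2000[,2:12],bhr2000$GRP,ranvar=2)
> summary(RWGJ.LIND[,2])  #distribution of the r*wg(j) values across groups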
