Quantitative, Positivist Research Methods in Information Systems



Section 6. Issues of Measurement

6.1 Priority of Measurement
Measurement is, arguably, the most important thing that a quantitative, positivist researcher can do to ensure that the results of the study can be trusted. Figure 5 shows how to prioritize the assessment of measurement as opposed to other validities such as internal validity and statistical conclusion validity.


Figure 5. Validation Decision Tree (Based on Straub, Boudreau, and Gefen, 2004 and Straub, 1989). [Legend: Green is preferred path; yellow is cautionary; red is least desirable]

    Row 1: Ooooppppsss. Good statistical conclusion validity, poor internal validity, no measurement validity
    Imagine a situation where you carry out a series of statistical tests and find terrific significance. You are hopeful that your model is accurate and that the statistical conclusions will show that the relationships you posit are true and important. Unfortunately, unbeknownst to you, the model you specify is wrong. There are variables you have not included that explain even more variance than your model does! So you have poor internal validity, but good statistical conclusion validity. Even worse, you did not validate your constructs, so you do not know whether the statistical results are telling you something about the true relationship or about a fallacious relationship deriving from poor measures.

    Row 2: Ooooppppsss. Good internal validity, good statistical conclusion validity, poor instrumentation validity
    Why would this be so? Internal validity is a matter of causality. Can we rule out other reasons for why the IVs and DVs are or are not related (Cook and Campbell, 1979, and many other researchers such as Bagozzi, 1980)? Consider the following. You are testing constructs to see which variable would or could "confound" your contention that a certain variable or research construct is a good explanation for a set of effects. But if either the posited IV or the confound (rival construct) is poorly measured, then you cannot know with any certainty whether one or the other variable is the true cause. So you have an internal validity problem that is not simply a matter of testing the strength of either the confound or the theoretical IV on the outcome; it is a matter of whether you can trust the measurement of either variable. Without instrumentation validity, it is really not possible to assess internal validity. Nevertheless, we are still in better shape than in row 1 since, if the measures are acceptable, then we have established internal validity and the statistics are now more meaningful. But we have to assume that the measures are acceptable, which may or may not be so.

    Row 3: The Big Picture (All forms of validity)
    You have validated your measures, you have tested for rival explanations, and you have used the correct stats. This is ideal!

    What is important to remember is that instrumentation or measurement validity is the critical first step in quantitative, positivist research. If your instrumentation is not acceptable at a minimal level, then the findings from the study are meaningless. You cannot trust or contend that you have internal validity or statistical conclusion validity. Reviewers should be especially attuned to measurement problems for this reason. If the measures are not reasonable, then forget it! The paper should be rejected. There is no scientific value to the work. On this web site, we have changed the terminology of this form of validity from "instrument validity" in Straub (1989) to instrumentation validity. We will discuss later why other matters are important besides the instrument itself. In an experiment, for example, it is critical that a researcher check not only the experimental instrument, but also whether the experimental task is properly phrased. This task is part of the instrumentation, but not part of the instrument.
6.2 Details of Instrumentation Validity
Straub, Boudreau, and Gefen (2004) describe the "ins" and "outs" of instrumentation validity. Their paper presents the arguments for why various forms of instrumentation validity should be mandatory and why others are optional. Basically, there are four types of scientific validity with respect to instrumentation. They are: (1) content validity, (2) construct validity, (3) reliability, and (4) manipulation validity.
    6.2.1. Content Validity
    Content validity includes all the ways that a researcher can show that the instrumentation is a reasonable sampling from all of the forms that the researcher could have chosen to capture a construct (Cronbach, 1971). Suppose you included satisfaction with the IS staff in your measurement of a construct called User Information Satisfaction (UIS) but you forgot to include satisfaction with the system itself. Another researcher might feel that you have missed the boat and that you did not draw well from all of the possible measures of this construct. This person could legitimately argue that your content validity was not the best.

    6.2.2. Construct Validity
    With construct validity, we are interested in whether the instrumentation has truly captured operations that will result in constructs that are not subject to common methods bias and other forms of bias. Maybe some of the questionnaire items, the verbiage in the interview script, or the task descriptions in an experiment are ambiguous and are giving the participants the impression that they mean something different from what was intended. This is a construct validity issue. The problems occur in two major ways. Items or phrases in the instrumentation are not related in the way they should be, or they are related in ways they should not be. If the items do not converge or run together as they should, it is called a convergent validity problem. If they do not segregate or differ from each other as they should, then it is called a discriminant validity problem.

    As Straub, Boudreau, and Gefen (2004), Gefen, Straub, and Boudreau (2000) and Straub (1989) point out, there are numerous ways to test for construct validity. A Principal Components Analysis (PCA) of a dataset will show how many factors the statistical algorithm in a package like SPSS or SAS has determined to be the best-fitting model for the data. Suppose you posit four constructs or factors. How do you test for convergent validity? Because you wrote the items, phrasing, or experimental task descriptions yourself, you know that items 1, 5, and 23 are supposed to capture a latent or unobservable construct that could be called "Perceived Usefulness" (PU) of an information system. Moreover, you are of the opinion that items 8, 11, and 25 are items representing a latent construct that you have conceptualized as "Perceived Ease-of-Use" (PEOU). If the PCA includes all 6 of these items and yields a result in which an expected PEOU item loads more highly (and at a level equal to or greater than .5) onto the presumed PU factor than onto the PEOU factor, then this "cross-loading" would seem to indicate that the construct validity, viz. the convergent validity, was not acceptable. If, on the other hand, items 1, 5, and 23 load most highly on PU and items 8, 11, and 25 load most highly on PEOU, then we can say that they loaded "cleanly" and we have established convergent validity.
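
The "clean loading" rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the loading matrix is hypothetical (it is not output from any real study), the function name and the messages are invented for the example, and the .5 cutoff is the one mentioned in the text. In practice the loadings would come from the rotated solution reported by a package such as SPSS or SAS.

```python
import numpy as np

def find_loading_problems(loadings, item_names, factor_names, intended, threshold=0.5):
    """Flag items that cross-load or load weakly.

    loadings: items x factors array of (rotated) loadings.
    intended: dict mapping each item name to the factor it was written for.
    """
    abs_load = np.abs(np.asarray(loadings, dtype=float))
    problems = {}
    for row, item in enumerate(item_names):
        target = factor_names.index(intended[item])
        strongest = int(np.argmax(abs_load[row]))
        if strongest != target and abs_load[row, strongest] >= threshold:
            # Item loads highest, and at >= threshold, on the wrong factor.
            problems[item] = "cross-loads on " + factor_names[strongest]
        elif abs_load[row, target] < threshold:
            # Item does not reach the cutoff on its own factor.
            problems[item] = "weak loading on " + factor_names[target]
    return problems

# Hypothetical loadings for the six items discussed in the text.
item_names = ["item1", "item5", "item23", "item8", "item11", "item25"]
factor_names = ["PU", "PEOU"]
intended = {"item1": "PU", "item5": "PU", "item23": "PU",
            "item8": "PEOU", "item11": "PEOU", "item25": "PEOU"}
loadings = [
    [0.82, 0.12],
    [0.78, 0.20],
    [0.75, 0.15],
    [0.18, 0.80],
    [0.64, 0.31],  # written as a PEOU item, but loads on PU
    [0.10, 0.77],
]

problems = find_loading_problems(loadings, item_names, factor_names, intended)
print(problems or "all items load cleanly")
```

With these made-up numbers only item 11 is flagged, which is the kind of cross-loading the paragraph above describes.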

    In covariance-based SEM such as LISREL, convergent validity is examined by looking first at the significance of the loadings and then at the modification indices. In LISREL, modification indices show where the fit of the model could be improved by sacrificing a degree of freedom to add a path to the model. If the path that is added in an improved model alters some of the a priori theoretical item-to-construct specification, then this is equivalent to a cross-loading in PCA. The same PCA can also be used to check measurement properties that show that the constructs are distinct from each other. If the items load "cleanly," as expressed above, then the factor analysis also differentiates between the constructs and establishes discriminant validity. The most robust test of both convergent and discriminant validity is to include other items that are not part of the theoretical model. These items should not include variables that are downstream or upstream in the theoretical model from the variables being tested, as argued by Gefen, Straub, and Boudreau (2000). In this article, they point out that variables in another "stage" of the theoretical model are, in fact, expected to covary with their effects, for example, and so a researcher is submitting the data to an unfair test. In the typical TAM test, for instance, a downstream effect like attitude toward use, intention to use, or use may actually cross-load on PU or PEOU since these are causally related. A cross-loaded result like this complicates the life of a researcher since it should be reported, and the reviewers may be confused by the cross-loading and reject the paper; the explanation that these variables were causally related may sound lame, and the researcher is then stuck trying to offer explanations.

    Other ways to test construct validity are explicated in Gefen, Straub, and Boudreau (2000). In this tutorial, structural equation modeling (SEM) is set forth as a means to examine construct validity. With least squares SEM such as partial least squares (PLS), the researcher specifies the indicators or items that are expected to load onto a set of latent constructs. To establish convergent validity, the test is twofold. First, are the loadings significant? PLS generates t-statistics for loadings by running a bootstrap (preferred) or jackknife on the raw data. Second, are the loadings greater than .5 (Fornell and Larcker, 1981)? To establish discriminant validity, a statistic known as average variance extracted (AVE) is calculated for each latent construct as the average communality of its items. The test for discriminant validity, according to Fornell and Larcker (1981), is to compare the square root of each AVE with the correlations between the latent constructs. The latter are usually generated by the PLS stat package, but they can also be created by the researcher using the formulations in Diamantopoulos and Winklhofer (2001). Placing the square roots of the AVEs on the diagonal of the matrix of construct correlations allows one to compare each of them to the other numbers in the same row and column; each should be larger than those correlations. A spreadsheet that assists researchers in calculating AVEs as well as PLS composite reliability can be downloaded from this web site. In one sense, a researcher need not test construct validity when using covariance-based SEM like LISREL. Latent variables are posited to have both common and specific variance, and the maximum likelihood estimators associated with each item account for measurement error in the model. Coefficients for the loadings and the paths also estimate the shared variance between latent constructs, so most kinds of error are accounted for in determining the loadings and paths.
In short, SEM, including both least-squares and covariance-based techniques, accounts for error, including measurement error, in a holistic way. The argument is made by some that the overall statistic in LISREL known as root mean square residual (RMSR) indicates how much variance is not explained by the model. According to Gefen, Straub, and Boudreau (2000), if RMSR is greater than .05, then there is too much error, including measurement error.
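
The Fornell and Larcker comparison can be made concrete with a short Python sketch. It rests on assumptions that are not in the text: the standardized loadings and the inter-construct correlation are hypothetical numbers, the function names are invented, AVE is computed as the mean squared standardized loading, and composite reliability uses the usual formula based on standardized loadings. The square root of each AVE is placed on the diagonal of the construct correlation matrix and compared with the off-diagonal values in its row.

```python
import numpy as np

def average_variance_extracted(std_loadings):
    """AVE: mean of the squared standardized loadings of one construct."""
    lam = np.asarray(std_loadings, dtype=float)
    return float(np.mean(lam ** 2))

def composite_reliability(std_loadings):
    """Composite reliability from standardized loadings of one construct."""
    lam = np.asarray(std_loadings, dtype=float)
    true_part = lam.sum() ** 2
    error_part = np.sum(1.0 - lam ** 2)
    return float(true_part / (true_part + error_part))

def fornell_larcker(loadings_by_construct, construct_corr):
    """Return the comparison table and a pass/fail flag per construct."""
    names = list(loadings_by_construct)
    sqrt_ave = np.array([
        np.sqrt(average_variance_extracted(loadings_by_construct[n]))
        for n in names
    ])
    off_diag = np.abs(np.asarray(construct_corr, dtype=float))
    np.fill_diagonal(off_diag, 0.0)
    table = off_diag.copy()
    np.fill_diagonal(table, sqrt_ave)  # sqrt(AVE) goes on the diagonal
    passes = {n: bool(sqrt_ave[i] > off_diag[i].max())
              for i, n in enumerate(names)}
    return table, passes

# Hypothetical standardized loadings and construct correlation.
loadings_by_construct = {"PU": [0.82, 0.78, 0.75], "PEOU": [0.80, 0.70, 0.77]}
construct_corr = [[1.00, 0.45],
                  [0.45, 1.00]]

table, passes = fornell_larcker(loadings_by_construct, construct_corr)
print(np.round(table, 3))
print(passes)
```

A construct passes in this sketch when the square root of its AVE exceeds every correlation it has with the other constructs.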

    6.2.3. Reliability
    Reliability is the assurance that the items posited to measure a construct are sufficiently related to be reliable (i.e., low on measurement error) considered as a set of items (Cronbach, 1951). The difference between reliability and convergent validity is a matter of how one proves that a set of items "run" together. In convergent validity, this test involves a comparison to other variables. Reliability tests look only at the items in the scale and do not compare across constructs. There are many ways in which reliability can be tested. As Straub, Boudreau, and Gefen (2004) show, these include: internal consistency (commonly assessed with Cronbach's alpha), composite reliability (PLS and LISREL stats), split-half reliability, test-retest reliability, alternate-forms reliability, inter-rater reliability, and unidimensional reliability. That article discusses each of these in detail, and so these topics do not bear repeating here.
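
As a small illustration of internal consistency, Cronbach's alpha for a scale of k items can be computed from the item variances and the variance of the summed scale. The sketch below is a minimal version under stated assumptions: the response data are hypothetical, the function name is invented for the example, and sample variances (ddof=1) are used throughout.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents x items array of scores."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]                               # number of items in the scale
    item_variances = x.var(axis=0, ddof=1)       # variance of each item
    total_variance = x.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return float(k / (k - 1) * (1.0 - item_variances.sum() / total_variance))

# Hypothetical responses: 5 respondents, 3 items on a 1-5 scale.
responses = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Note that the calculation looks only at the items within the one scale, which is the distinction from convergent validity drawn above.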

    6.2.4. Manipulation Validity
    This form of validity applies only to experiments and is discussed in greater detail, including stats for assessing it, in Straub, Boudreau, and Gefen (2004). Suffice it to say at this point that in experiments, it is critical that the subjects are manipulated by the treatments and, conversely, that the control group is not manipulated. One way to do this is to ask subjects. Those who were aware that they were manipulated are good subjects. Those who were not aware, depending on the nature of the treatments, should have been assigned to the control group.

    Those who were not manipulated add unwanted variance to your sample since you are measuring the effects of a treatment that did not take place. So it is quite legitimate to remove them from the tested samples. On the other hand, if you still get the effects, in between-group studies especially, then you have a very robust effect. You could quite rightly argue that the effect is larger than it would have been had you restricted the samples to those who replied in a manner that indicated that they were, indeed, manipulated or control subjects.

    6.2.5. Other Types of Instrumentation Validity
    There are other types of validation that apply to social science research that do not fall neatly into the categories covered in Gefen, Straub, and Boudreau (2000). The procedures that accompany most research should be subject to scrutiny even though there may be no straightforward statistical test that verifies that the procedure works. Interviews typically follow a script, sometimes called an "interview schedule" (this sounds like a misnomer, but it is a common way of talking about the interview script). Does the script create the wrong atmosphere for an interview where the interviewer is attempting to gather sensitive information, for instance, from the informant? In an experiment, are the instructions describing the experimental task clear? Or are the subjects making mistakes in what they are doing, with these mistakes being taken for ordinary variance associated with differential performance? The way researchers typically ensure that these forms of instrumentation are adequate from a measurement standpoint is to run pretests and pilot tests. Debriefing the respondents and subjects can help to identify areas where the instrumentation is weak or leading the respondents astray.
6.3 Fixing Problems with Construct Validity
Suppose you test for construct validity and find out that some of your constructs are problematic. What can and should you do? The approach described in Churchill (1979) is to purify the measures. If one follows this approach, including pretesting and pilot testing the instrumentation, the likelihood that there will be no problems to begin with is much higher. Churchill (1979) also justifies a view that holds that measures that do not contribute to the constructs in a factor analysis can be dropped. This applies to PCA, PLS, and covariance-based SEM.
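
One simple way to screen for items that may not be contributing to a scale is the corrected item-total correlation, that is, the correlation of each item with the sum of the remaining items. The sketch below is an assumption-laden illustration, not the procedure prescribed in the text: the data are simulated, the function names are invented, and the .3 cutoff is a rule of thumb chosen for the example. A flagged item is a candidate for review, not an automatic deletion.

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the sum of the other items in the scale."""
    x = np.asarray(items, dtype=float)
    total = x.sum(axis=1)
    return np.array([
        np.corrcoef(x[:, j], total - x[:, j])[0, 1]
        for j in range(x.shape[1])
    ])

def items_to_review(items, cutoff=0.3):
    """Column indices whose corrected item-total correlation is below cutoff."""
    correlations = corrected_item_total(items)
    return [j for j, value in enumerate(correlations) if value < cutoff]

# Simulated data: three items driven by one construct, plus one stray item.
rng = np.random.default_rng(0)
construct = rng.normal(size=300)
good_items = construct[:, None] + 0.5 * rng.normal(size=(300, 3))
stray_item = rng.normal(size=(300, 1))   # unrelated to the construct
data = np.hstack([good_items, stray_item])

print(np.round(corrected_item_total(data), 2))
print(items_to_review(data))
```

In this simulation the fourth column is the one flagged, since it was generated independently of the construct.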