Brief Introduction to Statistics

Daxue Consulting

Contents

1 Methods for Describing a Set of Data
  1.1 Numerical Measures of Central Tendency
  1.2 Numerical Measures of Relative Standing
2 Random Variables and Probability Distributions
  2.1 Two Types of Random Variables
3 Discrete Random Variables
  3.1 Probability Distribution for Discrete Random Variables
  3.2 The Binomial Distribution
  3.3 Other Discrete Distributions
4 Continuous Random Variables
  4.1 Probability Distributions for Continuous Random Variables
  4.2 The Normal Distribution
5 Estimation with Confidence Intervals
  5.1 Confidence Interval for a Population Mean: Normal (z) Statistic (Known Variance)
  5.2 Confidence Interval for a Population Mean: Student's t-Statistic (Unknown Variance)
  5.3 Sampling Size of Population Mean
  5.4 Large-Sample Confidence Interval for a Population Proportion
  5.5 Sampling Size of Population Proportion
6 Hypothesis Testing
  6.1 The Elements of a Test of Hypothesis
7 Inference About Two Populations
  7.1 Population Mean Between Two Matched Samples
  7.2 Comparing Two Population Means: Independent Sampling
  7.3 Comparison of Two Population Proportions
8 Chi-squared Test of Independence
  8.1 Testing Category Probabilities: One-Way Table
  8.2 Testing Category Probabilities: Two-Way (Contingency) Table
  8.3 Chi-squared Test of Independence
9 Non-Parametric Tests
  9.1 Sign Test
  9.2 Wilcoxon Signed-Rank Test
  9.3 Mann-Whitney-Wilcoxon Test
  9.4 Kruskal-Wallis Test


1 Methods for Describing a Set of Data

1.1 Numerical Measures of Central Tendency

When we speak of a data set, we refer to either a sample or a population. If statistical inference is our goal, we’ll wish ultimately to use sample numerical descriptive measures to make inferences about the corresponding measures for the population.

Two numerical descriptive measures summarize a quantitative data set:

1. The central tendency of the set of measurements: the tendency of the data to cluster, or center, about certain numerical values
2. The variability of the set of measurements: the spread of the data

1.1.1 Measure of central tendency

1. Mean

The **mean** of a set of quantitative data is the sum of the measurements divided by the number of measurements.

The sample mean, x̄, will play an important role in accomplishing our objective of making inferences about populations based on sample information.

For this reason, we need a different symbol for the mean of a population.

• x̄: sample mean
• μ: population mean

We'll often use the sample mean x̄ to estimate (make an inference about) the population mean, μ.
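As a quick illustration of this idea (with simulated, hypothetical data, not the R&D figures below), we can draw a sample from a population whose mean we happen to know and compare x̄ with μ:

```r
# Simulated illustration: estimating mu with x_bar (hypothetical data)
set.seed(42)
population <- rnorm(100000, mean = 8.5, sd = 2)  # pretend population, mu = 8.5
smpl <- sample(population, 50)                   # a sample of n = 50
mean(smpl)        # x_bar, our estimate of mu
mean(population)  # mu itself (unknown to us in practice)
```

In practice only `mean(smpl)` would be available; the point of the later sections is to quantify how far it is likely to be from μ.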

Example

For example, the percentages of revenues spent on R&D by the population consisting of all U.S. companies has a mean equal to some value, μ.

Our sample of 50 companies yielded percentages with a mean of x ̄ = 8.492. If, as is usually the case, we don’t have access to the measurements for the entire population, we could use x ̄ as an estimator or approximator for μ.

Then we'd need to know something about the reliability of our inference; that is, we'd need to know how accurately we might expect x̄ to estimate μ.

In the next tutorial, we'll find that this accuracy depends on two factors:

1. The size of the sample. The larger the sample, the more accurate the estimate will tend to be.


2. The variability, or spread, of the data. All other factors remaining constant, the more variable the data, the less accurate the estimate.

2. Median

The median of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.

In certain situations, the median may be a better measure of central tendency than the mean. In particular, the median is less sensitive than the mean to extremely large or small measurements.

A data set is said to be skewed if one tail of the distribution has more extreme observations than the other.

3. Mode

The mode is the measurement that occurs most frequently in the data set.
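A quick R comparison of the three measures on a small hypothetical data set shows why the median is more robust: the outlier pulls the mean but not the median. R has no built-in mode function for data, so a small helper is defined here:

```r
# Mean vs. median vs. mode on hypothetical data with one outlier
x <- c(2, 3, 3, 4, 5, 100)
mean(x)    # pulled toward the outlier: 19.5
median(x)  # unaffected by the outlier: 3.5

# helper: most frequent value (not a base R function)
stat_mode <- function(v) as.numeric(names(which.max(table(v))))
stat_mode(x)  # 3
```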

1.1.2 Numerical Measures of Variability

Measures of central tendency provide only a partial description of a quantitative data set. The description is incomplete without a measure of the variability, or spread, of the data set.

Knowledge of the data’s variability along with its center can help us visualize the shape of a data set as well as its extreme values.

The sample variance for a sample of n measurements is equal to the sum of the squared deviations of the measurements from their mean, divided by (n − 1).

Formula

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

Note that the population variance is

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

The second step in finding a meaningful measure of data variability is to calculate the standard deviation of the data set.


The sample standard deviation, s, is defined as the positive square root of the sample variance.

Notice that, unlike the variance, the standard deviation is expressed in the original units of measurement. For example, if the original measurements are in dollars, the variance is expressed in the peculiar units "dollars squared", but the standard deviation is expressed in dollars.

You may wonder why we use the divisor (n − 1) instead of n when calculating the sample variance. Wouldn’t using n be more logical so that the sample variance would be the average squared deviation from the mean?

The trouble is that using n tends to produce an underestimate of the population variance so we use (n − 1) in the denominator to provide the appropriate correction for this tendency.
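R's `var()` already uses the (n − 1) divisor, as a small check on hypothetical data confirms:

```r
# Sample variance with the (n - 1) divisor, vs. the biased divide-by-n version
x <- c(4, 7, 6, 3, 5)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)  # 2.5, matches var(x)
var(x)                          # 2.5
sum((x - mean(x))^2) / n        # 2.0: dividing by n underestimates
```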

You now know that the standard deviation measures the variability of a set of data.

• The larger the standard deviation, the more variable the data.
• The smaller the standard deviation, the less variable the data.

1.1.3 Using the Mean and Standard Deviation to Describe Data

We’ve seen that if we are comparing the variability of two samples selected from a population, the sample with the larger standard deviation is the more variable of the two. Thus, we know how to interpret the standard deviation on a relative or comparative basis, but we haven’t explained how it provides a measure of variability for a single sample.

To understand how the standard deviation provides a measure of variability of a data set, consider a specific data set and answer the following questions:

• How many measurements are within 1 standard deviation of the mean?
• How many measurements are within 2 standard deviations?

The Empirical Rule is a rule of thumb that applies to data sets with frequency distributions that are mound-shaped and symmetric.

• Approximately 68% of the measurements will fall within 1 standard deviation of the mean

• Approximately 95% of the measurements will fall within 2 standard deviations of the mean

• Approximately 99.7% (essentially all) of the measurements will fall within 3 standard deviations of the mean
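These percentages can be checked on simulated mound-shaped data (hypothetical values drawn from a normal distribution):

```r
# Empirical Rule check on simulated mound-shaped data
set.seed(1)
x <- rnorm(10000, mean = 50, sd = 10)
mean(abs(x - mean(x)) <= 1 * sd(x))  # ~0.68
mean(abs(x - mean(x)) <= 2 * sd(x))  # ~0.95
mean(abs(x - mean(x)) <= 3 * sd(x))  # ~0.997
```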

Example

A manufacturer of automobile batteries claims that the average length of life for its grade A battery is 60 months. However, the guarantee on this brand is for just 36 months. Suppose the standard deviation of the life length is known to be 10 months, and the frequency distribution of the life-length data is known to be mound-shaped.

1. Approximately what percentage of the manufacturer’s grade A batteries will last more than 50 months, assuming the manufacturer’s claim is true?


2. Approximately what percentage of the manufacturer’s batteries will last less than 40 months, assuming the manufacturer’s claim is true?

3. Suppose your battery lasts 37 months. What could you infer about the manufacturer’s claim?

Answer

1. It is easy to see that the percentage of batteries lasting more than 50 months is approximately 34% (between 50 and 60 months) plus 50% (greater than 60 months). Thus, approximately 84% of the batteries should have life length exceeding 50 months.

2. The percentage of batteries that last less than 40 months can also be easily determined. Approximately 2.5% of the batteries should fail prior to 40 months, assuming the manufacturer's claim is true.

3. If you are so unfortunate that your grade A battery fails at 37 months, you can make one of two inferences: either your battery was one of the approximately 2.5% that fail prior to 40 months, or something about the manufacturer's claim is not true. Because the chances are so small that a battery fails before 40 months, you would have good reason to have serious doubts about the manufacturer's claim. A mean smaller than 60 months and/or a standard deviation larger than 10 months would both increase the likelihood of failure prior to 40 months.
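If we model the life length exactly as normal with μ = 60 and σ = 10 (an assumption consistent with the mound-shaped description above), pnorm gives essentially the same answers:

```r
# Exact normal probabilities for the battery example (mu = 60, sigma = 10)
pnorm(50, mean = 60, sd = 10, lower.tail = FALSE)  # P(life > 50) ~ 0.841
pnorm(40, mean = 60, sd = 10)                      # P(life < 40) ~ 0.023
```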

1.2 Numerical Measures of Relative Standing

Another measure of relative standing in popular use is the z-score. As you can see in the definition below, the z-score uses the mean and standard deviation of the data set to specify the relative location of a measurement. Note that the z-score is calculated by subtracting x̄ (or μ) from the measurement x and then dividing the result by s (or σ). The final result, the z-score, represents the distance between a given measurement x and the mean, expressed in standard deviations.

\[ z = \frac{x - \bar{x}}{s} \quad\text{(sample)} \qquad z = \frac{x - \mu}{\sigma} \quad\text{(population)} \]

where s is the sample standard deviation.

Example

A random sample of 2,000 students who sat for the Graduate Management Admission Test (GMAT) is selected.

For this sample, the mean GMAT score is x̄ = 540 points and the standard deviation is s = 100 points.

One student from the sample, Kara Smith, had a GMAT score of x = 440 points. What is Kara’s sample z-score?


(440-540)/100

## [1] -1

This z-score implies that Kara Smith's GMAT score is 1.0 standard deviations below the sample mean GMAT score; in short, her sample z-score is −1.0.

Interpretation of z-Scores for Mound-Shaped Distributions of Data

1. Approximately 68% of the measurements will have a z-score between -1 and 1.

2. Approximately 95% of the measurements will have a z-score between −2 and 2.

3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.

2 Random variables and probability distribution

A **random variable** is a variable that assumes numerical values associated with the random outcomes of an experiment.

2.1 Two types of random variables

Random variables that can assume a countable number (finite or infinite) of values are called discrete (for example, a count of heads in a series of coin tosses).

Random variables that can assume values corresponding to any of the points contained in one or more intervals are called continuous. The ten simulated values below are an example of continuous measurements:

## # A tibble: 10 x 1

## c1

## <dbl>

## 1 7.02

## 2 7.85

## 3 6.76

## 4 4.82

## 5 4.56

## 6 5.03

## 7 3.09

## 8 4.97

## 9 5.17

## 10 5.63
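The two types can be contrasted directly in R (simulated, hypothetical values): a discrete random variable takes only countable values, while a continuous one can land anywhere in an interval.

```r
# Discrete vs. continuous random variables (simulated values)
set.seed(7)
heads <- rbinom(10, size = 5, prob = 0.5)  # discrete: counts in {0, 1, ..., 5}
temps <- rnorm(10, mean = 5, sd = 1.5)     # continuous: any value in an interval
heads
temps
```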

3 Discrete Random Variables

3.1 Probability Distribution for Discrete Random Variables

A complete description of a discrete random variable requires that we specify the possible values the random variable can assume and the probability associated with each value.

The probability distribution of a discrete random variable is a graph, table, or formula that specifies the probability associated with each possible value of the random variable.

Imagine we toss two coins and let x be the number of heads. The random variable x can assume the values 0, 1, and 2. What is the probability of each value?

P(x = 0) = P(TT)
P(x = 1) = P(TH) + P(HT)
P(x = 2) = P(HH)

where T = Tail and H = Head.

TT <- 1/4
TH <- 1/4 + 1/4   # P(TH) + P(HT)
HH <- 1/4
print(paste("TT=", TT, ", TH=", TH, ", HH=", HH))
## [1] "TT= 0.25 , TH= 0.5 , HH= 0.25"

[Bar chart: probabilities (vertical axis, 0.0 to 0.5) of the toss outcomes HH, TH, TT (horizontal axis).]

The probabilities are shown as the heights of the bars (or vertical lines) over the corresponding values of x.

Mean of a discrete random variable

To get the population mean of the random variable x, we multiply each possible value of x by its probability p(x) and then sum this product over all possible values of x. The mean of x is also referred to as the expected value of x, denoted E(x).

\[ \mu = E(x) = \sum x\,p(x) \]

Example

Suppose you work for an insurance company, and you sell a $10,000 one-year term insurance policy at an annual premium of $290. Actuarial tables show that the probability of death during the next year for a person of your customer's age, sex, health, etc., is .001.

What is the expected gain (amount of money made by the company) for a policy of this type?


The company gains the $290 premium if the customer lives (probability .999) and loses 10,000 − 290 = $9,710 if the customer dies (probability .001):

290*.999 - 9710*.001
## [1] 280

The expected gain per policy is therefore $280.
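Equivalently, the expected-value formula μ = Σ x p(x) can be applied directly to the company's gain distribution (gain of $290 with probability .999, loss of $9,710 with probability .001):

```r
# Expected gain as a weighted sum over the gain distribution
gain <- c(290, -9710)   # gain if customer lives, net loss if customer dies
p    <- c(0.999, 0.001) # corresponding probabilities
sum(gain * p)           # expected gain: 280
```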

Variance of a discrete random variable

The population variance σ² is defined as the average of the squared distance of x from the population mean μ. Because x is a random variable, the squared distance, (x − μ)², is also a random variable.

\[ \sigma^2 = E\left[(x - \mu)^2\right] = \sum (x - \mu)^2\,p(x) \]

The standard deviation of x is defined as the square root of the variance σ².

Example

Suppose you invest a fixed sum of money in each of five Internet business ventures.

Assume you know that 70% of such ventures are successful, the outcomes of the ventures are independent of one another, and the probability distribution for the number, x, of successful ventures out of five is:

x      0     1     2     3     4     5
p(x)  .002  .029  .132  .309  .360  .168

1. Find μ
2. Find σ

Answers

0*.002 + 1*.029 + 2*.132 + 3*.309 + 4*.360 + 5*.168
## [1] 3.5

On average, the number of successful ventures out of five will equal 3.5

(0-3.5)^2*.002 + (1-3.5)^2*.029 + (2-3.5)^2*.132 + (3-3.5)^2*.309 + (4-3.5)^2*.360 + (5-3.5)^2*.168
## [1] 1.048

This is the variance σ²; the standard deviation is σ = √1.048 ≈ 1.02.
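The same calculation is less error-prone in vectorized form:

```r
# Vectorized mean and variance of the discrete distribution
x <- 0:5
p <- c(0.002, 0.029, 0.132, 0.309, 0.360, 0.168)
mu     <- sum(x * p)           # 3.5
sigma2 <- sum((x - mu)^2 * p)  # 1.048
sqrt(sigma2)                   # sigma ~ 1.02
```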

3.2 The Binomial Distribution

Many experiments result in dichotomous responses, i.e., responses for which there exist only two possible alternatives, such as Yes-No, Pass-Fail, Defective-Nondefective, or Male-Female.

Formula

\[ p(x) = \binom{n}{x} p^x q^{n-x}, \qquad \binom{n}{x} = \frac{n!}{x!\,(n-x)!} \]

with:

p = probability of success on a single trial
q = 1 − p
n = number of trials
x = number of successes in n trials
n − x = number of failures in n trials

par(mfrow=c(2, 5))
for(p in seq(0.1, 1, len=10)) {
  x <- dbinom(0:20, size=20, prob=p)
  barplot(x, names.arg=0:20, space=0)
}

[Ten barplots: Binomial(n = 20) probability distributions for p = 0.1, 0.2, ..., 1.0.]

A simple example of such an experiment is the coin-toss experiment. A coin is tossed a number of times, say 10. Each toss results in one of two outcomes, Head or Tail. Ultimately, we are interested in the probability distribution of x, the number of heads observed. Many other experiments are equivalent to tossing a coin (either balanced or unbalanced) a fixed number n of times and observing the number x of times that one of the two possible outcomes occurs.

Random variables that possess these characteristics are called binomial random variables.

Surveys frequently yield observations on binomial random variables. For example, suppose a sample of 100 current customers is selected from a firm's database and each person is asked whether he or she prefers the firm's product (a Head) or prefers a competitor's product (a Tail). Suppose we are interested in x, the number of customers in the sample who prefer the firm's product. Sampling 100 customers is analogous to tossing the coin 100 times.

Example

Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Since only one out of five possible answers is correct, the probability of answering a question correctly by random is 1/5=0.2. We can find the probability of having exactly 4 correct answers by random attempts as follows.

dbinom(4, size=12, prob=0.2)
## [1] 0.1328756

To find the probability of having four or fewer correct answers by random attempts, we apply the function dbinom with x = 0, ..., 4.

dbinom(0, size=12, prob=0.2) + dbinom(1, size=12, prob=0.2) +
  dbinom(2, size=12, prob=0.2) + dbinom(3, size=12, prob=0.2) +
  dbinom(4, size=12, prob=0.2)

## [1] 0.9274445

Alternatively, we can use the cumulative probability function for binomial distribution pbinom.

pbinom(4, size=12, prob=0.2)
## [1] 0.9274445

The probability of four or fewer questions answered correctly by random guessing in a twelve-question multiple choice quiz is 92.7%.
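A quick simulation (with a hypothetical seed) confirms the analytic answer: simulating many quizzes of twelve random guesses each and counting how often at most four are correct reproduces pbinom's value.

```r
# Simulation check of pbinom(4, 12, 0.2)
set.seed(99)
quizzes <- rbinom(100000, size = 12, prob = 0.2)  # correct answers per quiz
mean(quizzes <= 4)                                # close to 0.9274
```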

3.3 Other Discrete Distributions

3.3.1 Poisson

A type of discrete probability distribution that is often useful in describing the number of rare events that will occur in a specific period of time or in a specific area or volume is the Poisson distribution.

Formula

\[ p(x) = \frac{\lambda^x e^{-\lambda}}{x!} \]

with μ = λ = mean number of events during a given unit of time, area, or volume.

Example

1. The number of industrial accidents per month at a manufacturing plant
2. The number of noticeable surface defects (scratches, dents, etc.) found by quality inspectors on a new automobile
3. The parts per million of some toxin found in the water or air emission from a manufacturing plant
4. The number of customer arrivals per unit of time at a supermarket checkout counter

If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute.

The probability of having sixteen or less cars crossing the bridge in a particular minute is given by the function ppois.

ppois(16, lambda=12)   # lower tail
## [1] 0.898709

Hence the probability of having seventeen or more cars crossing the bridge in a minute is in the upper tail of the distribution.

ppois(16, lambda=12, lower=FALSE)   # upper tail
## [1] 0.101291

If there are twelve cars crossing a bridge per minute on average, the probability of having seventeen or more cars crossing the bridge in a particular minute is 10.1%.
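The same upper-tail probability can be obtained as the complement of the lower tail, or by summing the Poisson mass function dpois directly (truncating the infinite sum at a point where the remaining mass is negligible):

```r
# Equivalent ways to get P(X >= 17) for X ~ Poisson(12)
1 - ppois(16, lambda = 12)       # complement of the lower tail
sum(dpois(17:100, lambda = 12))  # summing the mass function (truncated at 100)
```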


4 Continuous Random Variables

4.1 Probability Distributions for Continuous Random Variables

The graphical form of the probability distribution for a continuous random variable x is a smooth curve. This curve, a function of x, is denoted by the symbol f(x) and is variously called a probability density function (pdf), a frequency function, or a probability distribution.

Areas under this curve correspond to probabilities for x: the probability that x falls between two values a and b is the area under the curve between a and b.

4.2 The Normal Distribution

One of the most commonly observed continuous random variables has a bell-shaped probability distribution (or bell curve).

Formula

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

We can transform a normal distribution into the standard normal distribution, with mean μ = 0 and standard deviation σ = 1:

\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} \]

The normal distribution plays a very important role in the science of statistical inference. Moreover, many business phenomena generate random variables with probability distributions that are very well approximated by a normal distribution. For example, the monthly rate of return for a particular stock is approximately a normal random variable, and the probability distribution for the weekly sales of a corporation might be approximated by a normal probability distribution. The normal distribution might also provide an accurate model for the distribution of scores on an employment aptitude test.

Example

Assume that the test scores of a college entrance exam fits a normal distribution. Furthermore, the mean test score is 72, and the standard deviation is 15.2.

• What is the percentage of students scoring 84 or more in the exam?

We apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. Since we are looking for the percentage of students scoring higher than 84, we are interested in the upper tail of the normal distribution.

pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
## [1] 0.2149176

The percentage of students scoring 84 or more in the college entrance exam is 21.5%.

Converting a Normal Distribution to a Standard Normal Distribution

If x is a normal random variable with mean μ and standard deviation σ, then the random variable z, defined by the formula

\[ z = \frac{x - \mu}{\sigma} \]

has a standard normal distribution.

Example


Suppose an automobile manufacturer introduces a new model that has an advertised mean in-city mileage of 27 miles per gallon. Although such advertisements seldom report any measure of variability, suppose you write the manufacturer for the details of the tests, and you find that the standard deviation is 3 miles per gallon.

This information leads you to formulate a probability model for the random variable x, the in-city mileage for this car model. You believe that the probability distribution of x can be approximated by a normal distribution with a mean of 27 and a standard deviation of 3.

(20-27)/3
## [1] -2.333333

pnorm(-2.333)
## [1] 0.009824073

According to this probability model, you should have only about a 1% chance of purchasing a car of this make with an in-city mileage under 20 miles per gallon.
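The hand-standardization step can also be skipped by giving pnorm the mean and standard deviation directly:

```r
# Same probability without standardizing by hand
pnorm(20, mean = 27, sd = 3)  # ~0.0098
```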

5 Estimation with confidence intervals

In this tutorial, our goal is to estimate the value of an unknown population parameter, such as a population mean or a proportion from a binomial population. For example, we might want to know the mean gas mileage for a new car model, the average expected life of a flat-screen computer monitor, or the proportion of dot-com companies that fail within a year of start-up.

We want to use the sample information to estimate the population parameter of interest (called the target parameter) and assess the reliability of the estimate.

The unknown population parameter (e.g., mean or proportion) that we are interested in estimating is called the target parameter.

For the examples given above, the words mean in mean gas mileage and average in average life expectancy imply that the target parameter is the population mean, μ. The word proportion in proportion of dot-com companies that fail within one year of start-up indicates that the target parameter is the binomial proportion, p.

In addition to key words and phrases, the type of data (quantitative or qualitative) collected is indicative of the target parameter.

• With quantitative data, you are likely to be estimating the mean or variance of the data.

• With qualitative data with two outcomes (success or failure), the binomial proportion of successes is likely to be the parameter of interest.

A single number calculated from the sample that estimates a target population parameter is called a point estimator.

For example, we’ll use the sample mean, x ̄, to estimate the population mean μ. Consequently, x ̄ is a point estimator.

Similarly, we'll learn that the sample proportion of successes, denoted p̂, is a point estimator for the binomial proportion p, and that the sample variance s² is a point estimator for the population variance σ².

We will attach a measure of reliability to our estimate by obtaining an interval estimator: a range of numbers that contains the target parameter with a high degree of confidence.

For this reason the interval estimate is also called a confidence interval.

A point estimator of a population parameter is a rule or formula that tells us how to use the sample data to calculate a single number that can be used as an estimate of the parameter.

An interval estimator (or confidence interval) is a formula that tells us how to use the sample data to calculate an interval that estimates the parameter.

5.1 Confidence Interval for a Population Mean: Normal (z) Statistic (known Variance)

According to the Central Limit Theorem, the sampling distribution of the sample mean is approximately normal for large samples. The formula is

\[ \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \]

That is, we form an interval from 1.96 standard deviations below the sample mean to 1.96 standard deviations above the mean.
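The meaning of "95% confidence" can be checked by simulation (hypothetical population): across many repeated samples, roughly 95% of intervals constructed this way cover the true mean μ.

```r
# Coverage of the 95% z-interval over repeated samples (simulated)
set.seed(2023)
mu <- 100; sigma <- 15; n <- 50
covered <- replicate(10000, {
  x_bar <- mean(rnorm(n, mu, sigma))
  half  <- 1.96 * sigma / sqrt(n)        # half-width of the interval
  (x_bar - half <= mu) && (mu <= x_bar + half)
})
mean(covered)  # close to 0.95
```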

Example

Consider a large bank that wants to estimate the average amount of money owed by its delinquent debtors, μ. The bank randomly samples n = 100 of its delinquent accounts and finds that the sample mean amount owed is x̄ = 230. Also, suppose it is known that the standard deviation of the amount owed for all delinquent accounts is σ = 90.

Calculate a confidence interval for the target parameter, μ.

230 + (1.96*90)/sqrt(100)
## [1] 247.64

230 - (1.96*90)/sqrt(100)
## [1] 212.36

The confidence coefficient is the probability that a randomly selected confidence interval encloses the population parameter; that is, the relative frequency with which similarly constructed intervals enclose the population parameter when the estimator is used repeatedly a very large number of times. The confidence level is the confidence coefficient expressed as a percentage.

Note: Empirical research suggests that a sample size n exceeding a value between 20 and 30 will usually yield a sampling distribution of x̄ that is approximately normal. This result led many practitioners to adopt the rule of thumb that a sample size of n > 30 is required to use large-sample confidence interval procedures.

Summary


Usual values of z

• z99 = 2.576
• z95 = 1.960
• z90 = 1.645

Large-Sample Confidence Interval for μ, Based on a Normal (z) Statistic

When σ is known:

\[ \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]

When σ is unknown:

\[ \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

The sample size n is large (i.e., n > 30). Due to the Central Limit Theorem, this condition guarantees that the sampling distribution of x̄ is approximately normal. (Also, for large n, s will be a good estimator of σ.)

Alternative

We can use z.test from the library TeachingDemos.

set.seed(123)
x <- rnorm(500)
n <- 500
x_bar <- mean(x)
sd <- sd(x)
left <- sd/sqrt(n)
E <- qnorm(.975)*left

x_bar + E
## [1] 0.1198559

x_bar - E
## [1] -0.05067498

library(TeachingDemos)
z.test(x, sd = sd(x))

##
## One Sample z-test
##
## data: x
## z = 0.79512, n = 500.000000, Std. Dev. = 0.972770, Std. Dev. of
## the sample mean = 0.043504, p-value = 0.4265
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.05067498 0.11985588
## sample estimates:
## mean of x
## 0.03459045


5.2 Confidence Interval for a Population Mean: Student’s t-Statistic (Unknown Variance)

Suppose a pharmaceutical company must estimate the average increase in blood pressure of patients who take a certain new drug. Assume that only six patients (randomly selected from the population of all patients) can be used in the initial phase of human testing. The use of a small sample in making an inference about μ presents two immediate problems when we attempt to use the standard normal z as a test statistic.

2 problems

1. The shape of the sampling distribution of the sample mean x ̄ (and the z-statistic) now depends on the shape of the population that is sampled. We can no longer assume that the sampling distribution of x ̄ is approximately normal because the Central Limit Theorem ensures normality only for samples that are sufficiently large.

2. The population standard deviation σ is almost always unknown.

As a solution, we can use the t-distribution.

Formula

\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]

in which the sample standard deviation, s, replaces the population standard deviation, σ.

If we are sampling from a normal distribution, the t-statistic has a sampling distribution very much like that of the z-statistic: mound-shaped, symmetric, with mean 0. The primary difference between the sampling distributions of t and z is that the t-statistic is more variable than the z, which follows intuitively when you realize that t contains two random quantities (x̄ and s), whereas z contains only one (x̄).

The actual amount of variability in the sampling distribution of t depends on the sample size n.
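The extra variability shows up in the critical values: for small n the t quantile is noticeably larger than the z quantile of 1.96, and it approaches the z value as the degrees of freedom grow.

```r
# t critical values shrink toward the z value as df increases
qnorm(0.975)                    # 1.96
qt(0.975, df = c(5, 30, 1000))  # 2.571, 2.042, 1.962
```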

Example

Consider the pharmaceutical company that desires an estimate of the mean increase in blood pressure of patients who take a new drug.

The blood pressure increases for the n = 6 patients in the human testing phase. Use this information to construct a 95% confidence interval for μ, the mean increase in blood pressure associated with the new drug for all patients in the population.

The average increase is 2.283 and the standard deviation is .950.

First, note that we are dealing with a sample too small to assume that the sample mean x is approximately normally distributed by the Central Limit Theorem

Second, the variance is unknown.

To compute the CI with Student's t-statistic, we compute

\[ \bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

The value of t_{α/2} = t_{.025} in the Student's t table is 2.571, with n − 1 = 6 − 1 = 5 degrees of freedom.

2.283 + 2.571*(0.950/sqrt(6))
## [1] 3.280126

2.283 - 2.571*(0.950/sqrt(6))
## [1] 1.285874
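With the six raw measurements available, t.test produces the interval in one call. The values below are hypothetical, chosen only so that they reproduce the reported mean (2.283) and standard deviation (≈ .950):

```r
# Hypothetical blood-pressure increases with mean 2.283 and sd ~0.95
bp <- c(1.7, 3.0, 0.8, 3.4, 2.7, 2.1)
mean(bp)              # ~2.283
sd(bp)                # ~0.950
t.test(bp)$conf.int   # ~ (1.29, 3.28)
```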


Small-Sample Confidence Interval for μ, Student's t-Statistic

When σ is unknown:

\[ \bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}} \]

When σ is known:

\[ \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]

Alternative

We can use t.test from the built-in stats package.

set.seed(1234)
x <- rnorm(100)
t.test(x)

##
## One Sample t-test
##
## data: x
## t = -1.5607, df = 99, p-value = 0.1218
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.35605755 0.04253406
## sample estimates:
## mean of x
## -0.1567617

5.3 Sampling Size of Population Mean

The quality of a sample survey can be improved by increasing the sample size. The formula below gives the sample size needed for a population-mean interval estimate at the (1 − α) confidence level, with margin of error E and population variance σ². Here, z_{α/2} is the 100(1 − α/2)th percentile of the standard normal distribution.

\[ n = \frac{z_{\alpha/2}^2\,\sigma^2}{E^2} \]

Example

Assume the population standard deviation σ of the student height in the survey is 9.48. Find the sample size needed to achieve a 1.2-centimeter margin of error at the 95% confidence level.

Since there are two tails of the normal distribution, the 95% confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, z_{α/2} is given by qnorm(.975).

zstar = qnorm(.975)
sigma = 9.48
E = 1.2
zstar^2*sigma^2/E^2
## [1] 239.7454


Based on the assumption of population standard deviation being 9.48, it needs a sample size of 240 to achieve a 1.2 centimeters margin of error at 95% confidence level.

5.4 Large-Sample Confidence Interval for a Population Proportion

In this part, we are interested in estimating the percentage (or proportion) of some group with a certain characteristic. We consider methods for making inferences about population proportions when the sample is large.

The formula to compute a large-sample CI is

\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

with p̂ = x/n.

Example

Many public polling agencies conduct surveys to determine the current consumer sentiment concerning the state of the economy. For example, the Bureau of Economic and Business Research (BEBR) at the University of Florida conducts quarterly surveys to gauge consumer sentiment in the Sunshine State.

Suppose that the BEBR randomly samples 484 consumers and finds that 157 are optimistic about the state of the economy. Use a 90% confidence interval to estimate the proportion of all consumers in Florida who are optimistic about the state of the economy.

We know n = 484 consumers were sampled and x = 157 were optimistic, so p̂ = 157/484 ≈ .324. The z statistic at 90% is 1.645.

p_hat <- 157/484
p_hat + 1.645*sqrt((.324*.676)/484)
## [1] 0.3593738

p_hat - 1.645*sqrt((.324*.676)/484)
## [1] 0.2893865

Alternative

Compute the margin of error and estimate the interval for the proportion of female students in the survey dataset at the 95% confidence level.

library(MASS)
gender.response = na.omit(survey$Sex)
n = length(gender.response)   # count of valid responses
k = sum(gender.response == "Female")
pbar = k/n
SE = sqrt(pbar*(1-pbar)/n)
E = qnorm(.975)*SE
pbar + c(-E, E)
## [1] 0.4362086 0.5637914

prop.test(k, n)

##
## 1-sample proportions test without continuity correction
##
## data: k out of n, null probability 0.5


## X-squared = 0, df = 1, p-value = 1

## alternative hypothesis: true p is not equal to 0.5

## 95 percent confidence interval:

## 0.4367215 0.5632785

## sample estimates:

## p

## 0.5

At 95% confidence level, between 43.6% and 56.3% of the university students are female, and the margin of error is 6.4%.

5.5 Sampling Size of Population Proportion

The quality of a sample survey can be improved by increasing the sample size. The formula below provides the sample size needed for a population proportion interval estimate at the (1 − α) confidence level, with margin of error E and planned proportion estimate p. Here, zα/2 is the 100(1 − α/2)th percentile of the standard normal distribution.

n = (zα/2)² p(1 − p) / E²

Example

Using a 50% planned proportion estimate, find the sample size needed to achieve 5% margin of error for the female student survey at 95% confidence level.

Since there are two tails of the normal distribution, the 95% confidence level would imply the 97.5th percentile of the normal distribution at the upper tail. Therefore, zα/2 is given by qnorm(.975).

zstar = qnorm(.975)
p = 0.5
E = 0.05
zstar^2 * p * (1-p) / E^2

## [1] 384.1459

With a planned proportion estimate of 50% at the 95% confidence level, a sample size of 385 is needed to achieve a 5% margin of error for the survey of the female student proportion.

6 Hypothesis testing

Suppose you wanted to determine whether the mean waiting time in the drive-through line of a fast-food restaurant is less than 5 minutes, or whether the majority of consumers are optimistic about the economy.

In both cases you are interested in making an inference about how the value of a parameter relates to a specific numerical value: is it less than, equal to, or greater than the specified number? This type of inference is called a test of hypothesis.

6.1 The elements of a test of hypothesis

Suppose building specifications in a certain city require that the average breaking strength of residential sewer pipe be more than 2,400 pounds per foot of length (i.e., per linear foot). Each manufacturer who wants to sell pipe in this city must demonstrate that its product meets the specification.

Note that we are interested in making an inference about the mean μ of a population. However, in this example we are less interested in estimating the value of μ than in testing a hypothesis about it. That is, we want to decide whether the mean breaking strength of the pipe exceeds 2,400 pounds per linear foot.

A statistical hypothesis is a statement about the numerical value of a population parameter.

We define two hypotheses:

The null hypothesis, denoted H0, represents the hypothesis that will be accepted unless the data provide convincing evidence that it is false. The alternative (research) hypothesis, denoted Ha, represents the hypothesis that will be accepted only if the data provide convincing evidence of its truth.

In our example:

• Null hypothesis (H0): μ ≤ 2,400 (i.e., the manufacturer's pipe does not meet specifications)

• Alternative (research) hypothesis (Ha): μ > 2,400 (i.e., the manufacturer's pipe meets specifications)

How can the city decide when enough evidence exists to conclude that the manufacturer’s pipe meets specifications?

Because the hypotheses concern the value of the population mean μ, it is reasonable to use the sample mean x̄ to make the inference, just as we did when forming confidence intervals for μ.

The city will conclude that the pipe meets specifications only when the sample mean x̄ convincingly indicates that the population mean exceeds 2,400 pounds per linear foot.

To decide, we compute a test statistic, i.e., a numerical value computed from the sample. Here, the test statistic is the z-value that measures the distance (in units of the standard deviation) between the value of x̄ and the value of μ specified in the null hypothesis.

The idea is that if the hypothesis that μ equals 2,400 can be rejected in favor of μ > 2,400, then μ less than or equal to 2,400 can certainly be rejected. Thus, the test statistic is:

z = (x̄ − 2400) / (σ/√n)

Note that a value of z = 1 means that x̄ is 1 standard deviation above μ = 2,400; a value of z = 1.5 means that x̄ is 1.5 standard deviations above μ = 2,400; and so on.
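As a quick numerical check (a sketch, approximating σ by the sample standard deviation s = 218.0079 that is reported below for the 50 pipe sections):

```r
# z statistic for H0: mu = 2400 against Ha: mu > 2400
xbar <- 2448.972
s <- 218.0079
n <- 50
(xbar - 2400) / (s / sqrt(n))  # 1.5884, matching the t statistic printed by t.test below
```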

The test statistic is a sample statistic, computed from information provided in the sample, that the researcher uses to decide between the null and alternative hypotheses.


To illustrate the use of the test, suppose we test 50 sections of sewer pipe and find the mean and standard deviation for these 50 measurements to be:

df <- tibble(
  x = rnorm(50, 2460, 200)
)
mean(df$x)

## [1] 2448.972

sd(df$x)

## [1] 218.0079

t.test(x =df$x , mu=2400, alternative = "greater")

##

## One Sample t-test

##

## data: df$x

## t = 1.5884, df = 49, p-value = 0.05931

## alternative hypothesis: true mean is greater than 2400

## 95 percent confidence interval:

## 2397.282 Inf

## sample estimates:

## mean of x

## 2448.972

We need to read the p-value. If the p-value is lower than 0.05, we can reject H0. In our case (p-value = 0.05931), we cannot reject H0: the sample does not provide sufficient evidence that the true mean is greater than 2,400.

7 Inference about two populations

Many experiments involve a comparison of two populations. For instance:

• A real estate company may want to estimate the difference in mean sales price between city and suburban homes.

• A consumer group might test whether two major brands of food freezers differ in the average amount of electricity they use.

• A television market researcher wants to estimate the difference in the proportions of younger and older viewers who regularly watch a popular TV program.

The same procedures that are used to estimate and test hypotheses about a single population can be modified to make inferences about two populations.


Determining the Target Parameter

Parameter   Key words                                                                               Type of Data
μ1 − μ2     Mean difference; difference in averages                                                 Quantitative
p1 − p2     Difference between proportions, percentages, fractions, or rates; compare proportions   Qualitative
σ1²/σ2²     Ratio of variances; difference in variability or spread; compare variation              Quantitative


7.1 Population Mean Between Two Matched Samples

Two data samples are matched if they come from repeated observations of the same subject. Here, we assume that the data populations follow the normal distribution. Using the paired t-test, we can obtain an interval estimate of the difference of the population means.

Paired samples: The sample selected from the first population is related to the corresponding sample from the second population.

It is important to distinguish independent samples from paired samples. Some examples are given as follows, comparing the time that males and females spend watching TV.

Example

• We randomly select 20 males and 20 females and compare the average time they spend watching TV. Is this an independent sample or a paired sample?

– Independent

• We randomly select 20 couples and compare the time the husbands and wives spend watching TV. Is this an independent sample or a paired sample?

– Paired

Example: Drinking Water

Trace metals in drinking water affect the flavor, and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water.

df <- tribble(
  ~bottom, ~surface,
  .430, .415,
  .266, .238,
  .567, .410,
  .531, .605,
  .707, .609,
  .716, .632,
  .651, .523,
  .589, .411,
  .469, .612
)
head(df)

## # A tibble: 6 x 2

## bottom surface

## <dbl> <dbl>

## 1 0.430 0.415

## 2 0.266 0.238

## 3 0.567 0.410

## 4 0.531 0.605

## 5 0.707 0.609

## 6 0.716 0.632

Does the data suggest that the true average concentration in the bottom water exceeds that of surface water?

t.test(df$bottom, df$surface, paired=TRUE)

##

## Paired t-test

##

## data: df$bottom and df$surface


## t = 1.4667, df = 8, p-value = 0.1806

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## -0.02994557 0.13461224

## sample estimates:

## mean of the differences

## 0.05233333

Since the two-sided p-value (0.1806) is greater than .05, the data do not provide sufficient evidence that the true average zinc concentration in bottom water exceeds that of surface water.
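As a sketch of what t.test does internally, the same statistic can be computed from the paired differences (assuming df holds the pairs entered above):

```r
d <- df$bottom - df$surface
mean(d)                              # 0.05233333, the mean of the differences
mean(d) / (sd(d) / sqrt(length(d)))  # 1.4667, the paired t statistic
```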

7.2 Comparing Two Population Means: Independent Sampling

In this section we develop both large-sample and small-sample methodologies for comparing two population means.

• In the large-sample case we use the z-statistic.

• In the small-sample case we use the t-statistic.

Population Mean Between Two Independent Samples

Two data samples are independent if they come from unrelated populations and the samples do not affect each other. Here, we assume that the data populations follow the normal distribution.

Using the unpaired t-test, we can obtain an interval estimate of the difference between two population means.

Example

In the data frame column mpg of the data set mtcars, there are gas mileage data of various 1974 U.S. automobiles.

head(mtcars$mpg)

## [1] 21.0 21.0 22.8 21.4 18.7 18.1

Meanwhile, another data column in mtcars, named am, indicates the transmission type of the automobile model (0 = automatic, 1 = manual).

head(mtcars$am)

## [1] 1 1 1 0 0 0

In particular, the gas mileage for manual and automatic transmissions are two independent data populations.

Assuming that the data in mtcars follows the normal distribution, find the 95% confidence interval estimate of the difference between the mean gas mileage of manual and automatic transmissions.

We first split mtcars by transmission type, then apply the t.test function to compute the difference in means of the two sample data.

am_1 <- subset(mtcars, am == 1) # manual
am_0 <- subset(mtcars, am == 0) # automatic

t.test(am_1$mpg, am_0$mpg)

##

## Welch Two Sample t-test

##

## data: am_1$mpg and am_0$mpg

## t = 3.7671, df = 18.332, p-value = 0.001374

## alternative hypothesis: true difference in means is not equal to 0

## 95 percent confidence interval:

## 3.209684 11.280194

## sample estimates:

## mean of x mean of y

## 24.39231 17.14737


In mtcars, the mean mileage of automatic transmission is 17.147 mpg and the manual transmission is 24.392 mpg. The 95% confidence interval of the difference in mean gas mileage is between 3.2097 and 11.2802 mpg.

7.3 Comparison of Two Population Proportions

A survey conducted in two distinct populations will produce different results. It is often necessary to compare the survey response proportion between the two populations. Here, we assume that the data populations follow the normal distribution.

Example

Children from an Australian town are classified by ethnic background, gender, age, learning status and the number of days absent from school.

library(MASS)

##

## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':

##

## select

head(quine)

## Eth Sex Age Lrn Days

## 1 A M F0 SL 2

## 2 A M F0 SL 11

## 3 A M F0 SL 14

## 4 A M F0 AL 5

## 5 A M F0 AL 5

## 6 A M F0 AL 13

In effect, the data frame column Eth indicates whether the student is Aboriginal or Not (“A” or “N”), and the column Sex indicates Male or Female (“M” or “F”).

In R, we can tally the student ethnicity against the gender with the table function. As the result shows, within the Aboriginal student population, 38 students are female. Whereas within the Non-Aboriginal student population, 42 are female.

table(quine$Eth, quine$Sex)

##

##    F  M

## A 38 31

## N 42 35

Assuming that the data in quine follows the normal distribution, find the 95% confidence interval estimate of the difference between the female proportion of Aboriginal students and the female proportion of Non-Aboriginal students, each within their own ethnic group.

We apply the prop.test function to compute the difference in female proportions. The Yates’s continuity correction is disabled for pedagogical reasons.

prop.test(table(quine$Eth, quine$Sex), correct=FALSE)

##

## 2-sample test for equality of proportions without continuity

## correction


##

## data: table(quine$Eth, quine$Sex)

## X-squared = 0.0040803, df = 1, p-value = 0.9491

## alternative hypothesis: two.sided

## 95 percent confidence interval:

## -0.1564218 0.1669620

## sample estimates:

## prop 1 prop 2

## 0.5507246 0.5454545

The 95% confidence interval estimate of the difference between the female proportion of Aboriginal students and the female proportion of Non-Aboriginal students is between -15.6% and 16.7%.
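The interval can also be checked by hand with the large-sample formula for the difference of two proportions (a sketch; the counts 38 of 69 and 42 of 77 come from the table above):

```r
# Female proportion within each ethnic group
p1 <- 38/69  # Aboriginal
p2 <- 42/77  # Non-Aboriginal
se <- sqrt(p1*(1-p1)/69 + p2*(1-p2)/77)
(p1 - p2) + c(-1, 1) * qnorm(.975) * se  # about -0.156 to 0.167
```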

8 Chi-squared Test of Independence

Many statistical quantities derived from data samples are found to follow the Chi-squared distribution. Hence we can use it to test whether a population fits a particular theoretical probability distribution.

8.1 Testing Category Probabilities: One-Way Table

In this section, we consider a multinomial experiment with k outcomes that correspond to categories of a single qualitative variable. The results of such an experiment are summarized in a one-way table. The term one-way is used because only one variable is classified. Typically, we want to make inferences about the true proportions that occur in the k categories based on the sample information in the one-way table.

A population is called multinomial if its data is categorical and belongs to a collection of discrete non-overlapping classes. Qualitative data that fall in more than two categories often result from a multinomial experiment. The characteristics of a multinomial experiment with k outcomes are described in the box.

Properties of the Multinomial Experiment

1. The experiment consists of n identical trials.

2. There are k possible outcomes to each trial. These outcomes are called classes, categories, or cells.

3. The probabilities of the k outcomes, denoted by p1, p2, ..., pk, remain the same from trial to trial, where p1 + p2 + ... + pk = 1.

4. The trials are independent.

5. The random variables of interest are the cell counts n1, n2, ..., nk, i.e., the number of observations that fall into each of the k categories.

The chi-square goodness of fit test is used to compare an observed distribution to an expected distribution in situations where we have two or more categories of discrete data. In other words, it compares multiple observed proportions to expected probabilities.

The null hypothesis for the goodness of fit test for a multinomial distribution is that the observed frequency fi is equal to the expected count ei in each category. It is to be rejected if the p-value of the following Chi-squared test statistic is less than a given significance level α.

Formula

χ² = Σ (fi − ei)² / ei

Example

To illustrate, suppose a large supermarket chain conducts a consumer-preference survey by recording the brand of bread purchased by customers in its stores. Assume the chain carries three brands of bread: two


major brands (A and B) and its own store brand. The brand preferences of a random sample of 150 consumers are observed, and the number preferring each brand is tabulated

Brand         n
A             61
B             53
Store brand   36

ncount <- c(61, 53, 36)
sum(ncount)

## [1] 150

Note that our consumer-preference survey satisfies the properties of a multinomial experiment for the qualitative variable brand of bread.

The experiment consists of randomly sampling n = 150 buyers from a large population of consumers containing an unknown proportion p1 who prefer brand A, a proportion p2 who prefer brand B, and a proportion p3 who prefer the store brand.

• H0: the brands of bread are equally preferred

• H1: At least one brand is preferred

res <- chisq.test(ncount)
res

##

## Chi-squared test for given probabilities

##

## data: ncount

## X-squared = 6.52, df = 2, p-value = 0.03839

Since the computed χ² = 6.52 exceeds the critical value of 5.99147, we conclude at the α = .05 level of significance that a consumer preference exists for one or more of the brands of bread.
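The statistic and the critical value quoted above can be reproduced directly from the formula:

```r
ncount <- c(61, 53, 36)
expected <- rep(sum(ncount)/3, 3)      # 50 consumers per brand under H0
sum((ncount - expected)^2 / expected)  # 6.52
qchisq(.95, df = 2)                    # critical value 5.991465
```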

Example

For example, we collected wild tulips and found that 81 were red, 50 were yellow and 27 were white.

1. Are these colors equally common?

If these colors were equally distributed, the expected proportion would be 1/3 for each of the colors.

2. Suppose that, in the region where you collected the data, the ratio of red, yellow and white tulip is 3:2:1 (3+2+1 = 6). This means that the expected proportion is:

• 3/6 (= 1/2) for red

• 2/6 ( = 1/3) for yellow

• 1/6 for white

We want to know if there is any significant difference between the observed proportions and the expected proportions.

Statistical hypothesis

• Null hypothesis (H0): There is no significant difference between the observed and the expected value.

• Alternative hypothesis (H1): There is a significant difference between the observed and the expected value.

Answer

1.

tulip <- c(81, 50, 27)
sum(tulip)

## [1] 158

res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res

##
## Chi-squared test for given probabilities
##
## data: tulip
## X-squared = 27.886, df = 2, p-value = 8.803e-07

The p-value of the test is 8.803e-07, which is less than the significance level alpha = 0.05. We can conclude that the colors are not equally common.

# Access to the expected values
res$expected

## [1] 52.66667 52.66667 52.66667

2.

res <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res

##
## Chi-squared test for given probabilities
##
## data: tulip
## X-squared = 0.20253, df = 2, p-value = 0.9037

The p-value of the test is 0.9037, which is greater than the significance level alpha = 0.05. We can conclude that the observed proportions are not significantly different from the expected proportions.
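The second statistic can also be checked against the formula directly:

```r
tulip <- c(81, 50, 27)
expected <- sum(tulip) * c(1/2, 1/3, 1/6)  # 79, 52.67, 26.33
sum((tulip - expected)^2 / expected)       # 0.20253
```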

8.2 Testing Category Probabilities: Two-Way (Contingency) Table

In previous section, we introduced the multinomial probability distribution and considered data classified according to a single qualitative criterion. We now consider multinomial experiments in which the data are classified according to two criteria. It means, classification with respect to two qualitative factors.

For example, consider a study published in the Journal of Marketing on the impact of using celebrities in television advertisements. The researchers investigated the relationship between the gender of a viewer and the viewer's brand awareness. Three hundred TV viewers were asked to identify products advertised by male celebrity spokespersons.


Awareness     Male   Female   Total
identidy        95       41     136
no_identify     50      114     164
Total          145      155     300

df <- tribble(

~awareness, ~gender,~count,


"identidy", "M", 95,

"no_identify", "M",50,

"identidy","F",41,

"no_identify","F",114

)

df <- df %>% spread(gender, count)
df

## # A tibble: 2 x 3

## awareness F M

## * <chr> <dbl> <dbl>

## 1 identidy 41.0 95.0

## 2 no_identify 114 50.0

Suppose we want to know whether the two classifications, gender and brand awareness, are dependent. If we know the gender of the TV viewer, does that information give us a clue about the viewer's brand awareness?

chisq <- chisq.test(as.matrix(df[,2:3]))
chisq

##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: as.matrix(df[, 2:3])
## X-squared = 44.572, df = 1, p-value = 2.452e-11

Large values of χ² imply that the observed counts do not closely agree with the counts expected under independence; hence, the hypothesis of independence is rejected. Here the p-value is far below .05, so gender and brand awareness are dependent.

Example 2

A large brokerage firm wants to determine whether the service it provides to affluent clients differs from the service it provides to lower-income clients. A sample of 500 clients is selected, and each client is asked to rate his or her broker.

Broker rating   <30.000   30.000-60.000   >60.000   Total
Outstanding          48              64        41     153
Average              98             120        50     268
Poor                 30              33        16      79
Total               176             217       107     500

df <- tribble(

~Broker, ~income, ~count,
"Oustanding", "<30", 48,
"Avegare", "<30", 98,
"Poor", "<30", 30,
"Oustanding", "30-60", 64,
"Avegare", "30-60", 120,
"Poor", "30-60", 33,
"Oustanding", ">60", 41,
"Avegare", ">60", 50,
"Poor", ">60", 16

)

df <- df %>% spread(income, count)

df

## # A tibble: 3 x 4
##   Broker     `30-60` `<30` `>60`
## * <chr>        <dbl> <dbl> <dbl>
## 1 Avegare      120    98.0  50.0
## 2 Oustanding    64.0  48.0  41.0
## 3 Poor          33.0  30.0  16.0

chisq <- chisq.test(as.matrix(df[,2:4]))
chisq

##

## Pearson's Chi-squared test

##

## data: as.matrix(df[, 2:4])

## X-squared = 4.2777, df = 4, p-value = 0.3697

1. Determine whether there is evidence that broker rating and customer income are dependent. The null and alternative hypotheses we want to test are:

• H0: The rating a client gives his or her broker is independent of the client's income.

• H1: Broker rating and client income are dependent.

Since the p-value (0.3697) is greater than .05, we cannot reject H0. This survey does not support the firm's alternative hypothesis that affluent clients receive different broker service than lower-income clients.

8.3 Chi-squared Test of Independence

The chi-square test of independence is used to analyze the frequency table (i.e., contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.

file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)

##            Wife Alternating Husband Jointly
## Laundry     156          14       2       4
## Main_meal   124          20       5       4
## Dinner       77          11       7      13
## Breakfeast   82          36      15       7
## Tidying      53          11       1      57
## Dishes       32          24       4      53

library("gplots")

##

## Attaching package: 'gplots'

## The following object is masked from 'package:stats':

##

## lowess

# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))

# 2. Graph


balloonplot(t(dt), main ="housetasks", xlab ="", ylab="", label = FALSE, show.margins = FALSE)

[Balloon plot titled "housetasks": one row per task (Laundry, Main_meal, Dinner, Breakfeast, Tidying, Dishes, Shopping, Official, Driving, Finances, Insurance, Repairs, Holidays) and one column per doer (Wife, Alternating, Husband, Jointly); dot size is proportional to the cell count.]

Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.

• Null hypothesis (H0): the row and the column variables of the contingency table are independent.

• Alternative hypothesis (H1): row and column variables are dependent.

chisq <- chisq.test(housetasks)
chisq

##
## Pearson's Chi-squared test
##
## data: housetasks
## X-squared = 1944.5, df = 36, p-value < 2.2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2.2e-16).

8.3.1 Nature of the dependence between the row and the column variables

If you want to know the cells contributing most to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell: the Pearson residuals (r) for each cell.

library(corrplot)

## corrplot 0.84 loaded

# Contribution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)

##              Wife Alternating Husband Jointly
## Laundry     7.738       0.272   1.777   2.246
## Main_meal   4.976       0.012   1.243   1.903
## Dinner      2.197       0.073   0.600   0.560
## Breakfeast  1.222       0.615   0.408   1.443
## Tidying     0.149       0.133   1.270   0.661
## Dishes      0.063       0.178   0.891   0.625
## Shopping    0.085       0.090   0.581   0.586
## Official    0.688       3.771   0.010   0.311
## Driving     1.538       2.403   3.374   1.789
## Finances    0.886       0.037   0.028   1.700
## Insurance   1.705       0.941   0.868   1.683
## Repairs     2.919       0.947  21.921   2.275
## Holidays    2.831       1.098   1.233  12.445

corrplot(contrib, is.cor = FALSE)

[corrplot visualization of the contribution table: rows are the thirteen tasks, columns Wife, Alternating, Husband, Jointly; the colour scale runs from 0 to 21.92.]

• In the image above, it is evident that there is an association between the column Wife and the rows Laundry and Main_meal.

• There is a strong positive association between the column Husband and the row Repairs.

9 Non Parametric tests

A statistical method is called non-parametric if it makes no assumption on the population distribution or sample size.

This is in contrast with most parametric methods in elementary statistics that assume the data is quantitative, the population has a normal distribution and the sample size is sufficiently large.

In general, conclusions drawn from non-parametric methods are not as powerful as the parametric ones. However, as non-parametric methods make fewer assumptions, they are more flexible, more robust, and applicable to non-quantitative data.

9.1 Sign Test

A sign test is used to decide whether a binomial distribution has an equal chance of success and failure.

Example

A soft drink company has invented a new drink, and would like to find out if it will be as popular as the existing favorite drink. For this purpose, its research department arranges 18 participants for taste testing. Each participant tries both drinks in random order before giving his or her opinion.

It turns out that 5 of the participants like the new drink better, and the rest prefer the old one. At .05 significance level, can we reject the notion that the two drinks are equally popular?

The null hypothesis is that the drinks are equally popular. Here we apply the binom.test function. As the p-value turns out to be 0.09625, and is greater than the .05 significance level, we do not reject the null hypothesis.

binom.test(5, 18)

##

## Exact binomial test

##

## data: 5 and 18

## number of successes = 5, number of trials = 18, p-value = 0.09625

## alternative hypothesis: true probability of success is not equal to 0.5

## 95 percent confidence interval:

## 0.09694921 0.53480197

## sample estimates:

## probability of success

## 0.2777778

At .05 significance level, we do not reject the notion that the two drinks are equally popular.
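The exact two-sided p-value can also be obtained directly from the binomial distribution (the sketch below doubles the lower-tail probability):

```r
2 * pbinom(5, 18, 0.5)  # 0.09625, the p-value reported by binom.test
```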

9.2 Wilcoxon Signed-Rank Test

The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. This tutorial describes how to compute paired samples Wilcoxon test in R.

9.3 Paired-Samples Wilcoxon Test: Example

Here, we’ll use an example data set, which contains the sales before and after the treatment (discount).

# Sales before treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Sales after treatment
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)

# Create a data frame

my_data <- data.frame(
  group = rep(c("before", "after"), each = 10),
  sales = c(before, after)
)

We want to know if there is any significant difference in the median sales before and after the treatment.

library("dplyr")

##

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##

## filter, lag

## The following objects are masked from 'package:base':

##

## intersect, setdiff, setequal, union

group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(sales, na.rm = TRUE),
    IQR = IQR(sales, na.rm = TRUE)
  )

## # A tibble: 2 x 4

## group count median IQR

## <fctr> <int> <dbl> <dbl>

## 1 after 10 393 28.8

## 2 before 10 195 12.6

Question: Is there any significant change in the sales before and after the treatment?

res <- wilcox.test(sales ~ group, data = my_data, paired = TRUE)
res

##

## Wilcoxon signed rank test

##

## data: sales by group

## V = 55, p-value = 0.001953

## alternative hypothesis: true location shift is not equal to 0

The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median sales before the treatment are significantly different from the median sales after the treatment.

9.4 Kruskal-Wallis Test

Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.

Example

Here, we’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.

my_data <- PlantGrowth
head(my_data)

##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl

Summary by group:

group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )

## # A tibble: 3 x 6
##   group  count  mean    sd median   IQR
##   <fctr> <int> <dbl> <dbl>  <dbl> <dbl>
## 1 ctrl      10  5.03 0.583   5.15 0.743
## 2 trt1      10  4.66 0.794   4.55 0.662
## 3 trt2      10  5.53 0.443   5.44 0.467

We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.

kruskal.test(weight ~ group, data = my_data)

##

## Kruskal-Wallis rank sum test

##

## data: weight by group

## Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.

From the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, but we don’t know which pairs of groups are different.

It’s possible to use the function pairwise.wilcox.test() to calculate pairwise comparisons between group levels with corrections for multiple testing.

pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "BH")

## Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot

## compute exact p-value with ties

##

## Pairwise comparisons using Wilcoxon rank sum test

##

## data: PlantGrowth$weight and PlantGrowth$group

##

## ctrl trt1

## trt1 0.199 -

## trt2 0.095 0.027

##


## P value adjustment method: BH

The pairwise comparison shows that, only trt1 and trt2 are significantly different (p < 0.05).

