Wikipedia:Reference desk/Archives/Mathematics/2015 May 31

= May 31 =

Question in Statistics
I am trying to solve a Question in Statistics, for which we are using R and SAS, and it is about a Survey of a number of women, giving facts about themselves to determine whether or not they have Diabetes. We were given a Training Set of 200 people, then a test set of a further 332, and my understanding in Classification, is the training set is used to get a Model or equation to determine membership of either the group that has diabetes, or the one that does not. We assigned zero for no Diabetes, and 1 if the Lady did have Diabetes. We ran code given to us, and had to answer a number of questions which I did until the last, and this was to be given details of one extra woman, and to work out whether or not she either had diabetes or could be predicted to have it, and I am not sure how to do it. We carried out a Linear Discriminant Analysis, a Logistic Regression and a Quadratic Discriminant Analysis, and the summaries of the LDA and Logistic, which we are to use, are as follows, where below I have decided to show the whole Code : Assignment 4 > setwd("P:/STAT315")  > pima<- read.table("pima.txt",header=TRUE) >pima$type <- factor(pima$type) > pima_test <- read.table("pima_test.txt", header=TRUE) > pima_test$type <- factor(pima_test$type) > # Linear Discriminant Analysis > library(MASS) > (pima_lda <- lda(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, prior=c(0.66, 0.34))) Call : lda(type ~ npreg + glu + bp + skin + bmi + ped + age, data = pima, prior = c(0.66, 0.34)) Prior probabilities of groups:  0    1 0.66 0.34 Group means:        npreg      glu       bp        skin      bmi       ped        age 0  2.916667  113.1061   69.54545  27.20455  31.07424  0.4154848   29.23485  1   4.838235  145.0588   74.58824  33.11765  34.70882  0.5486618   37.69118 Coefficients of linear discriminants:  LD1 npreg 0.0794995781   glu  0.0240316424   bp -0.0018125857   skin -0.0008317413  bmi    0.0494891916 ped   1.2530603130   age    0.0314375125 > tab <- table(pima$type, predict(pima_lda)$class) > (tab[1,2] + tab[2,1])/sum(tab) [1] 0.23 >tabtest<- table(pima_test$type, predict(pima_lda, newdata=pima_test)$class) > (tabtest[1,2] + tabtest[2,1])/sum(tabtest) [1] 0.2018072 > library(ipred) > mypredict.lda <- function(object, newdata) predict(object, newdata = newdata)$class > errorest(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, model=lda, estimator="cv", predict=mypredict.lda, est.para=control.errorest(k=199)) Call: errorest.data.frame(formula = type ~ npreg + glu + bp + skin +    bmi + ped + age, data = pima, model = lda, predict = mypredict.lda,     estimator = "cv", est.para = control.errorest(k = 199)) 199-fold cross-validation estimator of misclassification error Misclassification error: 0.245 > # Logistic Regression > lmod <- glm(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, family=binomial) > summary(lmod) Call: glm(formula = type ~ npreg + glu + bp + skin + bmi + ped + age, family = binomial, data = pima) Deviance Residuals:   Min     1Q      Median     3Q      Max -1.9830 -0.6773  -0.3681   0.6439   2.3154  Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -9.773062  1.770386  -5.520 3.38e-08 *** npreg       0.103183   0.064694   1.595  0.11073 glu         0.032117   0.006787   4.732 2.22e-06 *** bp         -0.004768   0.018541  -0.257  0.79707 skin       -0.001917   0.022500  -0.085  0.93211 bmi         0.083624   0.042827   1.953  0.05087. ped         1.820410   0.665514   2.735  0.00623 ** age         0.041184   0.022091   1.864  0.06228. Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 256.41 on 199  degrees of freedom Residual deviance: 178.39 on 192  degrees of freedom AIC: 194.39 Number of Fisher Scoring iterations: 5 > pclass <- predict(lmod, newdata=pima_test, type="response") > 0.5 > pclass <- predict(lmod, newdata=pima_test, type="response") > tabtestlogistic <- table(pima_test$type, pclass) > (tabtestlogistic[1,2] + tabtestlogistic[2,1])/sum(tabtestlogistic) [1] 0.003012048 > tabtestlogistic <- table(pima_test$type, pclass) > (tabtestlogistic[1,2] + tabtestlogistic[2,1])/sum(tabtestlogistic) [1] 0.1987952 > setwd("P:/STAT315") > pima <- read.table("pima.txt", header=TRUE) > pima$type <- factor(pima$type) > pima_test <- read.table("pima_test.txt", header=TRUE) > pima_test$type <- factor(pima_test$type) > library(MASS) > (pima_qda <- qda(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, prior=c(0.66, 0.34))) Call: qda(type ~ npreg + glu + bp + skin + bmi + ped + age, data = pima, prior = c(0.66, 0.34)) Prior probabilities of groups :   0    1 0.66 0.34 Group means :    npreg    glu      bp       skin    bmi     ped        age 0  2.916667 113.1061 69.54545  27.20455 31.07424  0.4154848  29.23485           1   4.838235 145.0588 74.58824  33.11765 34.70882  0.5486618  37.69118 > tabq <- table(pima$type, predict(pima_qda)$class) > (tabq[1,2] + tabq[2,1])/sum(tabq) [1] 0.23 > tabqtest <- table(pima_test$type, predict(pima_qda, newdata=pima_test)$class) > (tabqtest[1,2] + tabqtest[2,1])/sum(tabqtest) [1] 0.2289157 > library(ipred) > mypredict.qda <- function(object, newdata) predict(object, newdata = newdata)$class > errorest(type ~ npreg + glu + bp + skin + bmi + ped + age, data=pima, model=qda, estimator="cv", predict=mypredict.qda, est.para=control.errorest(k=199)) Call : errorest.data.frame(formula = type ~ npreg + glu + bp + skin +    bmi + ped + age, data = pima, model = qda, predict = mypredict.qda,     estimator = "cv", est.para = control.errorest(k = 199))
 * 1) first, set the working directory to the data file location  (this can be easily done in RStudio Menu/Session/Set working directory or by using setwd("~/path to working directory/")) import the ' ' separated .txt files
 * 1) Variable tab the Name given to the two by two Table from the Pima Type Training Set as shown here
 * 1) From the two by two in the Training Set Table of those with Diabetes and those Without, add
 * 2) Row One Column Two to Row Two Column one, then divide by Total Number of Women, to get
 * 3)  the Training Error for the Linear Discriminant Analysis Model
 * 1) Variable tabtest the Name given to the two by two Table from the Pima Type Test Set as shown here
 * 1) From the two by two in the Test Set Table of those with Diabetes and those Without, add
 * 2) Row One Column Two to Row Two Column one, then divide by Total Number of Women, in
 * 3)  order to obtain  the Error on a Test Set for the Linear Discriminant Analysis Model
 * 1) Cross Validation using the Package ipred
 * 1) Or :> pclass<-predict(lmod,newdata=pima_test,type="response")>.5
 * 1) This Time using the available Data to fit a Quadratic Discriminant Analysis Model
 * 1) Setting up the Quadratic Discriminant Analysis Model
 * 1) Summary of the Model
 * 1) Variable tabq the Name given to the two by two Table from the Pima Type Training Set as shown here,             #  but this time for the qda
 * 1) From the two by two in the Training Set Table of those with Diabetes and those Without, add              # Row One Column Two to Row Two Column one, then divide by Total Number of Women,                   #  this  occasion  to get  the Training Error for the Quadratic Discriminant Analysis Model
 * 1) Variable tabqtest is the Name given to the two by two Table from the Pima Type Test Set as shown here,             #  but now it is for the qda
 * 1) From the two by two in the Test Set Table of those with Diabetes and those Without, add              # Row One Column Two to Row Two Column one, then divide by Total Number of Women,                   #  on this  occasion  to get  the  Error on a Test Set for the Quadratic Discriminant Analysis Model
 * 1) Cross Validation using the Package ipred for Quadratic Discriminant Analysis

199-fold cross-validation estimator of misclassification error

Misclassification error: 0.275

Now my understanding is that for the Logistic, I take the coefficients in the estimates column, and multiply each by the actual data values for this one particular woman, but I am not sure if I use the intercept all seven times, or once or not at all, then the number I find I raise to the power of e, and divide this by this same number to the power of e plus 1, to undo the logit ( expit ). The data for the woman in question is : npreg	glu	bp	skin	bmi	ped	age

5	111	81	33	25.1	0.36	58

which are the seven explanatory variables, and type, either 0 for no Diabetes and 1 for Diabetes, is the Response. In LDA we are told to take the coefficients and multiply each by the values for the woman above and see if it is greater than zero, which here it is, but I do not know what that signifies. I also did work in SAS, which gives two sets of coefficients, 0 for no Diabetes and 1 for Diabetes, and we multiply each of the woman's values by each of the coefficients, and here the value relevant to 0 was greater than the one I worked out for 1, so this suggests to me this Lady will not get Diabetes, or at least not be said to have it. This SAS Data is as follows : Calculations for 0 with respect to no Diabetes : -35.51043-0.17897×5+111×0.09573 +81×0.44203-0.26259×33+25.1× 0.96574  +.36× 4.78151 +0.14675×58  =  35.83263 While the Calculations to do with 1 for there being Diabetes present are as follows : -46.10679-0.05820×5+111×0.13224+81×0.43928-33×0.26386+1.04092×25.1+.36×6.68513+0.19451×58  =  34.97047 Also, I did not understand why in the training set there were some women that were misclassified, as well as in the Test Set, when I thought the Training set was meant to be good enough to predict the test set. Sorry for the longness of my Question How do I sort this out ? Thanks Chris the Russian Christopher Lilly  11:13, 31 May 2015 (UTC)