User:Mark W. Miller

American Civil War

 * Battle of Salineville (written before registering, editted by User:Scott Mingus)

As of June 2008 the article still largely reflects my contributions. I have always felt a little bad about mentioning events that surrounded the surrender.

Math and Statistics

 * Matrix population models (written)

As of June 2008 the article still largely reflects my contributions, which wasn't much. The article remains only a stub.

Biometrics

 * Mark and recapture (extensively rewritten)

As of June 2008 the article still largely reflects my contributions. Capture-recapture is an enormous field that presently involves some extremely complex models. However, the Lincoln-Petersen method is basically the beginning and the method is still used. The references at the bottom of the page will expose readers to the broader world of capture-recapture if they are interested. Quite a few good books are available on the subject. Some can be downloaded for free from the internet.

Sports

 * Aaron Smith (football player) (extensively rewritten)

As of June 2008 the article still largely reflects my contributions. Smith's stats are up-to-date, as is his current situation with the Pittsburgh Steelers. Many other people have contributed to his page since I added to it.

Newton-Raphson Method and Hessian Matrix
NOTE TO SELF: THERE WAS A TYPO IN THE HESSIAN MATRIX IN THE BOOK. THE INVERSE OF THE HESSIAN IS CORRECT, BUT THE HESSIAN ITSELF WAS NOT.

THE HESSIAN NOW IS CORRECT BELOW.

I posted a question recently about the Hessian Matrix on the Math Talk page and received some good replies. I posted a follow-up question later in the Math and Science Question section, but then found the error in the book myself. Below deals with using the Hessian Matrix and Newton-Raphson Method to locate the minima or maxima of a function. The example in REA's Problem Solvers book "Operations Research", p. 739-740, contains an error in the Hessian Matrix.

The function in the example contains two unknowns, x1 and x2, and is:

4x12 + 2x1x2 + 2x22 + x1 + x2.

The problem is to find the values of x1 and x2 that provide the minima.

The matrix of partial derivatives is: f(x) =  $$\begin{bmatrix}8x1 + 2x2 + 1 \\2x1 + 4x2 + 1\end{bmatrix}$$ = 0.

The Hessian Matrix is:

H(x) = $$\begin{bmatrix}8&2\\2&4\end{bmatrix}$$;

Not:

H(x) = $$\begin{bmatrix}8&2\\8&4\end{bmatrix}$$ as in the book.

The third matrix is indeed the inverse of the Hessian Matrix, and is given correctly in the book as:

H-1(x) = (1/14) * $$\begin{bmatrix} 2&-1\\-1&4\end{bmatrix}$$.

The interative formula used to find the values of x1 and x2 corresponding to the minima of the original function is:

x(n+1) = x(n) - H-1(x) * f(x);

and gives the values x(n+1)1 = -1/14 and x(n+1)2 = -3/14.

I understand, at least mechanically, the substitutions and algebra used to arrive at those answers given the formula for x(n+1).

Note that:

H(x) * H-1(x) = $$\begin{bmatrix} 1&0\\0&1\end{bmatrix}$$.

Note that the eigenvalues of the Hessian Matrix are 8.8284271 and 3.1715729, which I think corresponds to a minima, which is what the REA "Operations Research" book said was the case with this particular function.

Brand New to LaTeX and MiKTeX
Installed MiKTeX on an Windows XP computer. The MiKTeX download included LaTeX.

http://www.latex-project.org/

http://miktex.org/

I opened a DOS window and went to the folder containing 'latex.exe':

C:\Program Files\MiKTex 2.7\mikTex\bin>

then typed 'latex hello':

C:\Program Files\MiKTex 2.7\mikTex\bin>latex hello

and it worked.

Mark W. Miller (talk) 06:57, 19 May 2009 (UTC)

Creating a PDF file using LaTeX
The steps immediately above worked to create a file called 'hello.dvi'. I was able to convert 'hello.dvi' into a PDF file using the following two steps:

Step 1. Create a PostScript file:

C:\Program Files\MiKTex 2.7\mikTex\bin>dvips hello.dvi -o hello.ps

Step 2. Create a PDF file using the PostScript file from the first step:

C:\Program Files\MiKTex 2.7\mikTex\bin>ps2pdf hello.ps hello.pdf

Note that the file 'hello.tex' was also in the folder:

C:\Program Files\MiKTex 2.7\mikTex\bin>

as were the executable files 'dvips.exe', 'ps2pdf.exe' and 'latex.exe' which were installed automatically when I installed 'MiKTeX 2.7'.

The file 'hello.tex' was obtained via an internet search, copied, and pasted into MS Wordpad and saved on my computer in plain text format.

Mark W. Miller (talk) 08:17, 7 June 2008 (UTC)

Graphical User Interface for LaTeX
Installed TeXnicCenter which serves as a shell or text editor or graphical user interface for LaTeX and created another version of 'hello.pdf' from within the TeXnicCenter application.

http://www.toolscenter.org/

Mark W. Miller (talk) 12:36, 7 June 2008 (UTC)

Questions I Have Posted at the Mathematics Reference Desk
Below are some questions I have posted at the Mathematics and Computing Reference Desks. The answers were helpful and I have periodically gone back to refresh my memory.

1. Derivative and partial derivative of what I called the inverse of the logit function.

http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2007_March_19#derivative_of_the_inverse_of_a_logit_function

Note that the answer was vandalized, but I do not think the content of the question or answer was changed.

2. Estimating the variance-covariance matrix of a multinomial function

http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2007_September_28

Note that I answered my own question. I believe my answer is correct. Hopefully so. If somebody types "Estimating the variance-covariance matrix of a multinomial function" into Google, the first hit returned is my question and answer!

3. Poisson Distribution

http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Mathematics/2007_September_22

This question dealt with how to determine whether a variable follows a Poisson Distribution.

4. Displaying the gradient when optimizing in R (programming language)

http://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archives/Computing/2008_May_31#Displaying_the_gradient_when_optimizing_in_R

Note that this Question #4 was posted at the Computing Reference Desk, not the Mathematics Reference Desk. The question was not answered there.

Here is a possible solution I have not yet tried:

http://tolstoy.newcastle.edu.au/R/e6/help/09/02/5786.html

Also, I might try the spg function. I think the optimization function nlm in R will show the gradient too.

Example of Double Integration with Program R
Program R allows you to perform double and triple integration. I had trouble locating example R code for double integration, so I provide an example here. The function is f(x,y) = x^2 * y; x ranges from -1 to 2 and y ranges from 0 to 2; and the answer is 6.

library(adapt)

fint <- function(p) {

x = p[1]

y = p[2]

x^2 * y }

adapt(ndim = 2, lower = c(-1,0), upper = c(2,2), functn = fint)

Example of Triple Integral in Program R
R code for a triple integral taken from page 808 of Ellis and Gulick, second edition (1982). The answer is -54.

library(adapt)

fint <- function(p) {

x = p[1]

y = p[2]

z = p[3]

y - x * z }

adapt(ndim = 3, lower = c(2,-1,0), upper = c(4,1,3), minpts=10000, functn = fint, eps = 1e-6)

Mark W. Miller (talk) 07:32, 19 May 2009 (UTC)

Reading an Excel 2007 file into Program R
The R code below reads (or imports) and displays data from an Excel 2007 file named 'test.xlsx' on the U drive.

library(RODBC)

channel <- odbcDriverConnect("DRIVER=Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb); DBQ=U:\\test.xlsx; ReadOnly=False") sqlTables(channel)

my.data <- sqlFetch(channel, "Sheet1")

my.data

Mark W. Miller (talk) 00:27, 30 October 2009 (UTC)

Add Confidence Intervals to a Plot in Program R when you Have Three Columns of Numbers
The code below adds confidence intervals to a plot in Program R when you, in effect, enter the values of the confidence intervals by hand instead of having them generated by a linear model statement, for example.

The program uses a loop to plot the points and their confidence intervals one at a time. The arrow statement is used to add the confidence intervals and the point statement is used to add their point estimates.

I learned how to do this by dissecting the R code at the following site:

http://yihui.name/en/2007/10/demonstration-of-confidence-intervals-using-r/

I am not sure why there are dashed lines around sections of the code, but if you copy all of the code and paste it into Program R it should work. Then you can modify it to suit your purposes.


 * 1)   the first two lines determine the size and position of the plot, default is a large box


 * 1)   par(mar = c(4.5, 4, 2.5, 0.5))
 * 2)   layout(matrix(1:2, 2), heights = c(0.6, 0.4))

n = 5                                # number of points on the graph

m = c(2,3,4,5,6)                     # point estimates

y0.a = c(0.25, 0.5, 0.75, 1.0, 1.25) # value to subtract from m for lower CI        y1.a = c(0.25, 0.5, 0.75, 1.0, 1.25)  # value to add to m for upper CI

y0 = m - y0.a                        # lower 95% confidence interval y1 = m + y1.a                        # upper 95% confidence interval

# the plot statement below creates the plot border and labels

plot(1, xlim = c(0.5, n + 0.5), ylim = c(min(y0), max(y1)),           type = "n", xlab = "Samples", ylab = "Measurement", main = "My Simple Title")

# the loop below adds the points and their CI bars one point at a time

for (i in 1:n) { arrows(i, y0[i], i, y1[i], length = 0.05, angle = 90, code = 3, col = "black")

points(i, m[i]) }

Mark W. Miller (talk) 01:49, 23 December 2009 (UTC)

View or Read Source Code in Program R
To see the source code for a function in Program R download the package containing the function. Specifically, download the file that ends in "tar.gz". This is a compressed file. Expand the compressed file using, for example, "WinZip". Now you need to open the uncompressed file that ends in ".tar". Download the free software "7-Zip". Click on the file "7zFM.exe" and navigate to the directory containing the ".tar" file. You can extract the contents of that ".tar" file into a new folder. The contents consist of R files showing the source code for the functions in the R package.

Mark W. Miller (talk) 18:53, 28 September 2010 (UTC)

Program R code for a Bayesian Analysis of Coin Flips
The Program R code below calls WinBUGS to estimate the probability of obtaining heads given that 17 heads have been observed in 20 flips. Note that WinBUGS returns a point estimate for p of 0.819 rather than 0.85. If the number of heads and flips is multiplied by 1000 to 1700 and 2000 then the point estimate of p will be much closer to 0.85. You can copy the code below and paste in directly into Program R and it should run if you also have WinBUGS installed and have installed the R2WinBUGS R package. You will probably have to change, or create, the directory I use in the sink statement and the bugs statement.

library("R2WinBUGS")

r <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0)

my.data  <- list ( "r" )

my.params <- c( "p" )

my.inits <- function{list( p = 0.85 )}

sink("c:/Bayesian coin flip/bayesian coin flip.txt")

cat("

model {

for(i in 1:20) {

r[i] ~ dbin(p,1)

}

p ~ dunif(0,1)

}

",fill=TRUE)

sink fit   <- bugs (my.data, my.inits, my.params, "c:/Bayesian coin flip/bayesian coin flip.txt",

n.chains=3, n.iter=100000, n.burnin=50000, n.thin = 5, debug=TRUE )

print(fit, digits=5)

Mark W. Miller (talk) 07:51, 14 June 2011 (UTC)

C Code Input Output File or C Code Read Write File
C Code to Read a Data File, Perform a Calculation, and Write a Data File

The C code below reads an input data file, performs a simple calculation on the data, and writes the data and the result of the calculation to a new output data file. The C code also displays the data and the result of the calculation on the computer screen to help in debugging the code if an error should occur.

In this example the data file I used is called "mydata.dat" which consisted solely of a single column of numbers 1-20 (i.e., with no header and no delimiter). However, nothing in the C code below specifies that there are 20 observations in the data file. You can use any column of integers you wish (and as many integers in the column as you wish) in your input data file.

Note that I am not clear on how to format text for Wikipedia. The code below is correct. I am not sure why some lines below are displayed inside a box and some lines are not. I think the indented lines below are displayed inside a box.

Nevertheless, you should be able to copy the code below directly from the computer screen, paste it into a text file named, for example, "readwrite.c", and the code should compile to an .exe file using, for example, Jens' File Editor. The resulting .exe file should read, calculate, display and write when executed (i.e., when clicked on).

#include 

#include 

char quit;

main

{ FILE *md, *mdout;

int i;

int ii ;

if (((md   = fopen("mydata.dat" , "r")) == NULL) |     ((mdout = fopen("myoutput.dat", "w")) == NULL)) printf("a file could not be opened\n");

else {

printf("%-10s%s\n", "I", "(I + 1000)"); fscanf(md, "%d", &i);

while (!feof(md)) {

ii = i + 1000 ;

printf("%-10d%d\n", i, ii);

fprintf(mdout, "%-10d%d\n", i, ii);

fscanf(md, "%d", &i);

} }

printf("To close this program type 'quit' and hit the return key\n"); printf("              \n");

scanf("%d", &quit);

return 0;

}

Mark W. Miller (talk) 19:49, 19 June 2011 (UTC)

R Code For Multiplying a Vector and a Matrix
If you want to multiply a vector and a matrix you need to remember that 'two matrices can be multiplied only when the number of columns in the first equals the number of rows in the second' (quoted text from the Wikipedia article titled 'Matrix (mathematics)').

I wanted to multiply the column matrix 'a':

0.4 0.6

and the 4x4 matrix 'b':

0.3 0.45 0.7 0.55

I expected the answer to be:

0.39 0.61

However, I received an error: 'Error in a %*% b : non-conformable arguments' when I typed:

a %*% b

I obtained the answer I expected by switching the order of a and b:

b %*% a

Below is the R code:

a <- matrix(c(0.4,0.6), nrow=2)

a

b <- matrix(c(0.3, 0.7, 0.45, 0.55), nrow=2, byrow=F)

b

a %*% b

b %*% a

Mark W. Miller (talk) 22:22, 10 December 2011 (UTC)

R Code for Entering a Data Set from the Keyboard when both Numeric and Character Variables are Present
Most data sets seem to have both numeric and character variables. However, entering such data sets into Program R with the keyboard is difficult. Matrices in R require that all variables be the same type or mode (e.g., variables either are all numeric or all character). Below is R code showing an example of how to convert a variable from character to numeric after first entering the data into a matrix using the keyboard. This seems to be a common problem, but I have not seen the solution presented elsewhere.

First enter the data into a matrix. Then convert the matrix to a data frame while specifying that none of the variables are to be treated as factors. Instead all of the variables will be considered categorical. This is specified with the stringsAsFactors=F statement as shown below. Then use the aa[,1] <- as.numeric(aa[,1]) statement to convert, in this example, the first column from character to numeric. Now you have a data frame with one numeric and two character variables. One alternative is to use the read.table statement to read the data from an external file. However, the code below shows how to do it when data are entered directly from the keyboard.

aa <- matrix(c(

26,1,"L",

70,1,"L",

52,1,"L",

51,1,"L",

67,1,"L",

18,1,"M",

35,1,"M",

30,1,"M",

36,1,"M",

36,1,"H",

21,1,"H",

43,1,"H",

26,1,"H",

27,2,"L",

14,2,"L",

42,2,"M",

26,2,"M",

20,2,"H",

21,2,"H",

24,2,"H",

28,3,"H"),

nrow = 21, ncol=3, byrow=T)

aa <- as.data.frame(aa, stringsAsFactors=F)

colnames(aa) <- c("abundance", "variable1", "variable2")

str(aa)

aa[,1] <- as.numeric(aa[,1])

str(aa)

zz1 <- by(aa[, 1], aa[,2:3], max)

zz1

Mark W. Miller (talk) 23:17, 21 December 2011 (UTC)

C Code to Enter, Fill, Load, Put or Place Numeric Data into an Array or Matrix
If you want to put static numeric data into an array in C then perhaps try the following example of a 2-dimensional array with 3 rows and 4 columns:

double myarray[3][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};

Mark W. Miller (talk) 04:10, 6 March 2012 (UTC)

Running Compiled C Code and Exporting to a Text Output File
Here is an alternative way to export output from a C file to a text file. I wrote the standard 'hello, world' file using Jens' File Editor and saved the file as 'helloworld.c'. Then clicked on the 'CompileLink' tab in Jens' File Editor to create the executable 'helloworld.exe'. Here are the contents of 'helloworld.c' without indentation.

#include 

main

{

printf("hello, world\n");

}

Clicking on 'helloworld.exe' will cause the program to flash the expression 'hello, world' on the screen too quickly to be read. At least I think so. The flash is so brief I cannot be certain what is printed on the screen.

I next opened a command window in Windows by typing 'cmd' in the box in the lower left corner of the screen under the Windows icon. Suppose the files 'helloworld.exe' and 'helloworld.c' are in a folder 'c:\c stuff'. In the command window change directory to 'c:\c stuff' and then create the line 'c:\c stuff\helloworld i=helloworld.c > helloworld.out'. That will create an output text file 'helloworld.out' containing the expression 'hello, world'.

I am using the free MinGW gnu C compiler. Jens' File Editor is also free. MinGW is in the root (c:\) directory. Jens' File Editor is not. On one computer I had to alter the path statements in the Jens' File Editor file 'jfe.ini' before Jens' File Editor would compile C code. The path statements in 'jfe.ini', I believe, must point to the location of the MinGW application.

Mark W. Miller (talk) 21:08, 19 March 2012 (UTC)

R Code for Coin Flips with Frequentist Logistic Regression
Below is R code for estimating the probability of getting heads if you flip a coin 100 times and get 20 heads. The code uses frequentist logistic regression. The point estimate of the probability is obtained three different way: 1) e(x) / (1+e(x)), 2) plogis and 3) predict. The third approach, predict, also returns the standard error of the point estimate of the probability by setting se.fit=T.

N <- 100

C1 <- 20

my.model1 <- glm(cbind(C1, N - C1) ~ 1, family = binomial)

my.model1

exp(as.numeric(my.model1[1])) / (1 + exp(as.numeric(my.model1[1])))

plogis(as.numeric(my.model1[1]))

my.parm1 <- predict(my.model1, type="response", se.fit=T)

as.numeric(my.parm1[1:2])

my.logLik1 <- logLik(my.model1)

my.logLik1

Mark W. Miller (talk) 09:17, 11 July 2012 (UTC)

R Code for Estimating Probability of Heads when Simulating Coin Flips
Suppose you want to estimate the probability of obtaining heads by flipping a coin, and you do not want to use either frequentist or Bayesian logistic regression. You could use the R code below to obtain both a point estimate of the probability of obtaining heads and an estimate of the standard error of the probability of obtaining heads.

Suppose the true probability of obtaining heads is 0.20. For an individual attempt at estimating that true probability you flip the coin 100 times and record the number of heads divided by the number of coin flips. That quantity is stored in a vector called 'est.p' in the R code below. I refer to 100 flips as an 'iteration' or a 'set'. Now you must repeat your set (your 100 flips) numerous times and for each set record the number of heads divided by 100 (the number of coin flips per set in this example). The R code below simulates one million sets with 100 coin flips in each set. In the R code below 'iters' is the number of sets, 'N' is the number of coin flips per set, and 'p.heads' is the true probability of obtaining heads on any given coin flip (which is used to simulate the data).

After repeating a large number of sets your point estimate of the probability of obtaining heads is the mean of the vector est.p (in other words, the mean of the one million values stored in the vector 'est.p'.). The estimate of the standard error of the probability of obtaining heads is the standard deviation of the one million values stored in the vector 'est.p'.

If you increase the number of sets from one million to ten million or 100 million your estimates obtained with the R code below might be virtually identical to the estimates you could obtain using the R code for logistic regression provided above. However, if you attempt 100 million sets you might have to wait a long time for the R code to run and your computer might crash before the 100 million sets have been simulated. Note that if you obtain 20 heads out of 100 flips in one set then the frequentist logistic regression code above will return p(heads) = 0.20 (SE = 0.04), while the R code below (with one million sets) returns p(heads) = 0.1999797 (SE = 0.03997182) which is identical if values are rounded to 4 decimal places.

set.seed(1234)

iters  <- 1000000

N      <- 100

p.heads <- 0.20

est.p <- rep(NA, length=iters)

for(i in 1:iters) {

C1 <- rbinom(N,1,p.heads)

est.p[i] <- sum(C1) / N

}

point.estimate.of.p.heads <- mean(est.p)

se.estimate.of.p.heads   <- sd(est.p)

point.estimate.of.p.heads

se.estimate.of.p.heads

Mark W. Miller (talk) 11:02, 11 July 2012 (UTC)

R Code for Formatting Dates
Suppose you have a data set that contains a separate column for month, day and year and you want to convert those three columns into one date column using Program R. The following R code does that and also presents two ways to do the reverse. The second of those two ways is based on suggestions I found from DWin on Stackoverflow here:

http://stackoverflow.com/questions/6550678/split-date-into-different-columns-for-year-month-and-day

http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r

#Create the data set

D <- data.frame(c('November', 'December'), c(16, 31), c(2012,2012), c('A','B'), c(10,100), stringsAsFactors=F)

colnames(D) <- c('month', 'day', 'year', 'group', 'result')

D

#Convert to R date format

D2 <- paste(D[,1], D[,2], D[,3], sep=' ')

D3 <- data.frame(as.Date(D2, "%B %d %Y"), D[,4:5], stringsAsFactors=F)

colnames(D3) <- c('date', 'group', 'result')

D3

#Revert to original format using an inefficient approach

DD3 <- D3

DD3$date <- as.character(DD3$date)

DD3$date <- gsub('-11-', ' November ', DD3$date)

DD3$date <- gsub('-12-', ' December ', DD3$date)

DD3

DD4 <- strsplit(DD3$date, "\\ ")

DD5 <- unlist(DD4)

DD6 <- do.call(rbind, DD4)

colnames(DD6) <- c('year', 'month', 'day')

DD7 <- DD6[, c("month", "day", "year")]

DD7 <- as.data.frame(DD7)

DD7$month <- as.character(DD7$month)

DD7$day  <- as.numeric(as.character(DD7$day))

DD7$year <- as.numeric(as.character(DD7$year))

DD8 <- data.frame(DD7, DD3[,2:3])

DD8

identical(D, DD8)

#Revert to original format using a more efficient approach

DDD3 <- D3

dtstr <- as.character( DDD3[,1])

month <- as.numeric(as.character(sapply(strsplit(dtstr, "-"), "[", 2)))

day  <- as.numeric(as.character(sapply(strsplit(dtstr, "-"), "[", 3)))

year <- as.numeric(as.character(sapply(strsplit(dtstr, "-"), "[", 1)))

DDD4 <- data.frame(month, day, year, D[,4:5])

DDD4$month <- month.name[DDD4$month]

DDD4

identical(D, DDD4)

Mark W. Miller (talk) 23:34, 19 November 2012 (UTC)

Replace One Cell in a Data Frame in R
Below is R code showing how to replace one value or one cell in a data frame. This code works without knowing the row and column numbers of the cell you wish to replace. Note that with this example, if 'df$aa == 'D' & df$bb == 4' does not identify a unique cell in the data frame then any cell for which 'df$aa == 'D' & df$bb == 4' will be replaced with '25'. In the first example below 'df$aa == 'D' & df$bb == 4' does identify a unique cell in the data frame 'df'.

aa <- c('A','D','C','D','E')

bb <- c(1,2,4,4,5)

cc <- c(6:10)

df <- data.frame(aa,bb,cc)

df

df$bb[df$aa == 'D' & df$bb == 4] <- 25

df

str(df)

If you know the row and column numbers of the cell you wish to replace you could use the following. This will replace the value in that specific cell regardless of the value in that cell, even if the value in that cell is a missing observation 'NA'.

aa <- LETTERS[1:5]

bb <- c(1,4,3,4,5)

cc <- c(6:10)

df <- data.frame(aa,bb,cc)

df

df[4,2] = 25

df

str(df)

Mark W. Miller (talk) 02:01, 25 November 2012 (UTC)

Simple SURVIV Example
proc model npar=2 addcell ;

cohort = 10;

4  : s(1) * s(2)  ;

4  : s(1) * (1 - s(2)) ;

1  : (1 - s(1)) * s(2)  ;

labels;

s(1) = survival ;

s(2) = detection ;

proc estimate novar name=one;

initial;

s(1) = 0.50 ;

s(2) = 0.50 ;

proc stop;

Mark W. Miller (talk) 02:17, 4 December 2012 (UTC)

Cautionary Note About Subset in Program R
state = c('Ohio')

my.data = read.table(text = "

fruit state year grade units

apples Ohio 1990  aa    500

apples Ohio 1990  bb    600

apples Ohio 1990  cc    700

orange Ohio 1990  aa     80

orange Ohio 1990  bb     70

orange Ohio 1990  cc     60

apples Iowa 1990  rr    500

apples Iowa 1990  ss    600

apples Iowa 1990  tt    700

orange Iowa 1995  rr     10

orange Iowa 1995  ss     20

orange Iowa 1995  tt     30

", sep = "", header = TRUE, stringsAsFactors = FALSE)

# this does not work

my.data2 <- subset(my.data, my.data$state == state)

my.data2

# this works

my.data3 <- subset(my.data, my.data$state == 'Ohio')

my.data3

# this works

my.data4 <- my.data[my.data$state == state,]

my.data4

Mark W. Miller (talk) 08:07, 20 December 2012 (UTC)

R Code for ifelse statement
Below are several examples of R code for the ifelse statement. Two examples show how to use the ifelse statement with a vector. R code is also presented showing how the ifelse statement can replace a for-loop. I also include code for using the microbenchmark package to time the ifelse statement and compare its speed to that of the for-loop. To make that comparison I inserted the ifelse statement into a function and did the same for the for-loop. The ifelse statement was approximately six times faster than the for-loop. An ifelse statement can be used in this manner because it is vectorized, unlike the if statement.

x <- seq(30,80,5)

x

# [1] 30 35 40 45 50 55 60 65 70 75 80

ifelse(x > 50, (x + 20), x)

# [1]  30  35  40  45  50  75  80  85  90  95 100

df1 = read.table(text = "

AA   BB    CC     DD

1   10     2    0.01

5   15     4    0.02

10   20     6    0.03

21   25     8    0.04

21   30    10    0.05

30   35    12    0.06

30   40    14    0.07

40   45    16    0.08

40   50    18    0.09

50   55    20    0.10

50   60    22    0.11

75   65    24    0.12

75   70    26    0.13

75   75    28    0.14

80   80    30    0.15

80   85    32    0.16

90   90    34    0.17

90   95    36    0.18", sep = "", header = TRUE)

df1

df2 <- df1

df3 <- df1

df4 <- df1

df1$AA <- ifelse(df1$AA > 50, (df1$AA + 20), df1$AA)

df1

df2[,1] <- ifelse(df2[,1] > 50, (df2[,1] + 20), df2[,1])

df2

fun1 <- function(x) { AA <- ifelse(x[,1] > 50, (x[,1] + 20), x[,1])

return(data.frame(AA, x[,2:ncol(x)]))}

df33 <- fun1(df3)

fun2 <- function(y) {

for (i in seq(1:nrow(y))) {

if(y[i,1] > 50) y[i,1] = (y[i,1] + 20) else(y[i,1] = y[i,1])

}

AA <- y[,1]

return(data.frame(AA, y[,2:ncol(y)]))

}

df44 <- fun2(df4)

df1 == df2

df1 == df33

df1 == df44

library(microbenchmark)

microbenchmark(fun1(df3), fun2(df4), times=20)

# Unit: microseconds

#      expr      min       lq    median       uq      max

# 1 fun1(df3) 261.706  265.027  269.5545  271.819  505.302

# 2 fun2(df4) 1732.330 1761.760 1782.7395 1824.396 2490.885

Mark W. Miller (talk) 01:09, 8 January 2013 (UTC)

R Code to Replace Missing Observations with 0 in One Column of a Data Frame
Below is code to replace NA with 0 in R if you only want to replace the missing observations in one column of a data frame. A similar post exists on StackOverflow here: http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r/15014847#15014847

df = read.table(text = "

state year grade   X1   X2   X3

Ohio 1990   aa      1    2    3

Ohio 1991   bb      4    5   NA

Ohio 1992   cc     NA    8    9

Ohio 1993   dd     10   NA   12

Ohio 1994   ee     13   14   15

Ohio 1995   ff     16   17   NA

", na.strings = "NA", header = TRUE, stringsAsFactors = FALSE)

df$X3[is.na(df$X3)] <- 0

df

Mark W. Miller (talk) 02:01, 22 February 2013 (UTC)

R Code to Select Rows in a Data Frame Based on Multiple Criteria
The R code below selects rows in a data frame where var1.a equals var1.b and var2.a equals var2.b and none of elements in those four columns are missing observations (or NA). I select those rows using a which statement and also using subsetting within brackets.

(When I copy and paste this code into R I get an error reading the data. That is because one space of white space is introduced in the empty rows in the data set as a result of including an empty line between rows so the data set appears reasonably formatted in Wikipedia.  In other words, if you copy the code and paste it into Notepad, remove the empty space in the empty rows in the data set, then paste the revised data set into R it should run.)

df = read.table(text = "

state county  year    var1.a  var2.a  var1.b  var2.b

1      1    1990    10       25      20      25

1      2    1990    20       15      20      15

2      1    1990    30       NA      40      25

2      2    1990    40       35      10      35

3      1    2000    20       45      10      NA

3      2    2000    20       55      20      55

4      1    2000    NA       65      NA      NA

4      2    2000    80       NA      30      NA ", sep = "", header = TRUE)

df[!is.na(df$var1.a) & !is.na(df$var1.b) & !is.na(df$var2.a) & !is.na(df$var2.b) & (df$var1.a == df$var1.b) & (df$var2.a == df$var2.b),]

which(((df$var1.a == df$var1.b) & (df$var2.a == df$var2.b)), arr.ind=TRUE)

df[which(((df$var1.a == df$var1.b) & (df$var2.a == df$var2.b)), arr.ind=TRUE),]

The answer is below where the 2 and 6 in the first column are row numbers returned by R:

state county year var1.a var2.a var1.b var2.b

2    1      2 1990     20     15     20     15

6    3      2 2000     20     55     20     55

Mark W. Miller (talk) 07:23, 18 March 2013 (UTC)

A Note About Running R Code Posted on this Site
The R code posted on this site should run. However, if you copy the R code from here and paste it into R you may get an error when reading data sets. That seems to be because white space is included in the empty rows I use to format data sets for Wikipedia. You can copy the R code from here, paste it into Notepad, remove the empty rows in the data sets, paste that revised code into R and it should run.

Mark W. Miller (talk) 07:42, 18 March 2013 (UTC)

Quit, Close, Terminate, End, Interrupt or Stop Program R
Usually if I want to quit Program R I click on the outer-most white X on a red background in the upper right corner of the R window. Alternatively I might type q or quit at the cursor and press Enter. However, sometimes I want to close Program R while a program, code or script is still running. In that case the above approaches might not work. In those instances, I have found that I can stop the code by using the mouse to press the 'Misc' tab at the top of the R window and then clicking either 'Stop all computations' (which seems to stop the current computation) or 'Stop current computation' (which seems to stop all computations). I might have to press 'Alt M' before the computer will allow me to click the 'Misc' tab.

See also here:

http://stackoverflow.com/questions/8370548/how-can-i-interrupt-a-running-code-in-r-with-a-keyboard-command

Mark W. Miller (talk) 05:52, 28 May 2013 (UTC)

Regular Expression to Replace First Character in String or Replace Last Character in String with Program R
Below is R code for regular expressions to replace the first N characters in a string or to replace the last N characters in a string. The line nn = 3 means that I am replacing three characters from a string. You can change nn to whatever number you wish. I am replacing each of nn characters with a '0'. You can replace characters with whatever symbol you wish. For example, to replace each character with a question mark symbol change rep(0, nn) to rep('?', nn).

Note that as described above, data sets posted here should be pasted into Notepad or some other editor to remove formatting before they can be read by Program R. The R code that follows the data set df.1 should run by simply copying the code from this webpage and pasting it directly into Program R.

What I have posted here is a generalization of an answer by dickoa to an earlier question of mine on Stackoverflow: http://stackoverflow.com/questions/16657366/replace-the-first-n-dots-of-a-string-revisited

df.1 <- read.table(text = '

my.string other.stuff

11111      10

.....      20

12345      30

abcde      40

zm234      50

.y900      60

qqqq. 70

', header = TRUE)

df.5 = df.4 = df.3 = df.2 = df.1

nn = 3

# Replace the first three characters of my.string:

df.2$my.string <- sub(sprintf("^.{%s}", nn), paste(rep(0, nn), collapse = ""), df.2$my.string)

# Replace the first three characters of my.string if they are a letter, number or dot:

df.3$my.string <- sub(sprintf("^[a-zA-Z0-9.]{%s}", nn), paste(rep(0, nn), collapse = ""), df.3$my.string)

# Replace the last character of my.string with three 0's:

df.4$my.string <- sub(sprintf(".${%s}", nn), paste(rep(0, nn), collapse = ""), df.4$my.string)

# Replace the last three characters of my.string with 0:

df.5$my.string <- sub(sprintf("...${%s}", nn), paste(rep(0, nn), collapse = ""), df.5$my.string)

Mark W. Miller (talk) 11:32, 6 June 2013 (UTC)

Easy Entry of Individual Covariates into Program MARKs Design Matrix
The following procedure can be used to enter individual covariates into the design matrix in Program MARK. This method may be helpful when the number of individual covariates to enter is large. Enter the individual covariates into Program Excel. Copy one column at a time from Excel. Position the mouse cursor in the cell of the design matrix of Program MARK where you want the first individual covariate in the column to go. Right click on the mouse. Select ‘Paste Clipboard’.

You can copy multiple columns at the same time from the Excel file if you wish. In that case position the mouse cursor in the upper-left-most cell of the design matrix where you wish that block of individual covariates to be positioned (for example, Cell 12 of Column B8:). Then right click on the mouse. Select ‘Paste Clipboard’.

Mark W. Miller (talk) 22:45, 14 June 2013 (UTC)

Easy Naming of Individual Covariates in Program MARK
Naming individual covariates in Program MARK can be time consuming and prone to typographic errors when creating a new file and the number of individual covariates is large. You can create a column of names in Excel then copy that column to the clipboard. Next position the mouse cursor in the cell for the name of the first individual covariate in Program MARK and click 'paste'. This will paste in the names of all the individual covariates.

This is different from, but similar to, entering names of covariates into the design matrix described above.

Mark W. Miller (talk) 23:09, 23 May 2014 (UTC)

Identify Position of First 1 by Row in R
my.data <- read.table(text = '

x1 x2  x3  x4  x5

0  1   1   2   0

0  0   1  NA   2

1 NA   1   2   0

NA NA  NA  NA  NA

NA  0   0   1   1

0  0   1   0   0

0  0   1  NA  NA

', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

as.numeric(t(apply((my.data==1), 1, which.max)))

[1] 2  3  1 NA  4  3  3

--Mark W. Miller (talk) 10:26, 28 July 2015 (UTC)

rdiscrete in Stata or Mata
clear

cd C:\Users\mark.w.miller\Documents\simple_Stata_files

global wave_of_cy   =  3

global wave_numtrips = 34

set obs $wave_numtrips

local wave_obs = _N

clear

use "fakedata_July1.dta", replace

putmata mydata=(napples pcount1 pcount2 pcount3), replace

mata:

`wave_obs'

"wave_of_cy"

$wave_of_cy

p = mydata[.,$wave_of_cy+1]

pp = p :/ sum(p)

rdiscrete(`wave_obs', 1, pp)

end

Mark W. Miller (talk) 15:03, 11 July 2019 (UTC)

Create Data Set in Stata and Load Data Set into Mata
clear

cd C:\Users\mark.w.miller\Documents\simple_Stata_files

input napples pcount1 pcount2 pcount3 pcount4

0 . .33263497 .22213458 .483706

1 . .18085447 .0592791 .07308878

2 . .09918489 .17381628 .06245227

3 . .12741563 .12649602 .04291577

4 . .07782318 .05691737 .05342419

5 . .03847268 .03891907 .04361086

6 . .02907396 .03870039 .03463672

7 . .00939872 .02071542 .02554072

8 . .01879743 .01602958 .01832358

9 . .01923634 .0242933 .01647887

10 . .00983762 .01197392 .03653411

11 . .01967525 .03063997 .01564033

12 . 0 .02302153 .02067067

13 . .01879743 .0234988 .01317734

14 . .00939872 .02846567 .00248811

15 . 0 .00786653 .01114919

16 . .00939872 .01106286 .00873202

17 . 0 .00353461 .00346993

18 . 0 .01751844 .00989827

19 . 0 .00731278 .00044811

20 . 0 .0041034 .00288536

21 . 0 .00064227 .00133846

22 . 0 .01064502 .00661907

23 . 0 .00933989 .0065631

24 . 0 .00209015 .00099294

25 . 0 .00236378 .00276476

26 . 0 .00709134 0

27 . 0 0 0

28 . 0 .00497991 0

29 . 0 0 0

30 . 0 .00268613 .00188223

31 . 0 .00685134 0

32 . 0 0 .00056822

33 . 0 .00701054 0

end

list

save fakedata_July1

clear

use "fakedata_July1.dta", replace

putmata mydata=(napples pcount1 pcount2 pcount3 pcount4), replace

mata : mydata

Mark W. Miller (talk) 15:08, 11 July 2019 (UTC)