Conjoint Model


Conjoint analysis indicates consumer preferences for products with multiple characteristics, wherein these characteristics vary among several categories. For example, the researcher might want to learn consumer preferences for a coffee maker with three characteristics: price (with three levels), number of cups brewed (with three levels), and timed start (yes or no). The task is to determine which of the 3 x 3 x 2 = 18 combinations of characteristics is most preferred by consumers.

The Conjoint Model

Conjoint analysis is based on a main-effects analysis-of-variance model. Data are collected by asking subjects about their preferences for hypothetical products defined by attribute combinations. Conjoint analysis decomposes the judgment data into components, based on qualitative attributes of the products. A numerical utility or part-worth utility value is computed for each level of each attribute. Large utilities are assigned to the most preferred levels, and small utilities are assigned to the least preferred levels. The attributes with the largest utility range are considered the most important in predicting preference. Conjoint analysis is a statistical model with an error term and a loss function. Metric conjoint analysis models the judgments directly. When all of the attributes are nominal, the metric conjoint analysis is a simple main-effects ANOVA with some specialized output. The attributes are the independent variables, the judgments comprise the dependent variable, and the utilities are the parameter estimates from the ANOVA model. The following is a metric conjoint analysis model for three factors.

$$y_{ijk} = \mu + \beta_{1i} + \beta_{2j} + \beta_{3k} + \epsilon_{ijk}$$

This model could be used, for example, to investigate preferences for cars that differ on three attributes: mileage, expected reliability, and price. $y_{ijk}$ is one subject's stated preference for a car with the $i$th level of mileage, the $j$th level of expected reliability, and the $k$th level of price. The grand mean is $\mu$, the part-worth utilities of the three attributes are $\beta_{1i}$, $\beta_{2j}$, and $\beta_{3k}$, and the error is $\epsilon_{ijk}$. Nonmetric conjoint analysis finds a monotonic transformation of the preference judgments.
The model, which follows directly from conjoint measurement, iteratively fits the ANOVA model until the transformation stabilizes. The $R^2$ increases during every iteration until convergence, when the change in $R^2$ is essentially zero. The following is a nonmetric conjoint analysis model for three factors, where $\Phi$ denotes the monotonic transformation of the judgments.

$$\Phi(y_{ijk}) = \mu + \beta_{1i} + \beta_{2j} + \beta_{3k} + \epsilon_{ijk}$$

The $R^2$ for a nonmetric conjoint analysis model will always be greater than or equal to the $R^2$ from a metric analysis of the same data. The smaller $R^2$ in metric conjoint analysis is not necessarily a disadvantage, since results should be more stable and reproducible with the metric model. Metric conjoint analysis was derived from nonmetric conjoint analysis as a special case. Today, metric conjoint analysis is used more often than nonmetric conjoint analysis. In the SAS System, conjoint analysis is performed with the SAS/STAT procedure TRANSREG (transformation regression). Metric conjoint analysis models are fit using ordinary least squares, and nonmetric conjoint analysis models are fit using an alternating least squares algorithm.

SAS Program Statements


OPTIONS PAGENO=1 PAGESIZE=56 NOLABEL;
*
* Define a data set named TIRES.
* The variable RANK typically would be the average across all subjects.
*;
DATA TIRES;
INPUT BRAND 1 PRICE 3 LIFE 5 HAZARD 7 RANK 9-10;
CARDS;
1 1 2 1 3
1 1 3 2 2
1 2 1 2 14
1 2 2 2 10
1 3 1 1 17
1 3 3 1 12
2 1 1 2 7
2 1 3 2 1
2 2 1 1 8
2 2 3 1 5
2 3 2 1 13
2 3 2 2 16
3 1 1 1 6
3 1 2 1 4
3 2 2 2 15
3 2 3 1 9
3 3 1 2 18
3 3 3 2 11
;
*
* Set up value labels.
*;
PROC FORMAT;
VALUE BRANDF
1 = 'GOODSTONE'
2 = 'PIROGI'
3 = 'MACHISMO';
VALUE PRICEF
1 = '$69.99'
2 = '$74.99'
3 = '$79.99';
VALUE LIFEF
1 = '50,000'
2 = '60,000'
3 = '70,000';
VALUE HAZARDF
1 = 'YES'
2 = 'NO';
PROC FREQ NOPRINT;
FORMAT BRAND BRANDF. PRICE PRICEF. LIFE LIFEF. HAZARD HAZARDF.;
*
* Conduct nonmetric (i.e., simple) conjoint analysis.
*;
PROC TRANSREG MAXITER=50 UTILITIES SHORT;
ODS SELECT TESTSNOTE CONVERGENCESTATUS FITSTATISTICS UTILITIES;
MODEL MONOTONE(RANK / REFLECT) = CLASS(BRAND PRICE LIFE HAZARD / ZERO=SUM);
OUTPUT IREPLACE PREDICTED;
*;
PROC PRINT LABEL;
VAR RANK TRANK PRANK BRAND PRICE LIFE HAZARD;
LABEL PRANK = 'PREDICTED RANKS';
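*
* Generate an experimental design for the tire study with the MKTEX SAS macro.
* (The macro name is written without its percent sign in this comment because
* macro calls are still resolved inside statement-style comments.)
* The arguments in parentheses after the macro name define:
* The number of categories for each variable.
* The number of combinations being evaluated.
* SEED= [some number] is not strictly necessary, but helps ensure a
* reproducible design.
*;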
%MKTEX(3 3 3 2, N=18, SEED=448)
%MKTLAB(VARS = BRAND PRICE LIFE HAZARD, OUT=SASUSER.TIREDESIGN,
STATEMENTS = FORMAT BRAND BRANDF. PRICE PRICEF. LIFE LIFEF. HAZARD HAZARDF.)
%MKTEVAL;
PROC PRINT DATA=SASUSER.TIREDESIGN;
RUN;
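
As a rough cross-check outside SAS, the part-worth idea can be sketched in Python. This is not what PROC TRANSREG does for the nonmetric model (no monotone transformation is estimated here); it is a metric-style calculation on the reflected ranks. It assumes the 18-profile design is balanced, in which case each level's mean minus the grand mean matches the sum-to-zero main-effects estimate. The function names `part_worths` and `importance` are illustrative only.

```python
# Tire profiles from the DATA step above: (brand, price, life, hazard, rank).
TIRES = [
    (1, 1, 2, 1, 3), (1, 1, 3, 2, 2), (1, 2, 1, 2, 14), (1, 2, 2, 2, 10),
    (1, 3, 1, 1, 17), (1, 3, 3, 1, 12), (2, 1, 1, 2, 7), (2, 1, 3, 2, 1),
    (2, 2, 1, 1, 8), (2, 2, 3, 1, 5), (2, 3, 2, 1, 13), (2, 3, 2, 2, 16),
    (3, 1, 1, 1, 6), (3, 1, 2, 1, 4), (3, 2, 2, 2, 15), (3, 2, 3, 1, 9),
    (3, 3, 1, 2, 18), (3, 3, 3, 2, 11),
]
ATTRIBUTES = ["BRAND", "PRICE", "LIFE", "HAZARD"]


def part_worths(rows):
    """Level mean minus grand mean of the reflected ranks, per attribute."""
    n = len(rows)
    # Reflect the ranks so that larger scores mean "more preferred".
    scores = [n + 1 - row[4] for row in rows]
    grand_mean = sum(scores) / n
    utilities = {}
    for pos, name in enumerate(ATTRIBUTES):
        utilities[name] = {}
        for level in sorted({row[pos] for row in rows}):
            vals = [s for row, s in zip(rows, scores) if row[pos] == level]
            utilities[name][level] = sum(vals) / len(vals) - grand_mean
    return utilities


def importance(utilities):
    """Each attribute's utility range as a percentage of the summed ranges."""
    ranges = {k: max(v.values()) - min(v.values()) for k, v in utilities.items()}
    total = sum(ranges.values())
    return {k: 100.0 * r / total for k, r in ranges.items()}


utilities = part_worths(TIRES)
for name, share in importance(utilities).items():
    print(name, round(share, 1), utilities[name])
```

On these data the price attribute has the widest utility range, so it carries the largest importance score.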

PROC TRANSREG


When using PROC TRANSREG, what are the defaults with pspline?

Proc transreg performs transformation regression in which both the outcome and predictor(s) can be transformed and splines can be fit. Psplines are piecewise polynomials that can be used to estimate relationships that are difficult to fit with a single function. 

In this page, we will walk through an example proc transreg with the pspline option and explore its defaults.  The bspline, spline, and pspline options, when similarly specified, yield the same results.  Their differences lie in the number and type of transformed variables generated for estimation. 

We can begin by creating a dataset with an outcome Y and a predictor X. This example dataset is generated in the SAS examples for proc transreg.


data a;
  x=-0.000001;
  do i=0 to 199;
    if mod(i,50)=0 then do;
      c=((x/2)-5)**2;
      if i=150 then c=c+5;
      y=c;
      end;
    x=x+0.1;
    y=y-sin(x-c);
    output;
    end;
run;

proc gplot data = a;
  plot y*x;
run;

Clearly, there is not a single, continuous function relating Y to X.  The relationship does not appear random, but it does appear to change with X.  Thus it makes sense to try to fit this with splines.  Before running the proc transreg, we can see that our data contains four variables:

proc print data = a (obs = 5); run;

Obs       X       I       C          Y
  1    0.10000    0    25.0000    24.7694
  2    0.20000    1    25.0000    24.4427
  3    0.30000    2    25.0000    24.0234
  4    0.40000    3    25.0000    23.5155
  5    0.50000    4    25.0000    22.9241

In the proc transreg command, we indicate in the model line that we wish to predict variable y without transformation with identity(y). If we wished to model a transformed version of y (the log or rank of y, for example), we would indicate the transformation here.  To predict y, we indicate that we wish to use piecewise polynomial functions of x with pspline(x). We also opted to output a dataset, a2, containing predicted values from the model.

proc transreg data=a;
   model identity(y) = pspline(x);
   output out = a2 predicted;
run;

The TRANSREG Procedure

     TRANSREG Univariate Algorithm Iteration History for Identity(Y)
Iteration    Average    Maximum                Criterion
   Number     Change     Change    R-Square       Change    Note
-------------------------------------------------------------------------
        1    0.00000    0.00000     0.46884                 Converged

We can see in the output above that the model converged and has an R-squared value of 0.47.  Let's look at the dataset output by proc transreg.

proc print data = a2 (obs = 5); run;
Obs  _TYPE_  _NAME_     Y        TY       PY    Intercept    X_1      X_2

  1  SCORE    ROW1   24.7694  24.7694  24.1144      1      0.10000  0.01000
  2  SCORE    ROW2   24.4427  24.4427  23.4722      1      0.20000  0.04000
  3  SCORE    ROW3   24.0234  24.0234  22.8424      1      0.30000  0.09000
  4  SCORE    ROW4   23.5155  23.5155  22.2249      1      0.40000  0.16000
  5  SCORE    ROW5   22.9241  22.9241  21.6195      1      0.50000  0.25000
  
  Obs    X_3      TIntercept      TX_1       TX_2       TX_3        X

  1  0.00100         1        0.10000    0.01000    0.00100    0.10000
  2  0.00800         1        0.20000    0.04000    0.00800    0.20000
  3  0.02700         1        0.30000    0.09000    0.02700    0.30000
  4  0.06400         1        0.40000    0.16000    0.06400    0.40000
  5  0.12500         1        0.50000    0.25000    0.12500    0.50000

In addition to adding the predicted values, py, to the dataset, we can see that a new variable, ty, has been added for the "transformed" value of y (since our transformation was the identity, these values are the same as y); three variables (x_1, x_2, x_3) that are the powers of x have also been added.  Transformations of these three variables and the intercept are also included and are prefixed with 't'.  We can see that, by default, SAS fits a single third-degree polynomial in x to y.  Note that though splines are often used to fit piecewise functions, the default setting when using pspline in proc transreg is to estimate just one function (zero knots).

We can plot the predicted values to see how closely they match the original data. 

legend label=none value=('y' 'predicted y') position=(bottom left inside) mode=share down = 2; 
proc gplot data = a2;
   plot (y py)*x / overlay legend = legend;
run;

For this simple example, we could achieve the same result by running an ordinary least squares regression after transforming x in the same manner as proc transreg.


data a3; set a;
  x2 = x*x;
  x3 = x*x*x;
run;

proc reg data = a3;
  model y = x x2 x3;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: Y

Number of Observations Read         200
Number of Observations Used         200

                             Analysis of Variance
                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     3     7955.26078     2651.75359      57.67    <.0001
Error                   196     9012.65604       45.98294
Corrected Total         199          16968


Root MSE              6.78107    R-Square     0.4688
Dependent Mean       12.04335    Adj R-Sq     0.4607
Coeff Var            56.30551


                        Parameter Estimates
                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1       24.76908        1.95451      12.67      <.0001
X             1       -6.60903        0.84002      -7.87      <.0001
x2            1        0.62721        0.09698       6.47      <.0001
x3            1       -0.01513        0.00317      -4.77      <.0001

In this example, using proc transreg only saves us the step of generating variables. However, we may wish to fit more than one function in a piecewise regression or use more complicated transformations of x.  Doing so with data and proc reg steps quickly becomes unmanageable or impossible, while doing so with proc transreg is effective and efficient.
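
The equivalence between the default pspline fit and a plain cubic least-squares fit can also be checked outside SAS. The following is a minimal Python sketch, assuming the data step above is reproduced faithfully; the helper names (`make_data`, `solve`, `cubic_r_squared`) are hypothetical, and the small Gaussian-elimination solver merely stands in for a regression routine. If the replication is right, the R-squared should land near the 0.47 reported by both procedures.

```python
import math


def make_data():
    """Replicate the DATA step above: 200 points generated in four segments."""
    xs, ys = [], []
    x, c, y = -0.000001, 0.0, 0.0
    for i in range(200):
        if i % 50 == 0:
            c = (x / 2 - 5) ** 2
            if i == 150:
                c += 5
            y = c
        x += 0.1
        y -= math.sin(x - c)
        xs.append(x)
        ys.append(y)
    return xs, ys


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * out[k] for k in range(r + 1, n))
        out[r] = (m[r][n] - s) / m[r][r]
    return out


def cubic_r_squared(xs, ys):
    """Fit a single cubic polynomial by least squares and return R-squared."""
    # Rescaling x to z only improves numerical conditioning; the fitted
    # values (and so R-squared) are the same as for powers of x itself.
    zs = [(x - 10.0) / 10.0 for x in xs]
    basis = [[z ** p for p in range(4)] for z in zs]
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(4)]
           for i in range(4)]
    xty = [sum(row[i] * y for row, y in zip(basis, ys)) for i in range(4)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in basis]
    mean_y = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


xs, ys = make_data()
print(round(ys[0], 4), round(cubic_r_squared(xs, ys), 4))
```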

R BASICS


Basic Commands

  • quit(); q()
  • help(command); help.start()
  • search(); help.search()
  • dir(); methods()
  • library(p); identify(); attach(); detach()
  • remove(); rm()
  • start:end; c(); rep(); seq()
  • scan(); print(); str(); ls()
  • cat(); cat("concatenate", c, "and print", "\t")
  • options(prompt='.', continue="///", digits=10); getOption("width")
  • source(); source.url() # run commands in a file

Simple examples

library() # list packages available
library(car) # load a package
list(data()) # list data sets in the current package
summary(Davis)
list(Davis)
list(Davis$weight)
stem(Davis[,2]) # equal to stem(Davis$weight)
stem(Davis$height, scale=4)
boxplot(Davis$weight)
w<-Davis$weight
h<-Davis$height
plot(w ~ h)
cor(Davis[,c(2:3)])
cor.test(w,h)
t.test(Davis[,2], mu=65)
t.test(Davis$height, Davis$weight, mu=100, paired=FALSE)
var.test(Davis$height, Davis$weight)

d<-read.csv("c:/temp/R/nes.csv", header=TRUE)
list(names(d)) # list variable names

OPERATOR/FUNCTION

Operators

  • <- (left assignment), -> (right assignment)
  • +, -, *, /, ^, %% (modulus)
  • >, >=, <, <=, == (equal), != (not equal)
  • & (and), | (or)
  • %*% (matrix product); %/% (integer division)
  • %o% (Outer product); %x% (Kronecker product)
  • %in% (Matching operator);

Functions

  • abs(); sin(); cos(); tan(); exp(); sqrt(); min(); max()
  • log(); log(v,10); log10(); log2(); log(v, base=10)
  • mean(); sum(); median(); range(); var(); sd()
  • rank(); ave(v, group); by(group)
  • c(a, b, c); c(start:end); seq(start:end); seq(10, 100, by=5)
  • rep(n, time); rep(7, 3); rep(start:end, time)
  • rep(1:3, c(2,2,2)); rep(1:3, each=2); rep(1:3, c(1:3))
  • seq(1,4); seq(1,10, by=2); seq(0,1, length=10)
  • length(), sort(), order(); rev(v) ## to reverse
  • dnorm(1.96); dt(1.96, 100); df(1.96, 1, 100); dchisq(1.96, 10)
  • pnorm(1.96); pt(1.96, 100); pf(1.96, 1, 100); pchisq(1.96, 10)
  • rpois(n, lambda); rnorm(n); rt(n, df); rt(n, df=c(1:10)); rexp(n)
  • substring(s, start, stop); substr(s, start, stop); nchar(s)
  • date()
  • mode() ## type of object

INPUT OUTPUT

Reading Text Files

  • source(f); # to execute commands in the file
  • read.table(f); read.table.url(url)
  • download.file(url); url.show(url)
  • m<-read.table("f:/temp/cigar.txt", header=TRUE)
  • m<-read.table('f:/temp/cigar.txt')
  • names(m)<-c("a", "b", "c")
  • read.csv(f, header=TRUE, sep=",", quote="\"", dec=".")
  • read.csv2(f, header=TRUE, sep=";", quote="\"", dec=",")
  • read.delim(f, header=TRUE, sep="\t", quote="\"", dec=".")
  • read.delim2(f, header=TRUE, sep="\t", quote="\"", dec=",")
  • m<-read.csv("nes2.csv", header=TRUE)
  • read.fwf(file, widths=c(3,5,3), header=FALSE, sep="", as.is=FALSE)
  • as.is=TRUE; as.is=T # not to be converted into a factor
  • na.strings<-c(".", "NA", "", "#") # characters for missing
  • cnt=count.fields(df); which(cnt==7);

Reading Data Frames

  • load(d);
  • data(d); data(d, package="p")
  • data.frame(v1, v2) # to make a data frame out of vectors
  • m3<-data.frame(as.matrix(m[,2:4]))
  • m2<-edit(m); m2<-edit(data.frame(m))
  • data.entry(df)

Handling Data

  • m2<-match(v1, v2, nomatch=0) # data merging
  • m2<-match(m[,1], m[,3])

Writing Data

  • cat(); print()
  • cat("y x1 x2", "2 4 2", "5 2 7", file="sample.txt", sep="\n")
  • write(obj, f)
  • write.table(df, file='firms.csv', sep=",", row.names=NA, col.names=NA)
  • save(obj, file=f); save.image(f)
  • sink(); format()

MATRICES

Defining Matrices

  • m<-c(1, 2, 3, 4); c(1, 2, 3, 4)->m; assign("m", c(1, 2, 3, 4))
  • m<-data.frame(column1=c(1,2,3), column2=c(4,5,6)); ## 3 rows by 2 columns
  • rep(c(1,2,3), 2); rep(c(1,2,3), each=2);
  • rep(c(1,2,3), c(2,2,2)); m<-c(c1=15, c2=54, c3=50)
  • seq(1,4); seq(1,10, by=2); seq(0,1, length=10);
  • intm<-1:4; intm<-numeric(); intm[1]<-1; intm[2]<-2
  • strm<-c("a", "b", "c"); strm<-character(); strm[1]<-"a"; strm[2]<-"b"
  • blm<-c(T,F); blm<-v1>10; ## a boolean vector of TRUE and FALSE
  • m<-scan()
  • mm<-matrix(1:12,4); mm<-matrix(1:12, nrow=4)
  • mm<-matrix(1:12, ncol=3); mm<-matrix(1:12, nrow=4)
  • mm<-matrix(1:12, nrow=4, ncol=3); mm<-matrix(1:12, 4, 3)
  • arrm<-array(1:10); arrm<-array(1:10, dim=c(2,5))
  • cbind(); rbind(); gl(); expand.grid()
  • list()

Referring Matrices

  • m[,2]; v=m[2,]; m[-1, -3] ## to extract elements
  • m[c(1, 5, 6)]; m2=m[-c(1, 5, 6)] ## to extract elements
  • m<-c(c1=15, c2=54, c3=50); m<-c("c1", "c3")
  • m2<-m$c2; m2<-m[,2]; m2<-m[,"c2"]; m2<-m[[2]]
  • m[,3:5]; m3<-m[,c(3, 4, 5)]; m3<-m[,c("c3", "c4", "c5")]
  • m<-c(4, 2, 4); names(m)<-c("Grape", "Pear", "Apple")
  • m1$v2 # variable v2 of the data frame m1
  • which(); which.max(); which.min()
  • attr(m, which); attributes(obj)

Matrix Functions

  • t(); det(); rank(); eigen(); diag(); prod(); crossprod()
  • sum(); mean(); var(); sd(); min(); max(); prod(); cumsum(); cumprod()
  • is.na(m) ## to check if m contains a missing value
  • rowSums(); colSums(); nrow(); ncol()
  • dim(m); dimnames(m)
  • merge(df1, df2)
  • as.factor(); as.matrix(), as.vector(); # conversion
  • is.factor(); is.matrix(), is.vector();
  • class(); unclass()
  • na.omit(); na.fail(); unique(); table(); sample()
  • as.array(); as.data.frame()
  • as.numeric(); as.character(); as.logical(); as.complex()

REGRESSION

Ordinary Least Squares (OLS)

  • lm(); glm()
  • m.ols<-lm(v1~v2+v3, data=m) ## linear model
  • lm(v1~v2+v3, data=m); summary(lm(v1~v2+v3, data=m)); summary(m.ols)
  • names(m.ols); coef(m.ols); fitted(m.ols); resid(m.ols)
  • predict(fit); AIC(fit); logLik(fit); deviance(fit)
  • model.matrix(v1~v2+v3, data=m)
  • m.ols2<-model.matrix(v1~v2+v3, data=m); summary(m.ols2)

Binary Response Regressions

  • m.logit<-glm(v1~v2+v3,family=binomial(link=logit),data=m)
  • summary(m.logit); coef(m.logit); fitted(m.logit); resid(m.logit)
  • lsfit(v1,v2)
  • nls(); m.nonlin<-lm(v1~v2+I(v2^2), data=m) ## I() keeps ^ as arithmetic inside a formula
  • anova(m.ols, m.nonlin)
  • m.qr<-qr(m) ## QR Decomposition of a Matrix

STATISTICS

Descriptives

  • summary(m); fivenum(m)
  • stem(v); boxplot(v); boxplot(v1, v2); hist(v)
  • qqnorm(v); qqline(v)
  • rug(); lines()
  • table() # to make a table
  • tabulate()

Multivariate Analysis

  • cor(m); cor(sqrt(m)) ## Pearson correlation
  • cor.test(v1, v2)
  • prcomp() # Principal components (stats package, formerly mva)
  • kmeans() # K-means cluster analysis (stats package, formerly mva)
  • factanal() # Factor analysis (stats package, formerly mva)
  • cancor() # Canonical correlation (stats package, formerly mva)

Categorical Data Analysis

  • chisq.test(v1,v2) ## Pearson Chi-squared Test
  • fisher.test(v1,v2) ## Fisher Exact Test
  • friedman.test(v1,v2) ## Friedman Test
  • prop.test(); binom.test() ## sign test
  • kruskal.test(v1,v2) ## Kruskal-Wallis Rank Sum Test
  • wilcox.test(v1,v2) ## Wilcoxon Rank Sum (Mann-Whitney) Test
  • ks.test(v1,v2) ## Two Sample Kolmogorov-Smirnov Test
  • bartlett.test(v1,v2) ## Bartlett Test for Homogeneity of Variances

T-TEST AND ANOVA

  • t.test(v1,v2); t.test(v1,v2, var.equal=FALSE)
  • t.test(v1,v2, mu=0, paired=FALSE)
  • t.test(v1,v2, mu=10, paired=F, var.equal=T)
  • power.t.test(v1,v2); pairwise.t.test()
  • var.test(v1,v2) ## F test for equal variance
  • m.anova<-aov(v1~v2+v3, data=m)
  • aov(); anova()
  • summary(m.anova)
  • power.anova.test() ## Power calculations for balanced one-way ANOVA tests

PROGRAMMING

Modules

function_name<-function(arguments) {…}
mile.to.km<-function(mile) {mile*8/5}
km<-mile.to.km(c(35, 55, 75))

Flow Control

if (condition) {…} else if (condition) {…} else {…}
while (condition ) {…} # {} may be omitted for a single line expression
for (index in start:end) {…}
for (i in 1:100) {sum <- sum + i}
repeat {…}
switch (statement, list)

Programming Functions

  • expression(); parse(); deparse(); eval()
  • optim() # general-purpose optimization
  • nlm() # Newton algorithm
  • lm() # linear models
  • nls() # nonlinear least squares model

GRAPHICS

Plotting

  • plot(y~x, data=m, pch=16) # plotting character (pch)
  • pairs(m) # scatterplot matrix
  • xyrange<-range(m) # to get range of m
  • plot(y~x, data=m, xlim=xyrange, ylim=xyrange)
  • abline(0,1)
  • plot((0:10), sin((0:10)*pi), type="l") # "l" joins the points with lines
  • barplot(); boxplot(); stem(); hist();
  • matplot() # matrix plot
  • pairs(m) # scatterplots
  • coplot() # conditional plot
  • stripplot() # strip plot
  • qqplot(); qqnorm(); qqline() # quantile-quantile plot

Options

  • points() # to add points to a plot
  • lines() # to add lines
  • text() # to add texts
  • mtext() # to add margin texts
  • axis() # to control axis
  • par(cex=1.25, mex=1.25)
  • par(mfrow=c(2,2), mfcol=c(1,1))

Regards,

SAS India

SAS Day 3


INPUT OVERVIEW

The INPUT statement describes the arrangement of the data to be read in a DATA step. An INPUT statement lists variable names, each optionally followed by $ (indicating a character value), pointer controls, column specifications, informats, and/or line-hold specifiers (@ and @@).

  • Column pointer controls such as @n and +n move the input pointer to a specified column in the input buffer.
  • Line pointer controls such as #n and / move the input pointer to a specified line in the input buffer.
  • Column specifications specify the columns of the input record that contain the value to read.
  • An informat is an instruction that SAS uses to read data into variables.
  • @, a single trailing @, holds an input record for the execution of the next INPUT statement within the same iteration of the DATA step. Thus, the next INPUT statement reads from the same record (line).
  • @@, a double trailing @, holds the input record for the execution of the next INPUT statement across iterations of the DATA step. Thus, the INPUT statement for the next iteration of the DATA step continues to read the same record (line).

The DATALINES statement (replacing the old CARDS statement) indicates that data lines follow in a DATA step. In order to read external data files, you have to use the INFILE statement.

There are six input styles used in the INPUT statement: list input, column input, formatted input, modified list input, named input, and mixed input. The following table summarizes features of four major styles.

input method

Which input style is the best? It depends on your skills and characteristics of data sets. If your data set has just a few observations with several variables, the list input or the named input will be better than the column input or the formatted input. When data elements are not separated with a blank or other delimiters, you cannot use the list input style. When data are well arranged, the column input or formatted input will be better than the list input. Therefore, you need to examine the data structure carefully when deciding the best input style. Of course, you must take this issue into account from the data coding stage.

LIST INPUT

The list input style simply lists variables separated by blanks. This style is also called free format.

DATA listed;
INPUT name $ id score;
DATALINES /*--1----+----2---*/;
Park 8740031 87.5
Hwang . 94.3
;
RUN;

A character variable should be followed by $. A missing value should be marked with a period (.); a blank does not mean a missing value in this input style. Do not use more than one "." for a missing value. By default a character variable is 8 characters long; that is, a fixed 8 bytes of memory are assigned to each such variable, so a longer string will be truncated. To read a string longer than 8 characters, use a LENGTH, INFORMAT, or ATTRIB statement, or use a different input style such as column input or formatted input.

DATA _NULL_;
LENGTH analysis $15.;
INFORMAT year MMDDYY10.;
INPUT analysis year;
FORMAT year DATE9.;
PUT analysis year; /* write each observation to the log */
CARDS /*--1----+----2---*/;
Regression 1/2/2002
ANOVA 05/05/2007
Time-Series 09/03/1968
;
RUN;
/* Output
Regression 02JAN2002
ANOVA 05MAY2007
Time-Series 03SEP1968
*/

In the example above, you may use “INFORMAT analysis $15.” instead of the LENGTH statement. INFORMAT tells how data are read, while FORMAT tells the format to be displayed. MMDDYY10. reads data in the MM/DD/YYYY format. DATE9. displays date in the DDMMMYYYY format. Without the FORMAT for year, SAS will return odd numbers such as 14884, which are internally used in SAS.
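
A SAS date value is the number of days since January 1, 1960, which is where a number such as 14884 comes from. The conversion can be sketched in Python as follows; the helper names are hypothetical and are not SAS functions, and `date9` only imitates the look of the DATE9. format.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date value 0
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def sas_to_date(days):
    """Turn a SAS date value (days since 1960-01-01) into a calendar date."""
    return SAS_EPOCH + timedelta(days=days)


def date_to_sas(d):
    """Turn a calendar date into the internal SAS day count."""
    return (d - SAS_EPOCH).days


def date9(d):
    """Render a date in the DDMMMYYYY layout."""
    return f"{d.day:02d}{MONTHS[d.month - 1]}{d.year:04d}"


print(date9(sas_to_date(14884)))
print(date_to_sas(date(2002, 1, 2)))
```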

The following example reads a comma-delimited ASCII text file. Remember that the default delimiter is a blank. See the INFILE statement for details.

DATA _NULL_;
INFILE 'a:\tiger.dat' DELIMITER=',' STOPOVER;
INPUT name $ id score;
RUN;

MODIFIED LIST INPUT

The modified list style is a mixture of the list input and the formatted input. This style can deal with ill-structured data. There are three format modifiers to be used especially when reading complex data.

  • colon (:) reads data longer than standard 8 characters or numbers until encountering specified delimiter or reaching the variable width specified.
  • ampersand (&) format modifier reads character values that contain embedded blanks with list input and reads until encountering more than one consecutive delimiter. You may include ” (double quotes) in the value of a character variable.
  • tilde (~) reads and retains single quotation marks, double quotation marks, and delimiters within quoted character values. That is, double quotation marks enclosing a string are treated as values of a character variable.

The following example illustrates how : and & work in INPUT. The "Lindblom80" in the first row is truncated since it exceeds 8 characters; only the first 8 characters, as specified in the INPUT statement, are read and the last two characters "80" are ignored. In the second row, SAS reads the first four characters "Park", which are shorter than 8 characters, and then encounters a comma (delimiter); SAS stops reading data for the variable "name" and moves on to the next variable. The variable "title" is defined by & with a maximum of 50 characters. The comma inside the quoted title in the first and third rows is treated as a character value, not as a delimiter. Two consecutive double quotation marks are read as one double quotation mark. Therefore, the title of the second observation is Reading "Small Is Beautiful" as shown in the output. Characters exceeding the maximum, 50 characters in this case, will be ignored.

DATA modified;
INFILE DATALINES DELIMITER=',' DSD;
INPUT name : $8. title & $50.;
DATALINES;
Lindblom80,"Still Muddling, Not Yet Through"
Park, "Reading ""Small Is Beautiful"""
Simon, """It was a disaster,"" he continue…"
RUN;

/* Output
Lindblom Still Muddling, Not Yet Through
Park     Reading "Small Is Beautiful"
Simon    "It was a disaster," he continue…
*/

The INFILE statement above says that data are comma delimited and will be listed after DATALINES. DSD at the end of INFILE eliminates double quotation marks enclosing the character value when reading data. If you omit DSD, SAS will consider a comma in character values as a delimiter and read enclosing double quotation marks as character values. As a result, the output would look like,

Lindblom "Still Muddling
Park "Reading ""Small Is Beautiful"""
Simon """It was a disaster,"" he continue…"
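
The quoting rules that DSD switches on resemble the way Python's standard csv module treats quoted fields, so the same three records can be used to illustrate them. This is only an analogy and not a model of the SAS scanner: `skipinitialspace=True` is an assumption made here to cope with the blank after the comma, and slicing the name to 8 characters merely imitates the $8. width. The function name `read_quoted` is hypothetical.

```python
import csv
import io

DATA = '''Lindblom80,"Still Muddling, Not Yet Through"
Park, "Reading ""Small Is Beautiful"""
Simon, """It was a disaster,"" he continue…"
'''


def read_quoted(text, name_width=8):
    """Parse comma-delimited records whose quoted fields may hold commas."""
    rows = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for name, title in reader:
        # Enclosing quotes are dropped, doubled quotes collapse to one,
        # and a comma inside the quotes stays part of the value.
        rows.append((name[:name_width], title))
    return rows


for name, title in read_quoted(DATA):
    print(name, title)
```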

The second example shows how ~ (tilde) and DSD work together to read a string with a delimiter. SAS reads a comma in the string as a character value but does not eliminate the double quotation marks enclosing the string. If you omit DSD, the title of the second row will be '"Still Muddling' because SAS treats a comma in the string as the delimiter and stops reading the character value for the variable "title."

DATA modified;
INFILE DATALINES DELIMITER=',' DSD;
INPUT name : $20. year : 4.0 title ~ $50.;
DATALINES;
Meyer and Rowan,1977,"Institutionalized Organization"
Lindblom,1979,"Still Muddling, Not Yet Through"
RUN;
/* Output
Meyer and Rowan 1977 "Institutionalized Organization"
Lindblom        1979 "Still Muddling, Not Yet Through"
*/
/* Output without DSD
Meyer and Rowan 1977 "Institutionalized Organization"
Lindblom 1979 "Still Muddling
*/

You may not omit the : after "year" in the INPUT statement above even when the data are in the same fixed format. When the variable "year" is the last one listed in the INPUT statement, the : is not necessary.

COLUMN INPUT

The column input style reads the value of a variable from its specified column location. A variable name is followed by its starting and ending columns.

DATA columned;
INPUT name $ 1-5 id 6-12 score 14-17; /* variable name, then its start and end columns */
CARDS /*--1----+----2---*/;
Park 8740031 87.5
Hwang9301020 94.3
;
RUN;

SAS reads the variable “name” from columns 1 through 5, “id” from columns 6 through 12, and so on. This input style works well for well-structured data.

FORMATTED INPUT

The formatted input style reads input values with informats specified after variable names. Informats provide the data type and the width of an input value. Numeric variables are expressed in the w.d format, where w represents the total width of a value and d the number of digits after the decimal point. You may omit d when d = 0 (7. is equivalent to 7.0, as the examples below show), but you cannot omit the period. The $CHARw. or $w. informat is used for character variables, while the DATEw. or DDMMYYw. informat is used for dates.

DATA formatted;
INPUT name $5. id 7. score 4.1; /* 4.1: width 4 with one implied decimal place */
DATALINES /*--+----2---*/;
Park 8740031 875 /* score=87.5 */
Hwang9301020 943 /* score=94.3 */
;
RUN;

You can use parentheses to simplify expressions.

DATA formatted;
INPUT name $5. id 7. (grade1-grade3) (3.); /* one informat applied to a list of variables */
DATALINES /*--+----2---*/;
Park 8740031 89 95100
Hwang9301020100 93 99
;
RUN;

The following example illustrates how effectively the formatted input uses column pointer controls, informats (e.g., COMMAn., DOLLARn., PERCENTn., and MMDDYY10.), and parentheses. SAS reads a variable x1 as a string five characters long, a numeric variable x2 seven digits long without a decimal point, three-digit numeric variables x3 through x5, and then skips one column (+1) before reading a numeric variable income that contains commas.

DATA formatted;
INPUT (x1-x5) ($CHAR5. 7. 3*3.0) +1 income COMMA7.; /* 3*3.0 repeats the 3.0 informat three times */
DATALINES /*--+----2----+----3*/;
Park 8740031 89 95100 84,895
Hwang9301020100 93 99 168,579
;
RUN;
/* Output
Park 8740031 89 95 100 84895
Hwang 9301020 100 93 99 168579
*/

The formatted input can use both column and line pointer controls. These pointer controls are very useful when reading multiple observations from the same line or reading an observation from multiple lines.

  • @n, a column control, moves the input pointer to the nth column
  • @@, a line holder, keeps the pointer on the current line and waits for further input
  • +n, a column control, moves the pointer to the right by n columns
  • #n, a row control, goes to the nth line
  • / goes to the first column of the next line
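As a rough illustration of the column pointer controls listed above, the sketch below reads fixed-position records with @n and +n. The data set name pointers and the record layout are assumptions made for this example only.

```sas
DATA pointers;
  /* @1 and @7 jump to absolute columns; +1 skips the blank in column 14 */
  INPUT @1 name $5. @7 id 7. +1 score 4.1;
  DATALINES;
Park  8740031 87.5
Hwang 9301020 94.3
;
RUN;

PROC PRINT DATA=pointers;
RUN;
```

Because the score values already contain a decimal point, the d part of the 4.1 informat is not applied to them.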

NAMED INPUT

The named input reads a data value that follows its variable name. A variable name and its data value are separated by an equal sign. String data are NOT enclosed by double quotation marks in this style. Like the list style, the named style supports only the standard length of variables. The style provides some flexibility, but it is not appropriate for a large data set.

DATA named;
INPUT name=$ id= grade=; /* each value is looked up by the name= prefix in the data line */
DATALINES;
name=Park id=8740031 grade=89
name=Hwang id=9301020 grade=100
;
RUN;

MIXED INPUT

The INPUT statement can contain list input, column input, formatted input, and/or named input.

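DATA mixed;
/* column input (name, grade2), formatted input (id, grade1), and pointer controls (@7, +1) */
INPUT name $ 1-5 @7 id $7. +1 grade1 3. grade2 18-22;
CARDS /*--1----+----2---*/;
Park  8740031  89 95.1
Hwang 9301020 100 93.9
;
RUN;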

READING MULTIPLE OBSERVATIONS

Let us read multiple observations from a line using the formatted input style. The following script reads the string variables “name” and “id” consecutively, reads the three-digit numeric variables x1 through x3, and then keeps reading the next observation, if available, without moving to the next line.

DATA formatted;
INPUT name $ id $ (x1-x3)(3.) @@;
CARDS /*--1----+----2----+----3----+----4----+----5-*/;
Park 8740031  89 95100 Choi 9730625 100100 95
Hwang 9301020 100 93 99 …
;RUN;
/* Output
Park 8740031 89 95 100
Choi 9730625 100 100 95
Hwang 9301020 100 93 99
*/

The following example reads data using a DO loop.

DATA rbd_block;
INPUT treat $ @@;
/* block takes its length (4) from 'High', so 'Medium' is stored as 'Medi' */
DO block='High', 'Medium', 'Low'; /* DO block=1 TO 3;*/
INPUT income @@; OUTPUT;
END;
DATALINES;
Drug1 34 55 34
Drug2 45 56 32
Drug3 45 56 32
;RUN;
/* Output
1 Drug1 High 34
2 Drug1 Medi 55
3 Drug1 Low 34
4 Drug2 High 45
5 Drug2 Medi 56
6 Drug2 Low 32
7 Drug3 High 45
8 Drug3 Medi 56
9 Drug3 Low 32
*/

Suppose individual observations have different numbers of repetitions. Pay attention to the IF and OUTPUT statements.

DATA repeat;
INPUT crop $ no @;
DROP no;
IF no GT 0 THEN DO;
DO trial=1 TO no;
INPUT cost benefit @;
OUTPUT;
END;
END;
DATALINES;
rice 3 54 87 98 77 57 67
bean 2 65 87 96 54
RUN;
/* Output
rice 1 54 87
rice 2 98 77
rice 3 57 67
bean 1 65 87
bean 2 96 54
*/

READING MULTIPLE LINES

Now, let us read observations whose data are provided in multiple lines. The #n or / indicates a data line to be read for the variable.

DATA spanned;
INPUT #1 No 7.0 #2 name $CHAR15. / address $CHAR50. #4 phone $CHAR12.;
DATALINES;
000001
Park
2451 E. 10th St. APT 311
812-857-9425
000002
Hun
800 N. Union St. APT 525
812-857-6256
RUN;
/* Output
1 Park 2451 E. 10th St. APT 311 812-857-9425
2 Hun 800 N. Union St. APT 525 812-857-6256
*/

The INPUT statement above reads a 7-digit numeric variable “No” from the first line (#1), a 15-character string variable “name” from the second line (#2), a 50-character string variable “address” from the next line (/), and a 12-character string variable “phone” from the fourth line (#4). Alternatively, the INPUT statement may be replaced by “INPUT No 7.0 / Name $15. / Address $50. / Phone $12.;”.

SAS Date, Time, and Datetime Functions


SAS date, time, and datetime functions are used to perform the following tasks:

  • compute date, time, and datetime values from calendar and time-of-day values
  • compute calendar and time-of-day values from date and datetime values
  • convert between date, time, and datetime values
  • perform calculations that involve time intervals
  • provide information about time intervals
  • provide information about seasonality

For all interval functions, you can supply the intervals and other character arguments either directly as a quoted string or as a SAS character variable. When you use a character variable, you should set the length of the character variable to at least the length of the longest string for that variable that is used in the DATA step.

Also, to ensure correct results when using interval functions, use date intervals with date values and datetime intervals with datetime values.
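The advice about character variables can be sketched as follows. The variable names (interval, start, next) and the three interval names are illustrative choices; the point is only that the LENGTH statement reserves room for the longest interval name used in the step.

```sas
DATA _NULL_;
  LENGTH interval $ 9;            /* 9 = length of 'SEMIMONTH', the longest name used below */
  start = '15JAN2000'D;           /* a date value, so only date intervals are used */
  DO interval = 'DAY', 'MONTH', 'SEMIMONTH';
    next = INTNX(interval, start, 1);   /* start of the next interval */
    PUT interval= next= DATE9.;
  END;
RUN;
```

Without the LENGTH statement, the variable would take its length from the first value in the DO list and longer interval names would be truncated.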

See SAS Language Reference: Dictionary for a complete description of these functions.

The following list shows SAS date, time, and datetime functions in alphabetical order.

DATE()

returns today’s date as a SAS date value.

DATEJUL( yyddd )

returns the SAS date value when given the Julian date in yyddd or yyyyddd format. For example, DATE = DATEJUL(99001); assigns the SAS date value ’01JAN99’D to DATE, and DATE = DATEJUL(1999365); assigns the SAS date value ’31DEC1999’D to DATE.

DATEPART( datetime )

returns the date part of a SAS datetime value as a date value.

DATETIME()

returns the current date and time of day as a SAS datetime value.

DAY( date )

returns the day of the month from a SAS date value.

DHMS( date, hour, minute, second )

returns a SAS datetime value for date, hour, minute, and second values.

HMS( hour, minute, second )

returns a SAS time value for hour, minute, and second values.
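A small sketch of how these constructor and extractor functions fit together is shown below. The variable names are illustrative, and MDY (described later in this list) is used only to build the starting date.

```sas
DATA _NULL_;
  d     = MDY(12, 31, 1999);       /* date value */
  dt    = DHMS(d, 13, 30, 15);     /* datetime value built from the date plus h, m, s */
  t     = HMS(13, 30, 15);         /* time value */
  dpart = DATEPART(dt);            /* back to a date value */
  dom   = DAY(d);                  /* day of the month */
  PUT d= DATE9. dt= DATETIME20. t= TIME8. dpart= DATE9. dom=;
RUN;
```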

HOLIDAY( ‘holiday‘, year )

returns a SAS date value for the holiday and year specified. Valid values for holiday are ‘BOXING’, ‘CANADA’, ‘CANADAOBSERVED’, ‘CHRISTMAS’, ‘COLUMBUS’, ‘EASTER’, ‘FATHERS’, ‘HALLOWEEN’, ‘LABOR’, ‘MLK’, ‘MEMORIAL’, ‘MOTHERS’, ‘NEWYEAR’, ‘THANKSGIVING’, ‘THANKSGIVINGCANADA’, ‘USINDEPENDENCE’, ‘USPRESIDENTS’, ‘VALENTINES’, ‘VETERANS’, ‘VETERANSUSG’, ‘VETERANSUSPS’, and ‘VICTORIA’. For example: EASTER2000 = HOLIDAY('EASTER', 2000);

HOUR( datetime )

returns the hour from a SAS datetime or time value.

INTCINDEX( ‘date-interval‘, date )

INTCINDEX( ‘datetime-interval‘, datetime )

returns the index of the seasonal cycle when given an interval and an appropriate SAS date, datetime, or time value. For example, the seasonal cycle for INTERVAL='DAY' is 'WEEK', so INTCINDEX('DAY','01SEP78'D); returns 35 because September 1, 1978, is the sixth day of the 35th week of the year. For correct results, date intervals should be used with date values, and datetime intervals should be used with datetime values.

INTCK( ‘date-interval‘, date1, date2 <, ‘method’> )

INTCK( ‘datetime-interval‘, datetime1, datetime2 <, ‘method’> )

returns the number of boundaries of intervals of the given kind that lie between the two date or datetime values. The optional method argument specifies that the intervals are counted using either a discrete or a continuous method. The default DISCRETE (or DISC or D) method uses discrete time intervals. For the DISCRETE method, the distance in MONTHS between January 31, 2000, and February 1, 2000, is one month. The CONTINUOUS (or CONT or C) method uses continuous time intervals. For the CONTINUOUS method, the distance in MONTHS between January 15, 2000, and February 14, 2000, is zero, but the distance in MONTHS between January 15, 2000, and February 15, 2000, is one month.
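The difference between the two counting methods can be tried with the dates quoted above. This is a minimal sketch: the variable names are illustrative, and it assumes a SAS release in which INTCK accepts the optional method argument.

```sas
DATA _NULL_;
  /* default DISCRETE method: counts the month boundary crossed between the dates */
  disc  = INTCK('MONTH', '31JAN2000'D, '01FEB2000'D);
  /* CONTINUOUS method: counts whole months measured from the start date */
  cont0 = INTCK('MONTH', '15JAN2000'D, '14FEB2000'D, 'C');
  cont1 = INTCK('MONTH', '15JAN2000'D, '15FEB2000'D, 'C');
  PUT disc= cont0= cont1=;
RUN;
```

According to the description above, the three results should be 1, 0, and 1.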

INTCYCLE( ‘interval‘ <, seasonality> )

returns the interval of the seasonal cycle, given a date, time, or datetime interval. For example, INTCYCLE(‘MONTH’) returns ‘YEAR’ because the months January, February, …, December constitute a yearly cycle. INTCYCLE(‘DAY’) returns ‘WEEK’ because Sunday, Monday, …, Saturday constitute a weekly cycle.

You can specify the optional seasonality argument to construct a cycle other than the default seasonal cycle. For example, INTCYCLE(‘MONTH’, 3) returns ‘QTR’. The optional second argument is the seasonal frequency.

INTFIT( date1, date2, ‘D‘ )

INTFIT( datetime1, datetime2, ‘DT‘ )

INTFIT( obs1, obs2, ‘OBS‘ )

returns an interval that fits exactly between two SAS date, datetime, or observation values. That is, if the interval result of the INTFIT function is used with date1, 1, and SAMEDAY alignment in the INTNX function, then the result is date2. This concept is illustrated in the following example, where result1 is the same as date1 and result2 is the same as date2.

   FitInterval = INTFIT( date1, date2, 'D' );
   result1 = INTNX( FitInterval, date1, 0, 'SAMEDAY');
   result2 = INTNX( FitInterval, date1, 1, 'SAMEDAY');

More than one interval can fit the preceding definition. For example, two SAS date values that are seven days apart could be fit with either ‘DAY7’ or ‘WEEK’. The INTFIT function chooses the more common interval, so ‘WEEK’ is the result when the dates are seven days apart. The INTFIT function can be used to detect the possible frequency of the time series or to analyze frequencies of other events in a time series, such as outliers or missing values.

INTFMT( ‘interval‘, ‘size‘ )

returns a recommended format when given a date, time, or datetime interval for displaying the time ID values associated with a time series of the given interval. The second argument to INTFMT controls the width of the year for date formats; it can take the value ‘long’ or ‘l’ to specify that the returned format display a four-digit year or the value ‘short’ or ‘s’ to specify that the returned format display a two-digit year.

INTGET( date1, date2, date3 )

INTGET( datetime1, datetime2, datetime3 )

returns an interval that fits three consecutive SAS date or datetime values. The INTGET function examines two intervals: the first interval between date1 and date2, and the second interval between date2 and date3. In order for an interval to be detected, either the two intervals must be the same or one interval must be an integer multiple of the other interval. That is, INTGET assumes that at least two of the dates are consecutive points in the time series, and that the other two dates are also consecutive or represent the points before and after missing observations. The INTGET function assumes that large values are SAS datetime values, which are measured in seconds, and that smaller values are SAS date values, which are measured in days. The INTGET function can be used to detect the possible frequency of the time series or to analyze frequencies of other events in a time series, such as outliers or missing values.

INTINDEX( ‘date-interval‘, date <, seasonality> )

INTINDEX( ‘datetime-interval‘, datetime <, seasonality> )

returns the seasonal index for the specified date or datetime interval and an appropriate date or datetime value. The seasonal index is a number that represents the position of the date or datetime value in the seasonal cycle of the specified interval. For example, INTINDEX('MONTH','01DEC2000'D); returns 12 because monthly data is yearly periodic and December is the 12th month of the year. However, INTINDEX('DAY','01DEC2000'D); returns 6 because daily data is weekly periodic and December 1, 2000, is a Friday, the sixth day of the week. To correctly identify the seasonal index, the interval specification should agree with the date or datetime value. For example, INTINDEX('DTMONTH','01DEC2000'D); and INTINDEX('MONTH','01DEC2000:00:00:00'DT); do not return the expected value of 12. However, both INTINDEX('MONTH','01DEC2000'D); and INTINDEX('DTMONTH','01DEC2000:00:00:00'DT); return the expected value of 12.

You can specify the optional seasonality argument to use a seasonal cycle other than the default seasonal cycle. For example, INTINDEX(’MONTH’,’01APR2000’D); returns the value 4, to indicate the fourth month of the year. However, INTINDEX(’MONTH’,’01APR2000’D,3); and INTINDEX(’MONTH’,’01APR2000’D,’QTR’); return the value 1 to indicate the first month of the quarter. Specifying either 3 or ‘QTR’ for the third argument uses a quarterly seasonal cycle instead of the default yearly seasonal cycle.
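The INTINDEX calls discussed above can be collected into one DATA step for experimentation. The variable names are illustrative, and the sketch assumes a SAS release that supports the optional seasonality argument.

```sas
DATA _NULL_;
  m_year = INTINDEX('MONTH',   '01DEC2000'D);            /* month within the year */
  d_week = INTINDEX('DAY',     '01DEC2000'D);            /* day within the week */
  dt_ok  = INTINDEX('DTMONTH', '01DEC2000:00:00:00'DT);  /* datetime interval with a datetime value */
  m_qtr  = INTINDEX('MONTH',   '01APR2000'D, 3);         /* month within the quarter */
  PUT m_year= d_week= dt_ok= m_qtr=;
RUN;
```

Per the text, the expected values are 12, 6, 12, and 1.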

INTNX( ‘date-interval‘, date, n <, ‘alignment’> )

INTNX( ‘datetime-interval‘, datetime, n <, ‘alignment’> )

returns the date or datetime value of the beginning of the interval that is n intervals from the interval that contains the given date or datetime value. The optional alignment argument specifies that the returned date is aligned to the beginning, middle, or end of the interval. Beginning is the default. In addition, you can specify SAME (S) alignment. The SAME alignment bases the alignment of the calculated date or datetime value on the alignment of the input date or datetime value. As illustrated in the following example, the SAME alignment can be used to calculate the meaning of “same day next year” or “same day two weeks from now.”

   nextYear = INTNX( 'YEAR', '15Apr2007'D, 1, 'S' );
   TwoWeeks = INTNX( 'WEEK', '15Apr2007'D, 2, 'S' );

The preceding example returns ’15Apr2008’D for nextYear and ’29Apr2007’D for TwoWeeks.

For all values of alignment, the number of discrete intervals n between the input date and the resulting date agrees with the input value. In the following example, the result is always that n2 = n1:

   date2 = INTNX( interval, date1, n1, align );
   n2 = INTCK( interval, date1, date2 );

The preceding example uses the DISCRETE method of the INTCK function by default. The result n2 = n1 does not always apply when the CONTINUOUS method of the INTCK function is specified.

INTSEAS( ‘interval‘ <, seasonality> )

returns the length of the seasonal cycle for the specified date or datetime interval. The length of a seasonal cycle is the number of intervals in a seasonal cycle. For example, when the interval for a time series is described as monthly, many procedures use the option INTERVAL=MONTH to indicate that each observation in the data corresponds to a particular month. Monthly data are considered to be periodic for a one-year seasonal cycle. There are 12 months in one year, so the number of intervals (months) in a seasonal cycle (year) is 12. For quarterly data, there are 4 quarters in one year, so the number of intervals in a seasonal cycle is 4. The periodicity is not always one year. For example, INTERVAL=DAY is considered to have a seasonal cycle of one week, and because there are 7 days in a week, the number of intervals in a seasonal cycle is 7.

You can specify the optional seasonality argument to use a seasonal cycle other than the default seasonal cycle. For example, INTSEAS(‘MONTH’, 3) and INTSEAS(‘MONTH’, ‘QTR’) both specify a quarterly seasonal cycle and return the value 3. If the optional seasonality argument is numeric, it is the seasonal frequency. If the optional seasonality argument is character, it is the seasonal cycle.
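A short sketch that prints the default seasonal cycle and its length for a few intervals follows. It pairs INTSEAS with INTCYCLE from earlier in this list; the variable names are illustrative, and the optional seasonality argument is assumed to be available in the installed release.

```sas
DATA _NULL_;
  LENGTH interval $ 5 cycle $ 8;
  DO interval = 'MONTH', 'QTR', 'DAY';
    cycle  = INTCYCLE(interval);   /* name of the default seasonal cycle */
    season = INTSEAS(interval);    /* number of intervals in that cycle */
    PUT interval= cycle= season=;
  END;
  qtr_len = INTSEAS('MONTH', 'QTR');   /* months in a quarterly cycle */
  PUT qtr_len=;
RUN;
```

Following the descriptions above, MONTH should report a YEAR cycle of length 12, DAY a WEEK cycle of length 7, and the quarterly cycle for MONTH a length of 3.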

INTSHIFT( ‘interval‘ )

returns the shift interval that applies to the shift index if a subperiod is specified. For example, YEAR intervals are shifted by MONTH, so INTSHIFT(‘YEAR’) returns ‘MONTH’.

INTTEST( ‘interval‘ )

returns 1 if the interval name is valid, 0 otherwise. For example, VALID = INTTEST(’MONTH’); should set VALID to 1, while VALID = INTTEST(’NOTANINTERVAL’); should set VALID to 0. The INTTEST function can be useful in verifying which values of multiplier n and the shift index s are valid in constructing an interval name.

JULDATE( date )

returns the Julian date from a SAS date value. The format of the Julian date is either yyddd or yyyyddd depending on the value of the system option YEARCUTOFF=. For example, using the default system option values, JULDATE( ’31DEC1999’D ); returns 99365, while JULDATE( ’31DEC1899’D ); returns 1899365.
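A round trip between the two Julian date functions can be sketched as below. The variable names are illustrative, and the digits printed for j depend on the YEARCUTOFF= setting of the session, as noted above.

```sas
DATA _NULL_;
  d = DATEJUL(1999365);   /* Julian date in yyyyddd form to a SAS date value */
  j = JULDATE(d);         /* SAS date value back to a Julian date */
  PUT d= DATE9. j=;
RUN;
```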

MDY( month, day, year )

returns a SAS date value for month, day, and year values.

MINUTE( datetime )

returns the minute from a SAS time or datetime value.

MONTH( date )

returns the numerical value for the month of the year from a SAS date value. For example, MONTH=MONTH('01JAN2000'D); returns 1, the numerical value for January.

NWKDOM( n, weekday, month, year )

returns a SAS date value for the nth weekday of the month and year specified. For example, Thanksgiving is always the fourth (n=4) Thursday (weekday=5) in November (month=11). Thus THANKS2000 = NWKDOM( 4, 5, 11, 2000); returns the SAS date value for Thanksgiving in the year 2000. The last weekday of a month can be specified by using n=5. Memorial Day in the United States is the last (n=5) Monday (weekday=2) in May (month=5), and so MEMORIAL2002 = NWKDOM( 5, 2, 5, 2002); returns the SAS date value for Memorial Day in 2002. Because n=5 always specifies the last occurrence in the month and most months have only 4 instances of each day, the result for n=5 is often the same as the result for n=4. NWKDOM is useful for calculating the SAS date values of holidays that are defined in this manner.
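The two NWKDOM calls from the paragraph above can be run and cross-checked against the HOLIDAY function described earlier. This is only a sketch; the variable names are illustrative.

```sas
DATA _NULL_;
  thanks2000   = NWKDOM(4, 5, 11, 2000);   /* 4th Thursday in November 2000 */
  memorial2002 = NWKDOM(5, 2, 5, 2002);    /* last Monday in May 2002 */
  /* 1 if both functions agree on the date, 0 otherwise */
  same_thanks  = (thanks2000 = HOLIDAY('THANKSGIVING', 2000));
  PUT thanks2000= DATE9. memorial2002= DATE9. same_thanks=;
RUN;
```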

QTR( date )

returns the quarter of the year from a SAS date value.

SECOND( datetime )

returns the second from a SAS time or datetime value.

TIME()

returns the current time of day.

TIMEPART( datetime )

returns the time part of a SAS datetime value.

TODAY()

returns the current date as a SAS date value. (TODAY is another name for the DATE function.)

WEEK( date <, ‘descriptor’> )

returns the week of year from a SAS date value. The algorithm used to calculate the week depends on the descriptor, which can take the value ‘U’, ‘V’, or ‘W’.

If the descriptor is ‘U’, weeks start on Sunday and the range is 0 to 53. If weeks 0 and 53 exist, they are only partial weeks. Week 52 can be a partial week.

If the descriptor is ‘V’, the result is equivalent to the ISO 8601 week of year definition. The range is 1 to 53. Week 53 is a leap week. The first week of the year, Week 1, and the last week of the year, Week 52 or 53, can include days in another Gregorian calendar year.

If the descriptor is ‘W’, weeks start on Monday and the range is 0 to 53. If weeks 0 and 53 exist, they are only partial weeks. Week 52 can be a partial week.
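To see how the descriptor changes the result, the following sketch evaluates the same date under all three algorithms. The date is an arbitrary choice near a year boundary, where the algorithms are most likely to disagree, and the variable names are illustrative.

```sas
DATA _NULL_;
  d = '01JAN2000'D;     /* a date at the start of a calendar year */
  u = WEEK(d, 'U');     /* weeks start on Sunday */
  v = WEEK(d, 'V');     /* ISO 8601 week of year */
  w = WEEK(d, 'W');     /* weeks start on Monday */
  PUT d= DATE9. u= v= w=;
RUN;
```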

WEEKDAY( date )

returns the day of the week from a SAS date value. For example, WEEKDAY=WEEKDAY('17OCT1991'D); returns 5, the numerical value for Thursday.

YEAR( date )

returns the year from a SAS date value.

YYQ( year, quarter )

returns a SAS date value for year and quarter values.

SAS FUNCTIONS


Arithmetic Functions

ABS(argument)

returns absolute value

DIM<n>(array-name)

returns the number of elements in a one-dimensional array or the number of elements in a specified dimension of a multidimensional array.

n specifies the dimension, in a multidimensional array, for which you want to know the number of elements.

DIM(array-name,bound-n)

returns the number of elements in a one-dimensional array or the number of elements in the specified dimension of a multidimensional array

bound-n specifies the dimension in a multidimensional array, for which you want to know the number of elements.

HBOUND<n>(array-name)

returns the upper bound of an array

HBOUND(array-name,bound-n)

returns the upper bound of an array

LBOUND<n>(array-name)

returns the lower bound of an array

LBOUND(array-name,bound-n)

returns the lower bound of an array
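The two calling forms of the array functions can be compared on a small two-dimensional array. The array name score, its bounds, and the variables s1-s6 are assumptions chosen for this sketch.

```sas
DATA _NULL_;
  ARRAY score{2, 3:5} s1-s6;   /* dimension 1 runs 1..2, dimension 2 runs 3..5 */
  n1  = DIM(score);            /* elements in the first dimension: 2 */
  n2  = DIM2(score);           /* elements in the second dimension: 3 */
  n2b = DIM(score, 2);         /* same as DIM2(score) */
  lo  = LBOUND2(score);        /* lower bound of the second dimension: 3 */
  hi  = HBOUND(score, 2);      /* upper bound of the second dimension: 5 */
  PUT n1= n2= n2b= lo= hi=;
RUN;
```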

MAX(argument,argument, …)

returns the largest value of the numeric arguments

MIN(argument,argument, …)

returns the smallest value of the numeric arguments

MOD(argument-1, argument-2)

returns the remainder

SIGN(argument)

returns the sign of a value or 0

SQRT(argument)

returns the square root

Character Functions

BYTE(n)

returns one character in the ASCII or EBCDIC collating sequence, where n is an integer representing a specific ASCII or EBCDIC character

COLLATE(start-position<,end-position>) | (start-position<,,length>)

returns an ASCII or EBCDIC collating sequence character string

COMPBL(source)

removes multiple blanks between words in a character string

COMPRESS(source<,characters-to-remove>)

removes specific characters from a character string

DEQUOTE(argument)

removes quotation marks from a character value

INDEX(source,excerpt)

searches the source for the character string specified by the excerpt

INDEXC(source,excerpt-1<, … excerpt-n>)

searches the source for any character present in the excerpt

INDEXW(source,excerpt)

searches the source for a specified pattern as a word

LEFT(argument)

left-aligns a SAS character string

LENGTH(argument)

returns the length of an argument

LOWCASE(argument)

converts all letters in an argument to lowercase

QUOTE(argument)

adds double quotation marks to a character value

RANK(x)

returns the position of a character in the ASCII or EBCDIC collating sequence

REPEAT(argument,n)

repeats a character expression

REVERSE(argument)

reverses a character expression

RIGHT(argument)

right-aligns a character expression

SCAN(argument,n<,delimiters>)

returns a given word from a character expression

SOUNDEX(argument)

encodes a string to facilitate searching

SUBSTR(argument,position<,n>)=characters-to-replace

replaces character value contents

var=SUBSTR(argument,position<,n>)

extracts a substring from an argument. (var is any valid SAS variable name.)

TRANSLATE(source,to-1,from-1<,…to-n,from-n>)

replaces specific characters in a character expression

TRANWRD(source,target,replacement)

replaces or removes all occurrences of a word in a character string

TRIM(argument)

removes trailing blanks from character expression and returns one blank if the expression is missing

TRIMN(argument)

removes trailing blanks from character expressions and returns a null string if the expression is missing

UPCASE(argument)

converts all letters in an argument to uppercase

VERIFY(source,excerpt-1<,…excerpt-n>)

returns the position of the first character in the source that is not present in any excerpt
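Several of the character functions above are often chained. The sketch below tidies a title string and then picks it apart; the sample string and all variable names are made up for illustration.

```sas
DATA _NULL_;
  LENGTH title tidy swapped $ 40 word2 upper $ 10 first5 $ 5;
  title   = '  Still   Muddling, Not Yet Through ';
  tidy    = COMPBL(LEFT(title));                 /* left-align, then collapse repeated blanks */
  word2   = SCAN(tidy, 2, ' ,');                 /* second word, splitting on blank or comma */
  pos     = INDEX(tidy, 'Not');                  /* position where 'Not' starts */
  upper   = UPCASE(word2);
  first5  = SUBSTR(tidy, 1, 5);                  /* first five characters */
  swapped = TRANWRD(tidy, 'Not Yet', 'Almost');  /* replace a phrase */
  PUT tidy= / word2= upper= first5= pos= / swapped=;
RUN;
```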

Date and Time Functions

DATDIF(sdate,edate,basis)

returns the number of days between two dates

DATE()

returns the current date as a SAS date value

DATEJUL(julian-date)

converts a Julian date to a SAS date value

DATEPART(datetime)

extracts the date from a SAS datetime value

DATETIME()

returns the current date and time of day

DAY(date)

returns the day of the month from a SAS date value

DHMS(date,hour,minute,second)

returns a SAS datetime value from date, hour, minute, and second

HMS(hour,minute,second)

returns a SAS time value from hour, minute, and second

HOUR(<time | datetime>)

returns the hour from a SAS time or datetime value

INTCK(‘interval’,from,to)

returns the number of time intervals in a given time span

INTNX(‘interval’,start-from,increment<,’alignment’>)

advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value

JULDATE(date)

returns the Julian date from a SAS date value

MDY(month,day,year)

returns a SAS date value from month, day, and year values

MINUTE(time | datetime)

returns the minute from a SAS time or datetime value

MONTH(date)

returns the month from a SAS date value

QTR(date)

returns the quarter of the year from a SAS date value

SECOND(time | datetime)

returns the second from a SAS time or datetime value

TIME()

returns the current time of day

TIMEPART(datetime)

extracts a time value from a SAS datetime value

TODAY()

returns the current date as a SAS date value

WEEKDAY(date)

returns the day of the week from a SAS date value

YEAR(date)

returns the year from a SAS date value

YRDIF(sdate,edate,basis)

returns the difference in years between two dates

YYQ(year,quarter)

returns a SAS date value from the year and quarter
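DATDIF and YRDIF take a day-count basis as their third argument. The sketch below compares two commonly used bases on one pair of dates; the dates and variable names are illustrative, and 'ACT/ACT' and '30/360' are the basis values assumed here.

```sas
DATA _NULL_;
  sdate  = '15JAN2000'D;
  edate  = '15FEB2000'D;
  actual = DATDIF(sdate, edate, 'ACT/ACT');   /* actual number of days */
  thirty = DATDIF(sdate, edate, '30/360');    /* 30-day months, 360-day year */
  years  = YRDIF(sdate, edate, 'ACT/ACT');    /* fraction of a year */
  months = INTCK('MONTH', sdate, edate);      /* month boundaries crossed */
  PUT actual= thirty= years= months=;
RUN;
```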

Mathematical Functions

AIRY(x)

returns the value of the AIRY function

DAIRY(x)

returns the derivative of the AIRY function

DIGAMMA(argument)

returns the value of the DIGAMMA function

ERF(argument)

returns the value of the (normal) error function

ERFC(argument)

returns the value of the complementary (normal) error function

EXP(argument)

returns the value of the exponential function

GAMMA(argument)

returns the value of the GAMMA function

IBESSEL(nu,x,kode)

returns the value of the modified Bessel function

JBESSEL(nu,x)

returns the value of the Bessel function

LGAMMA(argument)

returns the natural logarithm of the GAMMA function

LOG(argument)

returns the natural (base e) logarithm

LOG2(argument)

returns the logarithm to the base 2

LOG10(argument)

returns the logarithm to the base 10

TRIGAMMA(argument)

returns the value of the TRIGAMMA function
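A few identities make the mathematical functions easy to sanity-check. The sketch below uses illustrative variable names and relies only on functions listed above.

```sas
DATA _NULL_;
  e     = EXP(1);
  ln_e  = LOG(e);                  /* natural log undoes EXP */
  l2    = LOG2(8);
  l10   = LOG10(1000);
  g5    = GAMMA(5);                /* GAMMA(n) = (n-1)! for positive integers */
  lg5   = LGAMMA(5);               /* should agree with LOG(g5) */
  total = ERF(0.5) + ERFC(0.5);    /* ERFC(x) = 1 - ERF(x) */
  PUT e= ln_e= l2= l10= g5= lg5= total=;
RUN;
```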

Noncentrality Functions

CNONCT(x,df,prob)

returns the noncentrality parameter from a chi-squared distribution

FNONCT(x,ndf,ddf,prob)

returns the value of the noncentrality parameter of an F distribution

TNONCT(x,df,prob)

returns the value of the noncentrality parameter from the Student’s t distribution

Probability and Density Functions

CDF(‘dist’,quantile,parm-1,…,parm-k)

computes cumulative distribution functions

LOGPDF|LOGPMF(‘dist’,quantile,parm-1,…,parm-k)

computes the logarithm of a probability density (mass) function. The two functions are identical.

LOGSDF(‘dist’,quantile,parm-1,…,parm-k)

computes the logarithm of a survival function

PDF|PMF(‘dist’,quantile,parm-1,…,parm-k)

computes probability density (mass) functions

POISSON(m,n)

returns the probability from a POISSON distribution

PROBBETA(x,a,b)

returns the probability from a beta distribution

PROBBNML(p,n,m)

returns the probability from a binomial distribution

PROBCHI(x,df<,nc>)

returns the probability from a chi-squared distribution

PROBF(x,ndf,ddf<,nc>)

returns the probability from an F distribution

PROBGAM(x,a)

returns the probability from a gamma distribution

PROBHYPR(N,K,n,x<,r>)

returns the probability from a hypergeometric distribution

PROBMC

returns probabilities and critical values (quantiles) from various distributions for multiple comparisons of the means of several groups

PROBNEGB(p,n,m)

returns the probability from a negative binomial distribution

PROBBNRM(x,y,r)

returns the probability from the standardized bivariate normal distribution

PROBNORM(x)

returns the probability from the standard normal distribution

PROBT(x,df<,nc>)

returns the probability from a Student’s t distribution

SDF(‘dist’,quantile,parm-1,…,parm-k)

computes a survival function
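The distribution-specific PROB functions and the generic CDF, PDF, and SDF functions overlap, which the following sketch tries to show for the standard normal distribution. Variable names and argument values are illustrative.

```sas
DATA _NULL_;
  p_old = PROBNORM(1.96);          /* P(Z <= 1.96) */
  p_new = CDF('NORMAL', 1.96);     /* same probability through the generic CDF */
  upper = SDF('NORMAL', 1.96);     /* survival function: 1 - CDF */
  dens  = PDF('NORMAL', 0);        /* density at the mean */
  p_t   = PROBT(2.0, 10);          /* Student's t with 10 degrees of freedom */
  p_bin = PROBBNML(0.5, 10, 3);    /* P(X <= 3) for a binomial with p = 0.5, n = 10 */
  PUT p_old= p_new= upper= dens= p_t= p_bin=;
RUN;
```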

Quantile Functions

BETAINV(p,a,b)

returns a quantile from the beta distribution

CINV(p,df<,nc>)

returns a quantile from the chi-squared distribution

FINV(p,ndf,ddf<,nc>)

returns a quantile from the F distribution

GAMINV(p,a)

returns a quantile from the gamma distribution

PROBIT(p)

returns a quantile from the standard normal distribution

TINV(p,df<,nc>)

returns a quantile from the t distribution
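Quantile functions invert the probability functions. A minimal sketch with illustrative variable names and probability levels:

```sas
DATA _NULL_;
  z    = PROBIT(0.975);        /* standard normal quantile */
  t    = TINV(0.975, 10);      /* t quantile, 10 degrees of freedom */
  c    = CINV(0.95, 1);        /* chi-squared quantile, 1 degree of freedom */
  f    = FINV(0.95, 2, 20);    /* F quantile, 2 and 20 degrees of freedom */
  back = PROBNORM(z);          /* feeding z back should recover 0.975 */
  PUT z= t= c= f= back=;
RUN;
```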

Sample Statistics Functions

CSS(argument,argument,…)

returns the corrected sum of squares

CV(argument,argument,…)

returns the coefficient of variation

KURTOSIS(argument,argument,…)

returns the kurtosis (or 4th moment)

MAX(argument,argument, …)

returns the largest value

MIN(argument,argument, …)

returns the smallest value

MEAN(argument,argument, …)

returns the arithmetic mean (average)

MISSING(numeric-expression | character-expression)

returns a numeric result that indicates whether the argument contains a missing value

N(argument,argument, …)

returns the number of nonmissing values

NMISS(argument,argument, …)

returns the number of missing values

ORDINAL(count,argument,argument,…)

returns the count-th smallest value among the arguments (an order statistic)

RANGE(argument,argument,…)

returns the range of values

SKEWNESS(argument,argument,argument,…)

returns the skewness

STD(argument,argument,…)

returns the standard deviation

STDERR(argument,argument,…)

returns the standard error of the mean

SUM(argument,argument,…)

returns the sum

USS(argument,argument,…)

returns the uncorrected sum of squares

VAR(argument,argument,…)

returns the variance
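The sample statistics functions work across the arguments of a single observation and skip missing values. The sketch below uses four made-up variables, one of them missing, to show the effect.

```sas
DATA _NULL_;
  x1 = 89; x2 = 95; x3 = 100; x4 = .;
  n      = N(OF x1-x4);             /* nonmissing values: 3 */
  nmiss  = NMISS(OF x1-x4);         /* missing values: 1 */
  total  = SUM(OF x1-x4);           /* 284, the missing value is ignored */
  avg    = MEAN(OF x1-x4);
  spread = RANGE(OF x1-x4);         /* 100 - 89 = 11 */
  sd     = STD(OF x1-x4);
  second = ORDINAL(2, x1, x2, x3);  /* second smallest of the three values: 95 */
  PUT n= nmiss= total= avg= spread= sd= second=;
RUN;
```

ORDINAL is called here without x4 because it counts missing values as the smallest values in the list.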

State and ZIP Code Functions

FIPNAME(expression)

converts FIPS codes to uppercase state names

FIPNAMEL(expression)

converts FIPS codes to mixed case state names

FIPSTATE(expression)

converts FIPS codes to two-character postal codes

STFIPS(postal-code)

converts state postal codes to FIPS state codes

STNAME(postal-code)

converts state postal codes to uppercase state names

Tip:

For Version 6, the maximum length of the value that is returned is 200 characters. For Version 7 and beyond, the maximum length is 20 characters.

STNAMEL(postal-code)

converts state postal codes to mixed case state names

Tip:

For Version 6, the maximum length of the value that is returned is 200 characters. For Version 7 and beyond, the maximum length is 20 characters.

ZIPFIPS(zip-code)

converts ZIP codes to FIPS state codes

ZIPNAME(zip-code)

converts ZIP codes to uppercase state names

ZIPNAMEL(zip-code)

converts ZIP codes to mixed case state names

ZIPSTATE(zip-code)

converts ZIP codes to state postal codes
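The conversions between FIPS codes, postal codes, and ZIP codes can be chained as in the sketch below. The postal code 'NC' and the ZIP code are arbitrary sample inputs, and the variable names are illustrative.

```sas
DATA _NULL_;
  LENGTH upper mixed $ 20 postal fromzip $ 2;
  fips    = STFIPS('NC');        /* postal code to FIPS state code */
  upper   = FIPNAME(fips);       /* FIPS code to uppercase state name */
  mixed   = FIPNAMEL(fips);      /* FIPS code to mixed case state name */
  postal  = FIPSTATE(fips);      /* FIPS code back to the postal code */
  fromzip = ZIPSTATE('27513');   /* ZIP code to postal code */
  PUT fips= upper= mixed= postal= fromzip=;
RUN;
```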

Trigonometric and Hyperbolic Functions

ARCOS(argument)

returns the arccosine

ARSIN(argument)

returns the arcsine

ATAN(argument)

returns the arctangent

COS(argument)

returns the cosine

COSH(argument)

returns the hyperbolic cosine

SIN(argument)

returns the sine

SINH(argument)

returns the hyperbolic sine

TAN(argument)

returns the tangent

TANH(argument)

returns the hyperbolic tangent

Truncation Functions

CEIL(argument)

returns the smallest integer that is greater than or equal to the argument

FLOOR(argument)

returns the largest integer that is less than or equal to the argument

FUZZ(argument)

returns the nearest integer if the argument is within 1E-12

INT(argument)

returns the integer value

ROUND(argument,round-off-unit)

rounds to the nearest round-off unit

TRUNC(number, length)

truncates a numeric value to a specified length
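The truncation functions differ mainly in how they treat negative values, which a single negative input makes visible. The input value and variable names are illustrative.

```sas
DATA _NULL_;
  x       = -2.7;
  up      = CEIL(x);       /* smallest integer >= x: -2 */
  down    = FLOOR(x);      /* largest integer <= x: -3 */
  whole   = INT(x);        /* drops the fractional part: -2 */
  nearest = ROUND(x, 1);   /* nearest multiple of 1: -3 */
  PUT x= up= down= whole= nearest=;
RUN;
```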

Variable Information Functions

GETVARC(data-set-id,var-num)

returns the value of a SAS data set character variable

GETVARN(data-set-id,var-num)

returns the value of a SAS data set numeric variable

VARFMT(data-set-id,var-num)

returns the format assigned to a SAS data set variable

VARINFMT(data-set-id,var-num)

returns the informat assigned to a SAS data set variable

VARLABEL(data-set-id,var-num)

returns the label assigned to a SAS data set variable

VARLEN(data-set-id,var-num)

returns the length of a SAS data set variable

VARNAME(data-set-id,var-num)

returns the name of a SAS data set variable

VARNUM(data-set-id,var-name)

returns the number of a SAS data set variable’s position in a SAS data set

VARRAY(name)

returns a value that indicates whether the specified name is an array

VARRAYX(expression)

returns a value that indicates whether the value of the specified argument is an array

VARTYPE(data-set-id,var-num)

returns the data type of a SAS data set variable

VFORMAT(var)

returns the format associated with the given variable

VFORMATD(var)

returns the format decimal value associated with the given variable

VFORMATDX(expression)

returns the format decimal value associated with the value of the specified argument

VFORMATN(var)

returns the format name associated with the given variable

VFORMATNX(expression)

returns the format name associated with the value of the specified argument

VFORMATW(var)

returns the format width associated with the given variable

VFORMATWX(expression)

returns the format width associated with the value of the specified argument

VFORMATX(expression)

returns the format associated with the value of the specified argument

VINARRAY(var)

returns a value that indicates whether the given variable is a member of an array

VINARRAYX(expression)

returns a value that indicates whether the value of the specified argument is a member of an array

VINFORMAT(var)

returns the informat associated with the given variable

VINFORMATD(var)

returns the informat decimal value associated with the given variable

VINFORMATDX(expression)

returns the informat decimal value associated with the value of the specified argument

VINFORMATN(var)

returns the informat name associated with the given variable

VINFORMATNX(expression)

returns the informat name associated with the value of the specified argument

VINFORMATW(var)

returns the informat width associated with the given variable

VINFORMATWX(expression)

returns the informat width associated with the value of the specified argument

VINFORMATX(expression)

returns the informat associated with the value of the specified argument

VLABEL(var)

returns the label associated with the given variable

VLABELX(expression)

returns the variable label for the value of a specified argument

VLENGTH(var)

returns the compile-time (allocated) size of the given variable

VLENGTHX(expression)

returns the compile-time (allocated) size for the value of the specified argument

VNAME(var)

returns the name of the given variable

VNAMEX(expression)

validates the value of the specified argument as a variable name

VTYPE(var)

returns the type (character or numeric) of the given variable

VTYPEX(expression)

returns the type (character or numeric) for the value of the specified argument
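
The variable information functions fall into two groups: those that take a data set identifier (obtained with the OPEN function) and the V-prefixed ones that describe variables in the current DATA step. The sketch below shows a few from each group. It assumes the SASHELP.CLASS sample data set is available, and all variable names are made up for the example.

```sas
/* Sketch 1: functions that take a data-set-id.                  */
/* Assumes SASHELP.CLASS exists and contains a variable Age.     */
data _null_;
   length var_name $32 var_type $1;
   dsid = open('sashelp.class');
   if dsid > 0 then do;
      pos = varnum(dsid, 'Age');          /* position, 0 if absent  */
      if pos > 0 then do;
         var_name = varname(dsid, pos);
         var_type = vartype(dsid, pos);   /* 'N' or 'C'             */
         var_len  = varlen(dsid, pos);
         rc = fetch(dsid);                /* read first observation */
         if rc = 0 then first_age = getvarn(dsid, pos);
         put pos= var_name= var_type= var_len= first_age=;
      end;
      rc = close(dsid);
   end;
run;

/* Sketch 2: V-functions describing DATA step variables.         */
data _null_;
   length city $20 type_c type_n $1 fmt $32 lbl $40 elem_name $32;
   format amount dollar12.2;
   label amount = 'Order amount';
   array nums{2} amount qty;
   city = 'Pune';
   amount = 1500;
   qty = 3;
   type_c    = vtype(city);        /* C                            */
   type_n    = vtype(amount);      /* N                            */
   len_city  = vlength(city);      /* 20                           */
   fmt       = vformat(amount);    /* format with width and decimals */
   fmt_w     = vformatw(amount);   /* format width                 */
   fmt_d     = vformatd(amount);   /* format decimals              */
   lbl       = vlabel(amount);
   elem_name = vname(nums{2});     /* name behind an array element */
   in_array  = vinarray(qty);      /* 1 if qty belongs to an array */
   is_array  = varray(nums);       /* 1 if nums is an array        */
   put type_c= type_n= len_city= fmt= fmt_w= fmt_d=;
   put lbl= elem_name= in_array= is_array=;
run;
```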

 

Regards,

SAS INDIA

Bill Gates is putting his own money into a small impact-investing fund focused on India


Bill Gates arguably ushered in the current golden age of philanthropy. Now the world’s richest man has endorsed another way to put money to work for good: impact investing.

It’s safe to say Gates doesn’t need the market-rate returns expected from his new venture-capital investment in Unitus Seed Fund. His modest commitment closes a $20 million US-Indian fund through which Unitus has taken early stakes in more than a dozen for-profit startups providing health, education and livelihoods for Indian families living on less than $10 a day. (Impact investments are intended to generate, and to measure and report, social and environmental impact alongside a financial return.)

The investment puts Gates on one side of a debate that has divided his fellow tech titans and billionaires, and now apparently separates the Microsoft co-founder from his friend Warren Buffett. Should private investors back businesses with explicit social and environmental missions and metrics?

Buffett and Gates are co-founders of the Giving Pledge, which has signed up more than 125 billionaires to give away at least half of their fortunes. But Buffett has favored the traditional separation of business and charity. “I think it’s tough to serve two masters,” he told a conference last year. “I would rather have the investment produce the capital and then have an organization totally focused on the philanthropic aspects.”

Marc Andreessen, the Silicon Valley venture capitalist who challenged Gates in the Netscape-Microsoft Web browser wars of the 1990s, has also been critical of the idea. Two years ago, Andreessen said “I would run screaming from a B Corp,” or for-benefit company that adopts explicit social goals, which he said are distractions for startups.

“The split model makes me nervous and I don’t think we would ever touch that,” Andreessen said on a panel. “It’s like a houseboat. It’s not a great house and not a great boat.”

Will Poole, co-founder and managing director of Unitus, recently spent several days on a houseboat in Kerala, in India’s southwest. “It was an excellent boat and a fine house,” Poole said, “and we provided local economic development at the same time.”

Unitus, with headquarters in Seattle and Bangalore, in its first year has made 14 investments of generally between $100,000 and $250,000 in companies such as Smile Merchants, which operates low-income dental clinics near Mumbai, and Hippocampus Learning Centers, a network of private kindergartens serving more than 6,000 rural and low-income students.

Poole, who spent 13 years at Microsoft, got Gates on board after gaining commitments from other high- and ultra-high-net-worth investors. Unitus has attracted 15 Indian nationals and more than a dozen non-resident Indians, including venture capitalist Vinod Khosla; Romesh Wadhwani, founder of Aspect Development; and Steve Singh, CEO of Concur Technologies, which SAP recently agreed to buy for more than $8 billion.

Gates’ investment in Unitus comes out of a personal fund, not from the Gates Foundation, which has set aside $1 billion to provide equity, loans and loan guarantees to for-profit companies. Gates personally has made a number of food and energy investments, including Hampton Creek Foods and EcoMotors, a low-emission engine maker. But Unitus is apparently his first investment in a fund or company targeting so-called base-of-the-pyramid customers, the poor in the developing world.

“Impact investing is a powerful model with the potential to build markets and drive change for the people who need it most,” Gates said in a statement confirming the investment in Unitus.

Some of the naysayers may be softening their positions. Andreessen’s wife, Laura Arrillaga-Andreessen, is the founder of a social venture-capital firm and earlier this year, Andreessen’s firm, Andreessen Horowitz, invested in AltSchool, a network of micro-schools offering personalized education for children…that is in the process of becoming a certified B Corp.

SAS Global Certification Program


SAS Certifications

 

Regards,

SAS INDIA

MISSING(), NMISS() and the CMISS() functions


SAS provides several functions to test for missing values; this post focuses on MISSING(), CMISS(), and NMISS(). The NMISS() function is reserved for numeric variables, while MISSING() and CMISS() can be used with either character or numeric variables. CMISS() and NMISS() count the number of arguments that have missing values, whereas MISSING() checks whether a single variable is missing. Together they offer a simple way to check for missing values, and they let you replace long IF statements with a single function call when you need to check several variables at once.

The MISSING() function is useful when you need to check whether a variable has a missing value but are not sure whether the variable is character or numeric. It works for both types, and it also recognizes the special numeric missing values (.A, .B, .C, ._, and so on). MISSING() returns a numeric result: 1 if the value is missing and 0 if it is present. In a condition, MISSING(varname) is the same as MISSING(varname)=1, and MISSING(varname)=0 tests that the value is present.

The MISSING function is particularly useful if you use special missing values since ‘if varname=.’ will not identify all missing values in such cases.

NOTE: SAS does not represent missing values the same way for every variable type. A single period (.) represents a numeric missing value. A blank enclosed in single or double quotes (' ' or " ") represents a character missing value. A single period followed by a letter or an underscore (for example .A, .B, .Z, ._) represents a special numeric missing value. Special missing values are available for numeric variables only.
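
The following sketch applies MISSING() to each kind of missing value described above and contrasts it with a plain comparison against a period. The variable names and values are made up for illustration.

```sas
/* Sketch: MISSING() on character, numeric and special missing values. */
data _null_;
   length name $8;
   name   = ' ';      /* character missing value          */
   age    = .;        /* ordinary numeric missing value   */
   score  = .A;       /* special numeric missing value    */
   weight = 60;       /* a value that is present          */
   m_name   = missing(name);     /* 1 */
   m_age    = missing(age);      /* 1 */
   m_score  = missing(score);    /* 1 */
   m_weight = missing(weight);   /* 0 */
   eq_dot   = (score = .);       /* 0: .A is not equal to .  */
   put m_name= m_age= m_score= m_weight= eq_dot=;
run;
```

The last assignment shows why a test such as score=. is not enough when special missing values are in use.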

The NMISS() function counts the arguments that have missing values in the specified list of numeric variables. It is useful, for example, when you want to make sure that at least one variable in the list is not missing: that is the case whenever the count is smaller than the number of arguments.
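
One possible way to check that at least one variable in a list is present is to compare the NMISS() count with the number of arguments. The variables q1 to q3 below are hypothetical survey responses.

```sas
/* Sketch: is at least one of q1-q3 non-missing?            */
data _null_;
   q1 = .;
   q2 = 4;
   q3 = .A;                          /* special missing value  */
   n_args    = 3;                    /* number of variables checked */
   n_missing = nmiss(of q1-q3);      /* 2 */
   if n_missing < n_args then put 'At least one of q1-q3 is present.';
   else put 'All of q1-q3 are missing.';
   put n_missing=;
run;
```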

The CMISS() function is available with SAS 9.2 and SAS Enterprise Guide 4.3 and is similar to the NMISS() function. The only difference is that it counts the number of missing arguments among both character and numeric variables.

To summarize the difference: NMISS accepts multiple numeric arguments and returns how many of them are missing, whereas MISSING accepts only one argument, which can be either numeric or character.
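
As a small illustration, CMISS() can count missing values across a mix of character and numeric variables in one call. The variable names and values here are invented for the example.

```sas
/* Sketch: CMISS() across character and numeric variables.  */
data _null_;
   length name $8 city $12;
   name   = 'Asha';
   city   = ' ';      /* character missing           */
   age    = .;        /* numeric missing             */
   income = .A;       /* special numeric missing     */
   n_missing = cmiss(name, city, age, income);   /* 3 */
   put n_missing=;
run;
```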

Examples:

* count the number of the variables A, B, and C which have missing values;

count=nmiss(A, B, C);

count=nmiss(of A B C);

* count the number of the variables from Var1 to Var10 which have missing values;

count=nmiss(of var1-var10);

Examples with their results:

x1=nmiss(1,0,.,2,5,.);   /* x1 = 2 */

x2=nmiss(1,0);           /* x2 = 0 */

x3=nmiss(of x1-x2);      /* x3 = 0, because x1 and x2 are both non-missing */
