summary stats for many vars

Ronein

Hello

Lets say that I want to run statistics on multiple variables.

Lets say that this code is working 100% and I get desired results.

My question-

Is there better way to write the code (using proc step instead of proc sql)?

Please note that in real word I run it on 300 fields and data set has 2 million rows


%macro dist(VAR,tbl,level,format);
%do j=1 %to 6;
%let version=%scan(&Vector_Versions.,&j.,+);

proc sql;
create table _dist_ as
select Left("&VAR.") as Var_name  LENGTH=300,
        put(&VAR.,&Format.) as Cat LENGTH=100,
		        count(*) as nr,
				calculated nr/(select count(*) as total_nr from  &tbl.) as PCT format=percent8.5
from &tbl.
group by  calculated CAT
union all
select  Left("&VAR.") as Var_name  LENGTH=300,
         'Total'  as CAT,
		 count(*) as nr,
		 1 as PCT format=percent8.5
from  &tbl.
;
quit;

proc sort data=_dist_;
by var_name level cat_order;
Run;

proc append data=_dist_ base=r_r.Accum_tbl force;quit;
%end;
%mend dist;
%dist(Var=Wealth,tbl=Y_ttt&version.,level=Y,Format=Fmt_1F.)
%dist(Var=LoansBalance ,tbl=Y_ttt&version.,level=Y,Format=Fmt_continuous.)
%dist(Var=AVG_PAYOFF_LAST_MON,tbl=Y_ttt&version.,level=Y,Format=Fmt_1F.)
%dist(Var=HIYUV_EXLISCAR_AMT,tbl=L_ttt&version.,level=L,Format=Fmt_1F.)


%macro dist(VAR,tbl,level,format);
%do j=1 %to 6;
%let version=%scan(&Vector_Versions.,&j.,+);

proc sql;
create table _dist_ as
select Left("&VAR.") as Var_name  LENGTH=300,
        put(&VAR.,&Format.) as Cat LENGTH=100,
		        count(*) as nr,
				calculated nr/(select count(*) as total_nr from  &tbl.) as PCT format=percent8.5
from &tbl.
group by  calculated CAT
union all
select  Left("&VAR.") as Var_name  LENGTH=300,
         'Total'  as CAT,
		 count(*) as nr,
		 1 as PCT format=percent8.5
from  &tbl.
;
quit;

proc sort data=_dist_;
by var_name level cat_order;
Run;

proc append data=_dist_ base=r_r.Accum_tbl force;quit;
%end;
%mend dist;
%dist(Var=Wealth,tbl=Y_ttt&version.,level=Y,Format=Fmt_1F.)
%dist(Var=LoansBalance ,tbl=Y_ttt&version.,level=Y,Format=Fmt_continuous.)
%dist(Var=AVG_PAYOFF_LAST_MON,tbl=Y_ttt&version.,level=Y,Format=Fmt_1F.)
%dist(Var=HIYUV_EXLISCAR_AMT,tbl=L_ttt&version.,level=L,Format=Fmt_1F.)

Tom

Here are two different ways using PROC SUMMARY or PROC FREQ to gather all of the counts you need in one pass through the data.

Let's use SASHELP.CLASS as our example dataset and get the counts for AGE and SEX variables.

%let varlist=age sex;

proc summary data=sashelp.class chartype ;
  class &varlist. / missing;
  types () &varlist. ;
  output out=summary;
run;

proc print data=summary;
run;

ods select none;
ods output onewayfreqs=freq;
proc freq data=sashelp.class ;
  tables &varlist. / missing ;
run;
ods select all;

proc print data=freq;
run;

You could then add a data step to collapse the multiple columns into one.

For example for the PROC SUMMARY output you could do something like this:

data _dist_;
  attrib Var_name  length=$300;
  attrib Cat length=$100;
	attrib nr length=8;
  attrib PCT length=8 format=percent8.5 ;
  set summary ;
  var_name=scan("Total &varlist",1+indexc(_type_,'1'),' ');
  Cat = vvaluex(var_name);
  nr = _freq_;
  retain Total;
  if var_name='Total' then total=_freq_;
  pct = nr / total;
  keep Var_name -- PCT;
run;

proc print data=_dist_;
run;

And for the PROC FREQ output something like this:

data _dist_;
  attrib Var_name  length=$300;
  attrib Cat length=$100;
	attrib nr length=8;
  attrib PCT length=8 format=percent8.5 ;
  set freq end=eof;
  var_name=substr(Table,7);
  Cat = vvaluex(cat('F_',var_name));
  nr = frequency;
  pct = percent/100;
  output;
  if eof then do;
    var_name='Total';
    cat=' ';
    nr=cumfrequency;
    pct=cumpercent/100;
    output;
  end;
  keep Var_name -- PCT;
run;

proc print data=_dist_;
run;

PS: Notice that PERCENT8.3 is too short to consistently display 3 decimal places. You forgot to leave space for ( ) that PERCENT format adds around negative values.

Ronein

The question is- about running time?
Which way do you think is lowest running time? Should have significant difference in run time?is proc sql have the history running time??

Tom

@Ronein wrote:
The question is- about running time?
Which way do you think is lowest running time? Should have significant difference in run time?is proc sql have the history running time??

Two points to consider.

The PROCs I mentions should be able process ALL of the variables of interest in one pass thru the dataset. Your method re-reads the dataset once for each variable.

SAS procs have been optimized for performance. It might be very hard for you to write code that is as efficient.

Kurt_Bremser

@Ronein wrote:
The question is- about running time?
Which way do you think is lowest running time? Should have significant difference in run time?is proc sql have the history running time??

Once again, Maxim 4: Try It.

But @Tom 's single step approach must be faster than your 300 SQL steps.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein

Please note thst in grouping I used different user defined formats .Did Tom used it?

Kurt_Bremser

So your question is: will the procedures use the formatted values for grouping?

Create a small example dataset and test this.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein

My question if better to group in data step and then used the grouped vars in summary proc?

or use the user defined formats on 300 variables in summary proc?

Tom

@Ronein wrote:
Please note thst in grouping I used different user defined formats .Did Tom used it?

If you need to attach formats to the variables that are different than the ones that are permanently attached to them in the dataset then you can add a FORMAT statement to the PROC step.

proc format ;
value age low-12 = '12 and under'
          13-high = '13 and over'
;
value $sex 'M'='Male' 'F'='Female';
run;

ballardw

@Ronein wrote:
The question is- about running time?
Which way do you think is lowest running time? Should have significant difference in run time?is proc sql have the history running time??

In general Proc Freq and Summary are both more efficient than Proc SQL. In many cases faster by order(s) of magnitude. Reloading the data set multiple times alone for each variable would take many times the execution of the other procedures. Since you want percentages and no other procedures than counts then Proc Freq is almost certainly the fastest solution UNLESS you are using multiple grouping variables in multiple combinations i.e. a count by region, polity, sex, ; region and sex; polity and sex. Doing multiples of these might make Proc Summary preferred as the default counting tool.

Consider:

proc summary data=sashelp.cars;
   class  make  type origin drivetrain enginesize cylinders;
   output out=work.carsummary ;
run;

If you look at the _type_ = 1 then _freq_ is total number of observations in the data set with the given number of cylinders in the data. _type_=3 then _freq_ is the number of cylinders for the given Engine size. _type_=5 is the number cylinders given the drive train. _type_=11 is the number of cylinders of the engine for engine size for given value of Origin. Many combinations represented and can be selected by _type_. And custom format for the _TYPE_ variable could make a very useful report tool. The order of variables on the CLASS statement determines the values of the combinations so if the Class statement is changed any dependencies of _type_ have to be considered for selection or reporting.

Different formats applied to different variables will not have an issue with Proc Freq or Summary. If you are using different grouping formats for the same variable that would likely be an issue. Since you don't show any of the custom formats you use in the code we really can't tell what the effects might be.

Ronein

I dont understand , In my code I used

format=percent8.5

What do you recommend me to change here?

Do you recommend me use percent8.5 or percentn8.5?

data test;
length format $50.;
length var $50.;
format='percentn8.5'; var=put(0.10412456435, percentn8.5); output;
format='percent8.5'; var=put(0.10412456435, percent8.5); output;
format='percentn8.5'; var=put(-0.10412456435, percentn8.5); output;
format='percent8.5'; var=put(-0.10412456435, percent8.5); output;
RUN;

Tom

@Ronein wrote:

I dont understand , In my code I used

format=percent8.5

What do you recommend me to change here?

Do you recommend me use percent8.5 or percentn8.5?

data test;
length format $50.;
length var $50.;
format='percentn8.5'; var=put(0.10412456435, percentn8.5); output;
format='percent8.5'; var=put(0.10412456435, percent8.5); output;
format='percentn8.5'; var=put(-0.10412456435, percentn8.5); output;
format='percent8.5'; var=put(-0.10412456435, percent8.5); output;
RUN;

The WIDTH value of the format specification is the total number of character. The DECIMAL value is the number of characters after the decimal place. The PERCENT format needs 2 character for the optional parentheses, one for the required % and one for the optional (depending on if decimal places are requested) decimal point. The percentages in your example program could range between 100% and 0%. So you need three digits to the left of the decimal place. That means you need a width of 12 to be able to have room for 5 decimal places. 4 + 5 + 3 = 12.

534  data _null_;
535    do p=1,.1,.01,.001,.0001,.00001;
536       put p best12. +1 p percent8.5 +1 p percent12.5;
537    end;
538  run;

           1  100.0%   100.00000%
         0.1  10.00%    10.00000%
        0.01  1.000%     1.00000%
       0.001  .1000%     0.10000%
      0.0001  .0100%     0.01000%
     0.00001  .0010%     0.00100%

Ronein

so i need format percent12.5?

data_null__

If you add the MLF option to the CLASS statement SAS will convert numeric variables (AGE) to character variables. This makes it easier to work with &VARLIST in the OUTPUT data. Can even make ARRAY _V &VARLIST;

Also, DESCENDTYPES will output data in VARLIST order.

proc summary data=sashelp.class chartype DESCENDTYPES;
  class &varlist. / missing MLF;
  types () &varlist. ;
  output out=summary;
run;

Kathryn_SAS

One small correction: the option is MLF (short for multilabel format).

summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

Re: summary stats for many vars

The 2025 SAS Hackathon Kicks Off on June 11!

SAS Training: Just a Click Away