Trap when merging, then redefining variable

BruceBrad · Posted 01-29-2025 06:40 AM

I've been using SAS for many years, but never came across this trap before. Surely it deserves a warning message in the log!

I can see now how the problem arises when I think through how SAS does merges (by retaining variables on the 1 side of a 1:n merge). The solution is to never redefine variables in your data step when merging (if the variable comes from the 1: side of the merge) - even if it might be convenient for doing things like converting cents to $.

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id;
cards;
1 
1
2
2
2
;
data out1;
merge file1 file2;
by id;
var = var/100;  /* this gets calculated as if var were retained for each by value */
run;
proc print;

ChrisNZ · Posted 01-29-2025 07:14 AM

The value of a variable is only reset to missing in the data step's PDV at the start of a data step if

- it is not a data set variable, and

- it is not a retained variable (RETAIN statement or increment syntax).

The value of a data set variable is refreshed in the PDV when a value is read from the data set.

When none of these condition are true (data set variable, but no value read for the data set), the value in the PDV remains unchanged.

People who don't know that have not been trained properly. Not blaming them, it's very common unfortunately.

The PDV logic also explains why a Cartesian product does not take place when merging data sets in a data step.

I'll take the opportunity to complain again that the message

 NOTE: MERGE statement has more than one data set with repeats of BY values.

should never have been a note, and should now be customisable to a WARNING or an ERROR, with an option similar to these:

  DKRICOND=ERROR    Specifies the error level to report when a variable is missing from an input data set during the processing of a 
                    DROP=, KEEP=, or RENAME= data set option.
  DKROCOND=WARN     Specifies the error level to report when a variable is missing from an output data set during the processing of 
                    a DROP=, KEEP=, or RENAME= data set option.

High-Performance SAS Coding - Third Edition

Tom · Posted 01-29-2025 08:55 AM

That is why 1 to many merges actually work.

Just do you calculations on the first observation of the by group.

data out1;
  merge file1 file2;
  by id;
  if first.id then var = var/100;  
run;

Astounding · Posted 01-29-2025 09:33 AM

Here are some ancient papers that discuss this and similar issues.

Here's an ancient paper that discusses your original post plus more:

How MERGE Really Works

https://www.lexjansen.com/nesug/nesug99/ad/ad155.pdf

And here's a slightly less ancient paper that discusses other bizarre results with MERGE:

Danger: MERGE Ahead! Warning: BY Variable with Multiple Lengths

https://support.sas.com/resources/papers/proceedings/proceedings/sugi28/098-28.pdf

Tom · Posted 01-29-2025 10:47 AM

Both very good papers.

One thing I noticed in the first one that I would change is this paragraph.

The role of the PDV clears up confusing combinations
of KEEPs, DROPs, and RENAMEs. All KEEPs,
DROPs, and RENAMEs on a DATA statement refer to
variable names in the PDV. All KEEPs, DROPs, and
RENAMEs on a SET, MERGE, or UPDATE statement
refer to variable names in the source data set. So
when will the following program work?

I like to include the required = when referring to dataset options to make it clearer to the reader that the KEEP= dataset option is a different thing than the KEEP data step statement.

And in the second one it is working to hard to add unneeded commas between the values when generating the list of values for the IN operator.

memname in
("MALES", "FEMALES", "ANDROIDS")

SAS is just as happy with spaces between the values:

memname in ("MALES" "FEMALES" "ANDROIDS")

which is a lot easier to generate in macro code.

Ksharp · Posted 01-29-2025 10:06 PM

1)

Yeah. That is an interesting problem.
As you said ,the problem is from VAR is retained (since it is from a dataset).
first var=var/100 ==>10/100= 0.1
second var=var/100 ==> 0.1/100=0.001
......and so on.

That violated our intuition inside. I suggest to use TWO dataset to avoid this problem:

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id;
cards;
1
1
2
2
2
;
data out1;
merge file1 file2;
by id;
run;
data out1;
set out1;
var = var/100;
run;
proc print;run;

2)

Another problem I am running into when I use MERGE and two datasets have a variable in common :

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id var;
cards;
1 9
1 9
2 99
2 99
2 99
;
data out1;
merge file2 file1;
by id;
run;

proc print;run;

grace_sas · Posted 01-30-2025 01:07 PM

Here are a couple other resources that explain what happens during merges.

Usage Note 48705: A one-to-many merge with common variables that are not the BY variables will have ...

Data Step Concepts

Hope that helps,

Grace

Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

Re: Trap when merging, then redefining variable

The 2025 SAS Hackathon Kicks Off on June 11!

SAS Training: Just a Click Away