BookmarkSubscribeRSS Feed
BruceBrad
Lapis Lazuli | Level 10

I've been using SAS for many years, but never came across this trap before. Surely it deserves a warning message in the log!

I can see now how the problem arises when I think through how SAS does merges (by retaining variables on the 1 side of a 1:n merge). The solution is to never redefine variables in your data step when merging (if the variable comes from the 1: side of the merge) - even if it might be convenient for doing things like converting cents to $.

 

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id;
cards;
1 
1
2
2
2
;
data out1;
merge file1 file2;
by id;
var = var/100;  /* this gets calculated as if var were retained for each by value */
run;
proc print;
6 REPLIES 6
ChrisNZ
Tourmaline | Level 20

 

The value of a variable is only reset to missing in the data step's PDV at the start of a data step if

- it is not a data set variable, and

- it is not a retained variable (RETAIN statement or increment syntax).

 

The value of a data set variable is refreshed in the PDV when a value is read from the data set.

 

When none of these condition are true (data set variable, but no value read for the data set), the value in the PDV remains unchanged.

 

People who don't know that have not been trained properly. Not blaming them, it's very common unfortunately.

The PDV logic also explains why a Cartesian product does not take place when merging data sets in a data step.

 

I'll take the opportunity to complain again that the message

 NOTE: MERGE statement has more than one data set with repeats of BY values. 

should never have been a note, and should now be customisable to a WARNING or an ERROR, with an option similar to these:

  DKRICOND=ERROR    Specifies the error level to report when a variable is missing from an input data set during the processing of a 
                    DROP=, KEEP=, or RENAME= data set option.
  DKROCOND=WARN     Specifies the error level to report when a variable is missing from an output data set during the processing of 
                    a DROP=, KEEP=, or RENAME= data set option.

 

Tom
Super User Tom
Super User

That is why 1 to many merges actually work.

 

Just do you calculations on the first observation of the by group.

data out1;
  merge file1 file2;
  by id;
  if first.id then var = var/100;  
run;
Astounding
PROC Star

Here are some ancient papers that discuss this and similar issues.

 

Here's an ancient paper that discusses your original post plus more:

How MERGE Really Works

https://www.lexjansen.com/nesug/nesug99/ad/ad155.pdf

 

And here's a slightly less ancient paper that discusses other bizarre results with MERGE:

Danger:  MERGE Ahead!  Warning:  BY Variable with Multiple Lengths

https://support.sas.com/resources/papers/proceedings/proceedings/sugi28/098-28.pdf

 

 

Tom
Super User Tom
Super User

Both very good papers.

 

One thing I noticed in the first one that I would change is this paragraph.

 

The role of the PDV clears up confusing combinations
of KEEPs, DROPs, and RENAMEs. All KEEPs,
DROPs, and RENAMEs on a DATA statement refer to
variable names in the PDV. All KEEPs, DROPs, and
RENAMEs on a SET, MERGE, or UPDATE statement
refer to variable names in the source data set. So
when will the following program work?

I like to include the required = when referring to dataset options to make it clearer to the reader that the KEEP= dataset option is a different thing than the KEEP data step statement.

 

And in the second one it is working to hard to add unneeded commas between the values when generating the list of values for the IN operator. 

 

memname in
("MALES", "FEMALES", "ANDROIDS")

SAS is just as happy with spaces between the values:

memname in ("MALES" "FEMALES" "ANDROIDS")

which is a lot easier to generate in macro code.

 

 

 

 

Ksharp
Super User

1)

Yeah. That is an interesting problem.
As you said ,the problem is from VAR is retained (since it is from a dataset).
first var=var/100 ==>10/100= 0.1
second var=var/100 ==> 0.1/100=0.001
......and so on.

That violated our intuition inside. I suggest to use TWO dataset to avoid this problem:

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id;
cards;
1
1
2
2
2
;
data out1;
merge file1 file2;
by id;
run;
data out1;
set out1;
var = var/100;
run;
proc print;run;

 

 

 

2)

Another problem I am running into when I use MERGE and two datasets have a variable in common :

data file1;
input id var;
cards;
1 10
2 20
;
data file2;
input id var;
cards;
1 9
1 9
2 99
2 99
2 99
;
data out1;
merge file2 file1;
by id;
run;

proc print;run;

Ksharp_1-1738206818971.png

 

grace_sas
SAS Employee

Here are a couple other resources that explain what happens during merges.

 

Usage Note 48705: A one-to-many merge with common variables that are not the BY variables will have ... 

Data Step Concepts 

 

Hope that helps,

Grace 

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1228 views
  • 13 likes
  • 6 in conversation
OSZAR »