Re: Merging 2 data sets and assigning the same record_id based on comm...

Mscarboncopy · Posted 03-12-2025 03:58 PM

Hello.

I have the following 2 data sets with different variables that I need to merge into one. My problem is that the second file ID is different from the first.

The common var is only VAR1. I am not giving examples for what is in the other variables in each set as I do not think they will give me an issue. Just listing the names here for an idea of what my outcome would look like.

Is there a way to reassign a record ID based on a common var like this? I would imagine creating a new ID var - dropping var record_ID and then renaming the new ID ID like in SET1.

I can merge the file by VAR1 but it is not the output I am looking for, as I get 2 records per ID.

SET 1

input ID Var1 P1 $ P2 P3 P4 P5;

Datalines;

1 3456

2 8796

3 2222

4 7689

;

SET 2

Input record_ID VAR1 B1 B2 B3 $ B4 B5;

Datalines;

10 3456

12 8796

15 2222

20 7689

;

What I need is to make record_id 10 12 15 20 become 1 2 3 4, based on that common VAR1 and have all variables in one record. Like this:

ID VAR1 P1 P2 P3 P4 P5 B1 B2 B3 B4 B5

1 3456

2 8796

3 2222

4 7689

Thank you!

himself · Posted 03-12-2025 05:26 PM

Hi you can achieve this in SAS by merging the datasets based on the common variable VAR1 and then reassigning the record_ID to match the ID in SET1. Here's how you can do it:

/* Step 1: Create SET1 */
data SET1;
    input ID Var1 P1 $ P2 P3 P4 P5;
    datalines;
1 3456
2 8796
3 2222
4 7689
;
run;

/* Step 2: Create SET2 */
data SET2;
    input record_ID VAR1 B1 B2 B3 $ B4 B5;
    datalines;
10 3456
12 8796
15 2222
20 7689
;
run;

/* Step 3: Merge datasets based on VAR1 */
proc sql;
    create table merged as
    select a.ID, a.Var1, a.P1, a.P2, a.P3, a.P4, a.P5, 
           b.B1, b.B2, b.B3, b.B4, b.B5
    from SET1 as a
    inner join SET2 as b
    on a.Var1 = b.Var1;
quit;

/* Step 4: Display the merged dataset */
proc print data=merged;
run;

Using PROC SQL to merge datasets in SAS has several advantages:

Flexibility: PROC SQL allows for complex joins and merges, including inner joins, outer joins, and conditional joins, which can be more flexible than traditional DATA step merges.
Simplicity: The syntax for merging datasets using PROC SQL can be simpler and more intuitive, especially for those familiar with SQL.

References

Match Merging Data Files Using PROC SQL

Tom · Posted 03-12-2025 05:26 PM

Assuming that VAR1 uniquely defines the observations in at least one of the two dataset just do a MERGE.

proc sort data=set1;
  by var1;
run;
proc sort data=set2;
  by var1;
run;
data want;
   merge set1 set2;
   by var1;
run;

Mscarboncopy · Posted 03-13-2025 08:22 AM

Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.

I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.

Tom · Posted 03-13-2025 08:40 AM

@Mscarboncopy wrote:

Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.

I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.

Why do you have 2 records per ID? Did you start with multiple records per ID (or as it was called in the other dataset RECORD_ID?) to begin with? Did you have multiple records per VAR1 in either of the datasets?

What do you want to do to reduce 2 records into one?

I do not see duplicates or missing values with your example set of ID/VAR1 values.

Spoiler

data set1 ;
  input ID Var1 P1 $ P2 P3 P4 P5;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;

 
data set2;
  input record_ID VAR1 B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;

proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;

data want;
  length new_id 8 ;
  merge set1 set2 ;
  by var1;
  new_id = coalesce(record_id,id);
run;

data set1 ; input ID Var1 P1 $ P2 P3 P4 P5; datalines; 1 3456 A 2 3 4 5 2 8796 B 1 2 2 4 3 2222 C 5 6 7 8 4 7689 D 9 1 2 3 ; data set2; input record_ID VAR1 B1 B2 B3 $ B4 B5; datalines; 10 3456 1 2 A 3 4 12 8796 5 6 B 7 8 15 2222 9 1 C 2 3 20 7689 4 5 D 6 7 ; proc sort data=set1; by var1; run; proc sort data=set2; by var1; run; data want; length new_id 8 ; merge set1 set2 ; by var1; new_id = coalesce(record_id,id); run;

Mscarboncopy · Posted 03-13-2025 01:07 PM

It is so odd @Tom. I tried your code and this is what I get. We have data being collected separately and this is why the same participants are getting different IDs. I am trying to merge the files and keep one ID (which is in my example var ID) and one record per participant. When I do it the way Ksharp suggested I get them all in one record but I am missing all the data from Set1

This is your suggestion I will post next what I am getting with Ksharp's idea. Thank you for your patience.

OBS	new_id	ID	Var1	P1	P2	P3	P4	P5	Record_id	B1	B2	B3	B4	B5
1	15		2222						15	9	1	C	2	3
2	10		3456						10	1	2	A	3	4
3	20		7689						20	4	5	D	6	7
4	12		8796						12	5	6	B	7	8
5	3	3	2222	C	5	6	7	8
6	1	1	3456	A	2	3	4	5
7	4	4	7689	D	9	1	2	3
8	2	2	8796	B	1	2	2	4

Mscarboncopy · Posted 03-13-2025 01:08 PM

@Tom @Ksharp What I am getting with Ksharp's code. It is what I need but I need the values to be there as well as the ID.

OBS	ID	Var1	P1	P2	P3	P4	P5	B1	B2	B3	B4	B5
1		2222						9	1	C	2	3
2		3456						1	2	A	3	4
3		7689						4	5	D	6	7
4		8796						5	6	B	7	8

Tom · Posted 03-13-2025 02:41 PM

To get that output then values of VAR1 that LOOK the same like 2222 must not actually BE the same.

So if VAR1 is CHARACTER then one of the two values has leading spaces. SAS will ignore trailing spaces but leading spaces are not ignored.

'   2222' ne '2222'

NOTE: If you print the values using ODS then the leading spaces are removed by ODS so you cannot see them in the report. You can attach the $QUOTE format to the variable. Then leading spaces will be apparent in ODS output.

You can fix this by applying the LEFT() function to move the leading spaces to trailing spaces.

If VAR1 is NUMERIC then there is some small difference in the values. If the values should be integers then you can apply the ROUND() function so the values match what was printed. If the integers use more than 16 digits then you have values that are too large to be stored uniquely as numbers in SAS. Convert them to character variables instead.

I can reproduce your output by adding leading spaces to VAR1 in SET1.

data set1 ;
  input ID Var1 $ P1 $ P2 P3 P4 P5;
  var1=' '||var1;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;


data set2;
  input record_ID VAR1 $ B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;

proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;

data want;
  length new_id 8 ;
  merge set1 set2 ;
  by var1;
  new_id = coalesce(record_id,id);
run;

proc print;
run;

Mscarboncopy · Posted 03-14-2025 08:12 AM

@TomThank you again. I did not think of this at all. Var1 is actually a string so that was exactly the issue. I transformed it in Numeric and that seems to have solved the issue. I am assuming transforming into numeric is the solution for any trailing? Am I right? As participants can easily add a leading space here and there.

I like it that there are different ways to reach the same result. I have to investigate a little more as with your code I am getting an extra ID for some reason, as opposed to when I use the code from KSharp. But I am pulling from my data not the example I gave. Thank you for your guidance. It is really appreciated.

Tom · Posted 03-14-2025 05:57 PM

Converting to numbers will work as long has you do not have one of these conditions:

Significant leading zeros. ID=0001 is not the same entity as ID=1.
Values that are larger than the SAS can store uniquely as a number.

1    data _null_;
2      maxint = constant('exactint');
3      put maxint=comma23.;
4    run;

maxint=9,007,199,254,740,992

Mscarboncopy · Posted 03-15-2025 11:15 AM

Thank you!

Mscarboncopy · Posted 03-17-2025 11:11 AM

@Tom @Ksharp Another question. I added different numbers on VAR1 and noticed that this one gave me an issue: 657899654.

This is curious:

1. Using Tom's code it applied the correct ID but it created another record with a round up number of 700000000 with a blank ID. The record that get the ID does not have the data that would be coming from set 2. The extra record does.

2. If I use Ksharp's code, no extra record is created but that specific record does not get the ID, ID is blank and the merging variable number VAR1 is transformed to 700000000. All other variables are in, as they should.

It looks like numbers up to 7 digits work fine. I did not try 8 digits, is it safe to say that up to 8 digits it would work fine? Is the issue here that the number is 9 digit?

Why do the codes not work for that specific Var1 # ?

I can make sure the VAR1 number is not >7 or 8 digit, if that is the issue.

I wanted to understand what is happening before moving forward.

Why would SAS round up that particular Var 1 value (on both codes) and split the data on 1 and not assign the ID on 2? It did not do the same for any of the other values I entered (up to 7 digits).

Thanks.

Tom · Posted 03-17-2025 12:53 PM

Need to see what code you actually used. Is VAR1 numeric or character? How did you create the datasets? Did you run a data step to read in text lines? From a file? From in-line cards/datalines? Did you read in some existing dataset? What was the source?

I cannot think on any operations that would round 657,899,654 to 700,000,000 other than rounding to 100 millions place.

var1=round(var1,100000000);

Mscarboncopy · Posted 03-18-2025 02:27 PM

I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.

This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines.

In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?

Could that have been it?

Tom · Posted 03-18-2025 02:52 PM

@Mscarboncopy wrote:

I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.

This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines.

In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?

Could that have been it?

You should not see issues with dealing with integer values of that length with SAS. Not sure a about what REDCAP is doing internally or how they attempt to convert what they have internally into SAS datasets. Perhaps the field length in REDCAP is limited in some way causing some truncation? Perhaps REDCAP's idea of how to convert what they have into SAS datasets is flawed?

Look at the values for the two observations that you think should match before your attempt to convert them on the SAS side into numbers. To really see what character variables have use the $HEX format. That will display each byte as two hexadecimal digits. A space will be show as hex code 20. The digits will be hexcodes 30 to 39. Any other hexcodes will cause trouble.

You can get some "rounding" if you where to write out that number using SAS's BEST format without enough characters to represent all of the value. To get only one digit you would have to use BEST3. format, which would result in displaying 7E8 which when converted back into a number would become 7 with 8 zeros.

22   data test;
23     string='657,899,654';
24     number1=input(string,comma32.);
25     string2=put(number1,best3.);
26     number2=input(string2,32.);
27     put (_all_) (=/);
28   run;


string=657,899,654
number1=657899654
string2=7E8
number2=700000000

Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

Re: Merging 2 data sets and assigning the same record_id based on common var

The 2025 SAS Hackathon Kicks Off on June 11!

SAS Training: Just a Click Away