Hello.
I have the following 2 data sets with different variables that I need to merge into one. My problem is that the second file ID is different from the first.
The common var is only VAR1. I am not giving examples for what is in the other variables in each set as I do not think they will give me an issue. Just listing the names here for an idea of what my outcome would look like.
Is there a way to reassign a record ID based on a common var like this? I would imagine creating a new ID var - dropping var record_ID and then renaming the new ID ID like in SET1.
I can merge the file by VAR1 but it is not the output I am looking for, as I get 2 records per ID.
SET 1
input ID Var1 P1 $ P2 P3 P4 P5;
Datalines;
1 3456
2 8796
3 2222
4 7689
;
SET 2
Input record_ID VAR1 B1 B2 B3 $ B4 B5;
Datalines;
10 3456
12 8796
15 2222
20 7689
;
What I need is to make record_id 10 12 15 20 become 1 2 3 4, based on that common VAR1 and have all variables in one record. Like this:
ID VAR1 P1 P2 P3 P4 P5 B1 B2 B3 B4 B5
1 3456
2 8796
3 2222
4 7689
Thank you!
Hi you can achieve this in SAS by merging the datasets based on the common variable VAR1
and then reassigning the record_ID
to match the ID
in SET1
. Here's how you can do it:
/* Step 1: Create SET1 */
data SET1;
input ID Var1 P1 $ P2 P3 P4 P5;
datalines;
1 3456
2 8796
3 2222
4 7689
;
run;
/* Step 2: Create SET2 */
data SET2;
input record_ID VAR1 B1 B2 B3 $ B4 B5;
datalines;
10 3456
12 8796
15 2222
20 7689
;
run;
/* Step 3: Merge datasets based on VAR1 */
proc sql;
create table merged as
select a.ID, a.Var1, a.P1, a.P2, a.P3, a.P4, a.P5,
b.B1, b.B2, b.B3, b.B4, b.B5
from SET1 as a
inner join SET2 as b
on a.Var1 = b.Var1;
quit;
/* Step 4: Display the merged dataset */
proc print data=merged;
run;
Using PROC SQL to merge datasets in SAS has several advantages:
References
Match Merging Data Files Using PROC SQL
Assuming that VAR1 uniquely defines the observations in at least one of the two dataset just do a MERGE.
proc sort data=set1;
by var1;
run;
proc sort data=set2;
by var1;
run;
data want;
merge set1 set2;
by var1;
run;
Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.
I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.
@Mscarboncopy wrote:
Thank you @Tom . When I do that I have 2 records per ID in my merged file, as the variables are not all the same aside from Var1. I need a file with only one record per ID.
I tried what Ksharp suggested and it worked, pending troubleshooting why the values from my set 1 are showing up as missing.
Why do you have 2 records per ID? Did you start with multiple records per ID (or as it was called in the other dataset RECORD_ID?) to begin with? Did you have multiple records per VAR1 in either of the datasets?
What do you want to do to reduce 2 records into one?
I do not see duplicates or missing values with your example set of ID/VAR1 values.
data set1 ;
input ID Var1 P1 $ P2 P3 P4 P5;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;
data set2;
input record_ID VAR1 B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;
proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;
data want;
length new_id 8 ;
merge set1 set2 ;
by var1;
new_id = coalesce(record_id,id);
run;
It is so odd @Tom. I tried your code and this is what I get. We have data being collected separately and this is why the same participants are getting different IDs. I am trying to merge the files and keep one ID (which is in my example var ID) and one record per participant. When I do it the way Ksharp suggested I get them all in one record but I am missing all the data from Set1
This is your suggestion I will post next what I am getting with Ksharp's idea. Thank you for your patience.
OBS | new_id | ID | Var1 | P1 | P2 | P3 | P4 | P5 | Record_id | B1 | B2 | B3 | B4 | B5 |
1 | 15 | 2222 | 15 | 9 | 1 | C | 2 | 3 | ||||||
2 | 10 | 3456 | 10 | 1 | 2 | A | 3 | 4 | ||||||
3 | 20 | 7689 | 20 | 4 | 5 | D | 6 | 7 | ||||||
4 | 12 | 8796 | 12 | 5 | 6 | B | 7 | 8 | ||||||
5 | 3 | 3 | 2222 | C | 5 | 6 | 7 | 8 | ||||||
6 | 1 | 1 | 3456 | A | 2 | 3 | 4 | 5 | ||||||
7 | 4 | 4 | 7689 | D | 9 | 1 | 2 | 3 | ||||||
8 | 2 | 2 | 8796 | B | 1 | 2 | 2 | 4 |
|
To get that output then values of VAR1 that LOOK the same like 2222 must not actually BE the same.
So if VAR1 is CHARACTER then one of the two values has leading spaces. SAS will ignore trailing spaces but leading spaces are not ignored.
' 2222' ne '2222'
NOTE: If you print the values using ODS then the leading spaces are removed by ODS so you cannot see them in the report. You can attach the $QUOTE format to the variable. Then leading spaces will be apparent in ODS output.
You can fix this by applying the LEFT() function to move the leading spaces to trailing spaces.
If VAR1 is NUMERIC then there is some small difference in the values. If the values should be integers then you can apply the ROUND() function so the values match what was printed. If the integers use more than 16 digits then you have values that are too large to be stored uniquely as numbers in SAS. Convert them to character variables instead.
I can reproduce your output by adding leading spaces to VAR1 in SET1.
data set1 ;
input ID Var1 $ P1 $ P2 P3 P4 P5;
var1=' '||var1;
datalines;
1 3456 A 2 3 4 5
2 8796 B 1 2 2 4
3 2222 C 5 6 7 8
4 7689 D 9 1 2 3
;
data set2;
input record_ID VAR1 $ B1 B2 B3 $ B4 B5;
datalines;
10 3456 1 2 A 3 4
12 8796 5 6 B 7 8
15 2222 9 1 C 2 3
20 7689 4 5 D 6 7
;
proc sort data=set1; by var1; run;
proc sort data=set2; by var1; run;
data want;
length new_id 8 ;
merge set1 set2 ;
by var1;
new_id = coalesce(record_id,id);
run;
proc print;
run;
@TomThank you again. I did not think of this at all. Var1 is actually a string so that was exactly the issue. I transformed it in Numeric and that seems to have solved the issue. I am assuming transforming into numeric is the solution for any trailing? Am I right? As participants can easily add a leading space here and there.
I like it that there are different ways to reach the same result. I have to investigate a little more as with your code I am getting an extra ID for some reason, as opposed to when I use the code from KSharp. But I am pulling from my data not the example I gave. Thank you for your guidance. It is really appreciated.
Converting to numbers will work as long has you do not have one of these conditions:
1 data _null_; 2 maxint = constant('exactint'); 3 put maxint=comma23.; 4 run; maxint=9,007,199,254,740,992
Thank you!
@Tom @Ksharp Another question. I added different numbers on VAR1 and noticed that this one gave me an issue: 657899654.
This is curious:
1. Using Tom's code it applied the correct ID but it created another record with a round up number of 700000000 with a blank ID. The record that get the ID does not have the data that would be coming from set 2. The extra record does.
2. If I use Ksharp's code, no extra record is created but that specific record does not get the ID, ID is blank and the merging variable number VAR1 is transformed to 700000000. All other variables are in, as they should.
It looks like numbers up to 7 digits work fine. I did not try 8 digits, is it safe to say that up to 8 digits it would work fine? Is the issue here that the number is 9 digit?
Why do the codes not work for that specific Var1 # ?
I can make sure the VAR1 number is not >7 or 8 digit, if that is the issue.
I wanted to understand what is happening before moving forward.
Why would SAS round up that particular Var 1 value (on both codes) and split the data on 1 and not assign the ID on 2? It did not do the same for any of the other values I entered (up to 7 digits).
Thanks.
Need to see what code you actually used. Is VAR1 numeric or character? How did you create the datasets? Did you run a data step to read in text lines? From a file? From in-line cards/datalines? Did you read in some existing dataset? What was the source?
I cannot think on any operations that would round 657,899,654 to 700,000,000 other than rounding to 100 millions place.
var1=round(var1,100000000);
I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.
This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines.
In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?
Could that have been it?
@Mscarboncopy wrote:
I transformed VAR1 that is being collected as character into numeric to solve my issue with leading spaces for the code you suggested I use for the merge/re-coding of the var ID.
This is fake data I am entering, the data is coming from a REDCap form that is being exported as a SAS file. I use the SAS files to run through my code (The two files I mentioned before set 1 and set 2). I did not enter the values in datalines.
In the code you suggested I use, you mentioned something about a possible issue coming from a number that is too long?
Could that have been it?
You should not see issues with dealing with integer values of that length with SAS. Not sure a about what REDCAP is doing internally or how they attempt to convert what they have internally into SAS datasets. Perhaps the field length in REDCAP is limited in some way causing some truncation? Perhaps REDCAP's idea of how to convert what they have into SAS datasets is flawed?
Look at the values for the two observations that you think should match before your attempt to convert them on the SAS side into numbers. To really see what character variables have use the $HEX format. That will display each byte as two hexadecimal digits. A space will be show as hex code 20. The digits will be hexcodes 30 to 39. Any other hexcodes will cause trouble.
You can get some "rounding" if you where to write out that number using SAS's BEST format without enough characters to represent all of the value. To get only one digit you would have to use BEST3. format, which would result in displaying 7E8 which when converted back into a number would become 7 with 8 zeros.
22 data test; 23 string='657,899,654'; 24 number1=input(string,comma32.); 25 string2=put(number1,best3.); 26 number2=input(string2,32.); 27 put (_all_) (=/); 28 run; string=657,899,654 number1=657899654 string2=7E8 number2=700000000
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.