@Patrick wrote:
BUT using MD5 instead of SHA-256 will increase the collision risk, which with your data volume isn't negligible.
According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for @kenkaran's 1.8E9 keys should be approx. 5E-21, i.e., extremely small (see, e.g., https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/).
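For anyone who wants to re-check that figure, here is a minimal sketch of the calculation (the 1.8E9 key count and the 128-bit MD5 digest size are the only inputs; the formula is the usual birthday-problem approximation):
data _null_;
n = 1.8e9;                    /* approximate number of keys */
bits = 128;                   /* size of an MD5 digest in bits */
p = n**2 / 2**(bits + 1);     /* ~ n(n-1)/2 key pairs, each colliding with probability 1/2**bits */
put 'Approximate collision probability: ' p e12.;
run;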
@Patrick wrote:
I seem to remember that at one point there was an issue with the hash object: when one didn't define data, the key variables got used as data, doubling the required memory.
Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in item_size by doing so. (The benefit starts at length 17.)
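For illustration, a minimal sketch of that dummy-data pattern (the dataset names header and detail are assumptions; clm_id is the key variable from this thread):
data match;
length _dummy $1;                       /* 1-byte dummy data item */
retain _dummy ' ';
if _n_ = 1 then do;
  dcl hash h();
  h.defineKey('clm_id');
  h.defineData('_dummy');               /* keeps the key from being stored a second time as data */
  h.defineDone();
  do until (done);
    set header(keep=clm_id) end=done;   /* load the header keys once */
    rc = h.add();
  end;
end;
set detail;                             /* one pass through the big table */
if h.find() = 0 then output;            /* keep detail rows whose key is in the header */
drop _dummy rc;
run;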
@Patrick wrote:
Now... If your key variable clm_id contains a 64-character hex string, then that's a base16 value. Another way of shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.
I would rather favor base 256, as it is both simpler to obtain -- the $HEX64. informat does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).
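A minimal sketch of that conversion (the dataset name header is an assumption; clm_id is assumed to hold the 64-character hex digest):
data header_packed;
set header;
length clm_id_raw $32;
clm_id_raw = input(clm_id, $hex64.);   /* 64 hex characters -> 32 raw bytes */
drop clm_id;
run;
The packed 32-byte value can then be used as the hash key on both tables.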
Will half of the header table fit into memory? A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.
@DerylHollick wrote:
Will half of the header table fit into memory? A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.
This is what my suggested code does. But instead of just loading any old half of the header (or, in my example, a fifth), I proposed selecting a half (or a fifth) for a given range of join-variable values, on both datasets. This makes the number of needed join comparisons about one fourth (in the case of halves) or one twenty-fifth (for fifths) for each subgroup join: a very effective reduction, especially when the process "outsources" the subgroup selection to the data engine. True, one has to process the big dataset for each subgroup join, but that burden is significantly reduced by outsourcing the subgroup filtering to the data engine via WHERE= dataset options, as sketched below.
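As a minimal sketch of one such subgroup join (the dataset names header and detail and the split on the first hex digit of clm_id are assumptions for illustration):
data match_1;
if _n_ = 1 then do;
  /* load only the header keys in this subgroup's range */
  dcl hash h(dataset: "header(keep=clm_id where=(substr(clm_id,1,1) <= '7'))");
  h.defineKey('clm_id');
  h.defineData('clm_id');
  h.defineDone();
end;
set detail(where=(substr(clm_id,1,1) <= '7'));   /* same range filter, evaluated by the data engine */
if h.find() = 0 then output;                     /* matches for this subgroup only */
run;
The complementary subgroup would use substr(clm_id,1,1) > '7' on both datasets; for fifths, five non-overlapping key ranges would be used instead.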
I am using two data sets (small and big) to illustrate one method of solving your problem.
data small;
input id :$8.;
datalines;
Rama
Seetha
Sras
Gopal
John
;
run;
data big;
input id :$8. amount;
datalines;
Seetha 100
Rama 200
Gopal 500
Krishna 300
John 400
Anbu 500
Kachi 500
Lakshi 600
asdfgh 700
ordsfg 600
pqwers 600
kasert 700
lasert 800
Anbu 100
Rama 100
Gopal 400
;
run;
I keep only the essential variable for sorting: ID from the BIG data set, plus a row ID (RID = _N_).
This new data set is small enough to sort.
proc sort data = small;
by id;
run;
data tempbig;
set big(keep = id);
RID = _N_;
run;
proc sort data = tempbig;
by id;
run;
Next, I merge the two data sets, SMALL and TEMPBIG, keeping only IDs present in both.
data temp;
merge small(in = a) tempbig(in = b);
by id;
if a and b;
run;
Next, using the POINT= option of the SET statement, the matched records from BIG are retrieved and output.
data want;
do i = 1 by 1 until(eof);
set temp(keep = rid) end = eof;
p = rid;
set big point = p;
output;
end;
stop;
drop i rid;
run;
The output data set is:
Obs    id        amount
  1    Gopal        500
  2    Gopal        400
  3    John         400
  4    Rama         200
  5    Rama         100
  6    Seetha       100
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K DATA step reads through D, with hash-lookup selection on each pass.
Append selections of each run-through.
Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + T_out), where T_out is the duration of writing the key-matched records.
If this needs to be done more than once, you might also want to code a solution in PROC DS2 that uses threads and compare resource and time consumption with the DATA step.
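A minimal sketch of that scheme as a macro loop (the dataset names header and detail, the chunk size, and the output name want are assumptions; with roughly 200 million keys per chunk, 10 passes cover the 1,849,842,886 header keys):
%macro chunked_lookup(nchunks=10, chunksize=200000000);
%do i = 1 %to &nchunks;
  %let firstkey = %eval((&i - 1) * &chunksize + 1);
  %let lastkey  = %eval(&i * &chunksize);
  data match_chunk;
    if _n_ = 1 then do;
      /* load the i-th slice of header keys into the hash table */
      dcl hash h(dataset: "header(keep=clm_id firstobs=&firstkey obs=&lastkey)");
      h.defineKey('clm_id');
      h.defineDone();
    end;
    set detail;                    /* full read of D on every pass */
    if h.find() = 0 then output;   /* write the key-matched records */
  run;
  proc append base=want data=match_chunk;
  run;
%end;
%mend chunked_lookup;
%chunked_lookup()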
@RichardAD wrote:
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K DATA step reads through D, with hash-lookup selection on each pass.
Append selections of each run-through.
Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + T_out), where T_out is the duration of writing the key-matched records.
If this needs to be done more than once, you might also want to code a solution in PROC DS2 that uses threads and compare resource and time consumption with the DATA step.
You don't need that much input/output activity.
If, after examining the header dataset against the memory available for the hash object, you find that you have to do 10 subgroup joins, you don't have to generate input/output totaling 10*(T + T_out). It can be reduced to about 3*T + 1*T_out. Divide the header into 10 subgroups based on the value of the join variable. Then, given you have the disk space, also divide the detail dataset into 10 smaller datasets using the same join variable; that can be done in one DATA step totaling about 2*T of input/output (one read of the detail plus writing the 10 subsets). After that, each of the 10 subgroup joins needs only about 0.1*(T + T_out), so the joins together add roughly 1*T + 1*T_out. You can save even more by creating the detail subgroups with only the variables of interest, as sketched below.
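As a minimal sketch of that one-pass split (the dataset name detail is an assumption, clm_id is the key variable from this thread, and the grouping rule is just one possible choice; any deterministic function of the join key works, as long as the header is divided by the same rule):
data d0 d1 d2 d3 d4 d5 d6 d7 d8 d9;
set detail(keep=clm_id);   /* extend KEEP= with the variables of interest */
grp = mod(input(substr(clm_id, 1, 2), hex2.), 10);   /* group 0-9 from the first two hex digits of the key */
select (grp);
  when (0) output d0;
  when (1) output d1;
  when (2) output d2;
  when (3) output d3;
  when (4) output d4;
  when (5) output d5;
  when (6) output d6;
  when (7) output d7;
  when (8) output d8;
  otherwise output d9;
end;
drop grp;
run;
Each subgroup join then loads only the header keys that satisfy the same grp condition into the hash object and reads the corresponding small dataset.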