@Patrick wrote:
BUT using MD5 instead of SHA-256 will increase the collision risk, which with your data volume isn't negligible.
According to simple approximation formulas (assuming the MD5 digests are uniformly distributed random strings, which might be too optimistic, but I'm not sure), the collision probability for @kenkaran's 1.8E9 keys should be approx. 5E-21, i.e., extremely small (see, e.g., https://towardsdatascience.com/collision-risk-in-hash-based-surrogate-keys-4c87b716cbcd/).
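For anyone who wants to re-check that figure, here is a minimal sketch of the calculation (the 1.8E9 key count and the 128-bit MD5 digest size are the only inputs; the formula is the usual birthday-problem approximation):
data _null_;
n = 1.8e9;                    /* approximate number of keys */
bits = 128;                   /* size of an MD5 digest in bits */
p = n**2 / 2**(bits + 1);     /* ~ n(n-1)/2 key pairs, each colliding with probability 1/2**bits */
put 'Approximate collision probability: ' p e12.;
run;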
@Patrick wrote:
I seem to remember that at one point there was an issue with the hash object: when one didn't define data, the key variables got used as data, doubling the required memory.
Adding the 1-byte dummy data item won't hurt. Tests on my Windows SAS 9.4M5 suggest, however, that for keys with length 16 there is no decrease in item_size by doing so. (The benefit starts at length 17.)
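For illustration, a minimal sketch of that dummy-data pattern (the dataset names header and detail are assumptions; clm_id is the key variable from this thread):
data match;
length _dummy $1;                       /* 1-byte dummy data item */
retain _dummy ' ';
if _n_ = 1 then do;
  dcl hash h();
  h.defineKey('clm_id');
  h.defineData('_dummy');               /* keeps the key from being stored a second time as data */
  h.defineDone();
  do until (done);
    set header(keep=clm_id) end=done;   /* load the header keys once */
    rc = h.add();
  end;
end;
set detail;                             /* one pass through the big table */
if h.find() = 0 then output;            /* keep detail rows whose key is in the header */
drop _dummy rc;
run;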
@Patrick wrote:
Now... If your key variable clm_id contains a 64-character hex string, then that's a base16 value. Another way of shortening the string without increasing the collision risk could be to convert this base16 value to a base32 value.
I would rather favor base 256, as it is both simpler to obtain -- the $HEX64. informat does the conversion -- and more effective: The string length is halved to 32 bytes (as opposed to 52 bytes with base 32).
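A minimal sketch of that conversion (the dataset name header is an assumption; clm_id is assumed to hold the 64-character hex digest):
data header_packed;
set header;
length clm_id_raw $32;
clm_id_raw = input(clm_id, $hex64.);   /* 64 hex characters -> 32 raw bytes */
drop clm_id;
run;
The packed 32-byte value can then be used as the hash key on both tables.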
Will half of the header table fit into memory? A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.
@DerylHollick wrote:
Will half of the header table fit into memory? A third? I'd first try to chunk the header table and run a hash lookup for each chunk. Plus, it might give you a decent idea as to how much memory would be needed for the whole header table.
This is what my suggested code does. But instead of just loading any old half of the header (or, in my example, a fifth), I proposed selecting a half (or a fifth) for a given range of join-variable values, on both datasets. This makes the number of needed join comparisons about one fourth (in the case of halves) or one twenty-fifth (for fifths) for each subgroup join: a very effective reduction, especially when the process "outsources" the subgroup selection to the data engine. True, one has to process the big dataset for each subgroup join, but that burden is significantly reduced by outsourcing the subgroup filtering to the data engine via WHERE= dataset options, as sketched below.
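As a minimal sketch of one such subgroup join (the dataset names header and detail and the split on the first hex digit of clm_id are assumptions for illustration):
data match_1;
if _n_ = 1 then do;
  /* load only the header keys in this subgroup's range */
  dcl hash h(dataset: "header(keep=clm_id where=(substr(clm_id,1,1) <= '7'))");
  h.defineKey('clm_id');
  h.defineData('clm_id');
  h.defineDone();
end;
set detail(where=(substr(clm_id,1,1) <= '7'));   /* same range filter, evaluated by the data engine */
if h.find() = 0 then output;                     /* matches for this subgroup only */
run;
The complementary subgroup would use substr(clm_id,1,1) > '7' on both datasets; for fifths, five non-overlapping key ranges would be used instead.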
I am using two data sets (small and big) to illustrate one method of solving your problem.
data small;
input id :$8.;
datalines;
Rama
Seetha
Sras
Gopal
John
;
run;
data big;
input id :$8. amount;
datalines;
Seetha 100
Rama 200
Gopal 500
Krishna 300
John 400
Anbu 500
Kachi 500
Lakshi 600
asdfgh 700
ordsfg 600
pqwers 600
kasert 700
lasert 800
Anbu 100
Rama 100
Gopal 400
;
run;
I keep only the essential variable for sorting: ID from the BIG data set, plus a row ID (RID = _N_).
This new data set is small enough to sort.
proc sort data = small;
by id;
run;
data tempbig;
set big(keep = id);
RID = _N_;
run;
proc sort data = tempbig;
by id;
run;
Next, I merge the two data sets, SMALL and TEMPBIG, keeping only IDs present in both.
data temp;
merge small(in = a) tempbig(in = b);
by id;
if a and b;
run;
Next, using the POINT= option of the SET statement, the matched records from BIG are retrieved and output.
data want;
do i = 1 by 1 until(eof);
set temp(keep = rid) end = eof;
p = rid;
set big point = p;
output;
end;
stop;
drop i rid;
run;
The output data set is:
Obs    id        amount
  1    Gopal        500
  2    Gopal        400
  3    John         400
  4    Rama         200
  5    Rama         100
  6    Seetha       100
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K DATA step reads through D, with hash-lookup selection on each pass.
Append selections of each run-through.
Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + T_out), where T_out is the duration of writing the key-matched records.
If this needs to be done more than once, you might also want to code a solution in PROC DS2 that uses threads and compare resource and time consumption with the DATA step.
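A minimal sketch of that scheme as a macro loop (the dataset names header and detail, the chunk size, and the output name want are assumptions; with roughly 200 million keys per chunk, 10 passes cover the 1,849,842,886 header keys):
%macro chunked_lookup(nchunks=10, chunksize=200000000);
%do i = 1 %to &nchunks;
  %let firstkey = %eval((&i - 1) * &chunksize + 1);
  %let lastkey  = %eval(&i * &chunksize);
  data match_chunk;
    if _n_ = 1 then do;
      /* load the i-th slice of header keys into the hash table */
      dcl hash h(dataset: "header(keep=clm_id firstobs=&firstkey obs=&lastkey)");
      h.defineKey('clm_id');
      h.defineDone();
    end;
    set detail;                    /* full read of D on every pass */
    if h.find() = 0 then output;   /* write the key-matched records */
  run;
  proc append base=want data=match_chunk;
  run;
%end;
%mend chunked_lookup;
%chunked_lookup()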
@RichardAD wrote:
Let T = duration for flat read of detail table D
Let K = number of header keys that _can_ fit in a hash table.
Do 1,849,842,886 / K DATA step reads through D, with hash-lookup selection on each pass.
Append selections of each run-through.
Guessing conservatively, and presuming 70 GB set aside for a session, there should be enough memory for a hash table that holds 500M 64-byte keys. So maybe 6 reads through D. Make that a worst case of 10 read-throughs, so 10 * (T + T_out), where T_out is the duration of writing the key-matched records.
If this needs to be done more than once, you might also want to code a solution in PROC DS2 that uses threads and compare resource and time consumption with the DATA step.
You don't need that much input/output activity.
If, after examining the header dataset against the memory available for the hash object, you find that you have to do 10 subgroup joins, you don't have to generate input/output totaling 10*(T + T_out). It can be reduced to about 3*T + 1*T_out. Divide the header into 10 subgroups based on the value of the join variable. Then, given you have the disk space, also divide the detail dataset into 10 smaller datasets using the same join variable; that can be done in one DATA step totaling about 2*T of input/output (one read of the detail plus writing the 10 subsets). After that, each of the 10 subgroup joins needs only about 0.1*(T + T_out), so the joins together add roughly 1*T + 1*T_out. You can save even more by creating the detail subgroups with only the variables of interest, as sketched below.
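As a minimal sketch of that one-pass split (the dataset name detail is an assumption, clm_id is the key variable from this thread, and the grouping rule is just one possible choice; any deterministic function of the join key works, as long as the header is divided by the same rule):
data d0 d1 d2 d3 d4 d5 d6 d7 d8 d9;
set detail(keep=clm_id);   /* extend KEEP= with the variables of interest */
grp = mod(input(substr(clm_id, 1, 2), hex2.), 10);   /* group 0-9 from the first two hex digits of the key */
select (grp);
  when (0) output d0;
  when (1) output d1;
  when (2) output d2;
  when (3) output d3;
  when (4) output d4;
  when (5) output d5;
  when (6) output d6;
  when (7) output d7;
  when (8) output d8;
  otherwise output d9;
end;
drop grp;
run;
Each subgroup join then loads only the header keys that satisfy the same grp condition into the hash object and reads the corresponding small dataset.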