I have a dataset with variables for year, city, indicator, and indicator_options. I have aggregated them together to get the count of all those for indicator_options within the indicator, city and year. I am trying to suppress data that is between 1-5 and then also considering secondary/complementary suppression so that if there is only one missing value, the next lowest value is suppressed. I have been able to do the first round of suppression but am getting stuck with the secondary suppression.
I asked ChatGPT, which gave me a pretty good solution:
DATA trends; SET trends; BY options indicator city year; /* Step 1: Initialize variables for tracking */ RETAIN missing_count next_min_value; /* Initialize for the first record in each group */ IF first.year THEN DO; missing_count = 0; next_min_value = .; END; /* Step 2: Count the missing values in the group */ IF count_supp = . THEN missing_count + 1; /* Step 3: Find the next lowest non-missing value (next smallest value) */ IF count_supp NE . THEN DO; IF next_min_value = . OR count_supp < next_min_value THEN next_min_value = count_supp; END; /* Step 4: Modify the next smallest non-missing value if exactly one missing value exists in the group */ IF last.year AND missing_count = 1 THEN DO; IF count_supp = next_min_value THEN count_supp = .; /* Set the next smallest non-missing value to missing */ END; /* Step 5: Output the modified rows */ count_supp_new = count_supp; /* Create the new column for modified values */ OUTPUT; /* Drop temporary variables */ DROP missing_count next_min_value; RUN;
However, the problem arises with Step 4 which seems to just be looking at the last row, not the next lowest value within the group. I can't figure out how to modify the code so that it considers the whole group.
Here is the output with the code:
year | city | indicator | options | count | count_supp | count_supp_new |
2017 | Paris | job title | Consultant | 8 | 8 | 8 |
2019 | Paris | job title | Consultant | 6 | 6 | 6 |
2021 | Paris | job title | Consultant | 4 | . | . |
2023 | Paris | job title | Consultant | 8 | 8 | . |
Here is what it should look like with the next lowest value marked as missing/suppressed.
year | city | indicator | options | count | count_supp | count_supp_new |
2017 | Paris | job title | Consultant | 8 | 8 | 8 |
2019 | Paris | job title | Consultant | 6 | 6 | . |
2021 | Paris | job title | Consultant | 4 | . | . |
2023 | Paris | job title | Consultant | 8 | 8 | 8 |
Here's a modified version that does that:
data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017 Paris job title Consultant 8
2019 Paris job title Consultant 6
2021 Paris job title Consultant 4
2023 Paris job title Consultant 8
;
run;
data have;
length srt 3;
set have;
srt=_N_;
run;
proc sort data=have; by city indicator options count; run;
data want;
set have;
by city indicator options;
length nsupp count_supp count_supp_new 3;
if first.options then nsupp=0;
count_supp=count;
count_supp_new=count;
if count<5 or nsupp=1 then do;
count_supp_new=.;
if count<5 then count_supp=.;
nsupp+1;
end;
drop nsupp;
run;
proc sort data=want; by srt; run;
proc print data=want; run;
I'm not sure if your data are more complicated than what you've shown here, but if not, then this seems to work - suppresses counts <5 and will suppress the next largest if the number suppressed has not reached two yet:
data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017 Paris job title Consultant 8
2019 Paris job title Consultant 6
2021 Paris job title Consultant 4
2023 Paris job title Consultant 8
;
run;
data have;
length srt 3;
set have;
srt=_N_;
run;
proc sort data=have; by city indicator options count; run;
data want;
set have;
by city indicator options;
length nsupp supp_count 3;
if first.options then nsupp=0;
supp_count=count;
if count<5 or nsupp=1 then do;
supp_count=.;
nsupp+1;
end;
drop nsupp;
run;
proc sort data=want; by srt; run;
proc print data=want; run;
...would definitely test thoroughly, though.
Thank you so much for responding! This is really helpful and works for the case I provided and several others.
Yes my data is more complex/larger than the example I gave, but was just showing a snippet. It looks like your code worked for several scenarios, but I found an issue in one setting - see below.
Below is the output:
year | city | indicator | indicator_options | count | count_supp | count_supp_new |
2017 | London | job site | Building A | 3 | . | . |
2019 | London | job site | Building A | 9 | 9 | . |
2021 | London | job site | Building A | 2 | . | . |
2023 | London | job site | Building A | 2 | . | . |
But this is what I should get- any ideas on how to adjust the code to make this work?
year | city | indicator | indicator_options | count | count_supp | count_supp_new |
2017 | London | job site | Building A | 3 | . | . |
2019 | London | job site | Building A | 9 | 9 | 9 |
2021 | London | job site | Building A | 2 | . | . |
2023 | London | job site | Building A | 2 | . | . |
Count is the original count of the category. count_supp is the the results after the first round of suppression (marking everything <5 as suppressed - easy). Count_supp_new is for the second round, which would then check if there is one value <5 or one value that is already marked as suppressed, it suppresses the next lowest value within the group. It just separates it out into separate columns so I can view the changes for each step to check the results. The only outcome needs to be one final suppressed column.
As I am writing this, I think your code does that. What I need instead I am realizing is a way to not run the secondary suppression if there are already 2 suppressed values present in count_supp. The goal overall of the secondary suppression is that someone looking at the data can't go back and figure out what the one suppressed number is by doing simple math. So for example, if the counts are 2, 10, 11, and 12, and only 2 is suppressed, you can figure out that 2 is the missing number because you know the sum is 35. What I need to do is suppress 2 and 10. But in the case where the values are 2, 3, 11, and 12, both 2 and 3 are suppressed in the first round of suppression, it doesn't matter that I know the sum is 28 because I don't know how the remaining 5 that aren't showing are divided between the 2 suppressed categories.
Hope that makes sense...sorry for the confusion!
Unfortunately the updated code doesn't work. I think the issue is that the code is suppressing the next value, when what it needs to do is suppress the next lowest count_supp value within the group and also if there are already 2 suppressed (or missing = .) values within the group, then it should not do any further suppression.
This worked! Sorry there was an issue with the way I was sorting the real data, but once that was corrected the code functioned. Thanks so much for the time you put into this and the quick response!
Here's a modified version that does that:
data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017 Paris job title Consultant 8
2019 Paris job title Consultant 6
2021 Paris job title Consultant 4
2023 Paris job title Consultant 8
;
run;
data have;
length srt 3;
set have;
srt=_N_;
run;
proc sort data=have; by city indicator options count; run;
data want;
set have;
by city indicator options;
length nsupp count_supp count_supp_new 3;
if first.options then nsupp=0;
count_supp=count;
count_supp_new=count;
if count<5 or nsupp=1 then do;
count_supp_new=.;
if count<5 then count_supp=.;
nsupp+1;
end;
drop nsupp;
run;
proc sort data=want; by srt; run;
proc print data=want; run;
Glad that helped. I have always had trouble figuring out a way to automate this process, as I said, for tables that involve row and column totals -- see below for an example. Here, we've suppressed the count that's below the threshold (4) and the next smallest one (6) to mask it, so the row total for Paris can't be used to back-calculate the 4. However, because of the column totals, we can currently back calculate either of the suppressed numbers, which leaves the question of which numbers to suppress in those columns, which themselves, in turn, need to be masked so that you can't back calculate from those row totals. It gets pretty confusing...
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.