Solved: Issue with secondary/complementary suppression

ciaohermosa · Posted 04-08-2025 06:04 PM

I have a dataset with variables for year, city, indicator, and indicator_options. I have aggregated them together to get the count of all those for indicator_options within the indicator, city and year. I am trying to suppress data that is between 1-5 and then also considering secondary/complementary suppression so that if there is only one missing value, the next lowest value is suppressed. I have been able to do the first round of suppression but am getting stuck with the secondary suppression.

I asked ChatGPT, which gave me a pretty good solution:

DATA trends;
    SET trends;
    BY options indicator city year;

    /* Step 1: Initialize variables for tracking */
    RETAIN missing_count next_min_value;

    /* Initialize for the first record in each group */
    IF first.year THEN DO;
        missing_count = 0;
        next_min_value = .;
    END;

    /* Step 2: Count the missing values in the group */
    IF count_supp = . THEN missing_count + 1;

    /* Step 3: Find the next lowest non-missing value (next smallest value) */
    IF count_supp NE . THEN DO;
        IF next_min_value = . OR count_supp < next_min_value THEN next_min_value = count_supp;
    END;

    /* Step 4: Modify the next smallest non-missing value if exactly one missing value exists in the group */
    IF last.year AND missing_count = 1 THEN DO;
        IF count_supp = next_min_value THEN count_supp = .; /* Set the next smallest non-missing value to missing */
    END;

    /* Step 5: Output the modified rows */
    count_supp_new = count_supp; /* Create the new column for modified values */
    OUTPUT;

    /* Drop temporary variables */
    DROP missing_count next_min_value;
RUN;

However, the problem arises with Step 4 which seems to just be looking at the last row, not the next lowest value within the group. I can't figure out how to modify the code so that it considers the whole group.

Here is the output with the code:

year	city	indicator	options	count	count_supp	count_supp_new
2017	Paris	job title	Consultant	8	8	8
2019	Paris	job title	Consultant	6	6	6
2021	Paris	job title	Consultant	4	.	.
2023	Paris	job title	Consultant	8	8	.

Here is what it should look like with the next lowest value marked as missing/suppressed.

year	city	indicator	options	count	count_supp	count_supp_new
2017	Paris	job title	Consultant	8	8	8
2019	Paris	job title	Consultant	6	6	.
2021	Paris	job title	Consultant	4	.	.
2023	Paris	job title	Consultant	8	8	8

quickbluefish · Posted 04-09-2025 10:14 AM

Here's a modified version that does that:

data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017	Paris	job title	Consultant	8
2019	Paris	job title	Consultant	6
2021	Paris	job title	Consultant	4
2023	Paris	job title	Consultant	8
;
run;

data have;
length srt 3;
set have;
srt=_N_;
run;

proc sort data=have; by city indicator options count; run;

data want;
set have;
by city indicator options;
length nsupp count_supp count_supp_new 3;
if first.options then nsupp=0;
count_supp=count;
count_supp_new=count;
if count<5 or nsupp=1 then do;
	count_supp_new=.;
	if count<5 then count_supp=.;
	nsupp+1;
end;
drop nsupp;
run;

proc sort data=want; by srt; run;

proc print data=want; run;

View solution in original post

quickbluefish · Posted 04-08-2025 06:23 PM

I'm not sure if your data are more complicated than what you've shown here, but if not, then this seems to work - suppresses counts <5 and will suppress the next largest if the number suppressed has not reached two yet:

data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017	Paris	job title	Consultant	8
2019	Paris	job title	Consultant	6
2021	Paris	job title	Consultant	4
2023	Paris	job title	Consultant	8
;
run;

data have;
length srt 3;
set have;
srt=_N_;
run;

proc sort data=have; by city indicator options count; run;

data want;
set have;
by city indicator options;
length nsupp supp_count 3;
if first.options then nsupp=0;
supp_count=count;
if count<5 or nsupp=1 then do;
	supp_count=.;
	nsupp+1;
end;
drop nsupp;
run;

proc sort data=want; by srt; run;

proc print data=want; run;

...would definitely test thoroughly, though.

ciaohermosa · Posted 04-09-2025 09:56 AM

Thank you so much for responding! This is really helpful and works for the case I provided and several others.

Yes my data is more complex/larger than the example I gave, but was just showing a snippet. It looks like your code worked for several scenarios, but I found an issue in one setting - see below.

Below is the output:

year	city	indicator	indicator_options	count	count_supp	count_supp_new
2017	London	job site	Building A	3	.	.
2019	London	job site	Building A	9	9	.
2021	London	job site	Building A	2	.	.
2023	London	job site	Building A	2	.	.

But this is what I should get- any ideas on how to adjust the code to make this work?

year	city	indicator	indicator_options	count	count_supp	count_supp_new
2017	London	job site	Building A	3	.	.
2019	London	job site	Building A	9	9	9
2021	London	job site	Building A	2	.	.
2023	London	job site	Building A	2	.	.

quickbluefish · Posted 04-09-2025 10:09 AM

I might not be understanding. What's the difference between 'count_supp' and 'count_supp_new'? Is the former supposed to contain everything >=5 and the latter supposed to contain everything >=5 that was also not removed for complementary reasons?

ciaohermosa · Posted 04-09-2025 10:25 AM

Count is the original count of the category. count_supp is the the results after the first round of suppression (marking everything <5 as suppressed - easy). Count_supp_new is for the second round, which would then check if there is one value <5 or one value that is already marked as suppressed, it suppresses the next lowest value within the group. It just separates it out into separate columns so I can view the changes for each step to check the results. The only outcome needs to be one final suppressed column.

As I am writing this, I think your code does that. What I need instead I am realizing is a way to not run the secondary suppression if there are already 2 suppressed values present in count_supp. The goal overall of the secondary suppression is that someone looking at the data can't go back and figure out what the one suppressed number is by doing simple math. So for example, if the counts are 2, 10, 11, and 12, and only 2 is suppressed, you can figure out that 2 is the missing number because you know the sum is 35. What I need to do is suppress 2 and 10. But in the case where the values are 2, 3, 11, and 12, both 2 and 3 are suppressed in the first round of suppression, it doesn't matter that I know the sum is 28 because I don't know how the remaining 5 that aren't showing are divided between the 2 suppressed categories.

Hope that makes sense...sorry for the confusion!

quickbluefish · Posted 04-09-2025 10:33 AM

Yes, does the updated code do what you're looking for? I'm very familiar with the concept as it's part of my daily work. It can definitely get a lot trickier when the table you create with these data also have both row and column totals in addition to the individual counts.

ciaohermosa · Posted 04-09-2025 10:39 AM

Unfortunately the updated code doesn't work. I think the issue is that the code is suppressing the next value, when what it needs to do is suppress the next lowest count_supp value within the group and also if there are already 2 suppressed (or missing = .) values within the group, then it should not do any further suppression.

quickbluefish · Posted 04-09-2025 10:44 AM

OK, what are the raw numbers for which it's not working? It seems to be doing what you're asking for in both cases above (with raw counts of [3, 9, 2, 2] or [8,6,4,8]).

ciaohermosa · Posted 04-10-2025 09:00 AM

This worked! Sorry there was an issue with the way I was sorting the real data, but once that was corrected the code functioned. Thanks so much for the time you put into this and the quick response!

quickbluefish · Posted 04-09-2025 10:14 AM

Here's a modified version that does that:

data have;
infile cards dsd truncover firstobs=1 dlm='09'x;
length year 3 city indicator options $30 count 3;
input year city indicator options count;
cards;
2017	Paris	job title	Consultant	8
2019	Paris	job title	Consultant	6
2021	Paris	job title	Consultant	4
2023	Paris	job title	Consultant	8
;
run;

data have;
length srt 3;
set have;
srt=_N_;
run;

proc sort data=have; by city indicator options count; run;

data want;
set have;
by city indicator options;
length nsupp count_supp count_supp_new 3;
if first.options then nsupp=0;
count_supp=count;
count_supp_new=count;
if count<5 or nsupp=1 then do;
	count_supp_new=.;
	if count<5 then count_supp=.;
	nsupp+1;
end;
drop nsupp;
run;

proc sort data=want; by srt; run;

proc print data=want; run;

quickbluefish · Posted 04-10-2025 09:15 AM

Glad that helped. I have always had trouble figuring out a way to automate this process, as I said, for tables that involve row and column totals -- see below for an example. Here, we've suppressed the count that's below the threshold (4) and the next smallest one (6) to mask it, so the row total for Paris can't be used to back-calculate the 4. However, because of the column totals, we can currently back calculate either of the suppressed numbers, which leaves the question of which numbers to suppress in those columns, which themselves, in turn, need to be masked so that you can't back calculate from those row totals. It gets pretty confusing...

Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

Re: Issue with secondary/complementary suppression

The 2025 SAS Hackathon Kicks Off on June 11!

SAS Training: Just a Click Away