How to deterministically mask a field with sensitive data in a view

I want to give a third party access to our data and plan to provide them with a read-only view on the data that masks or removes all sensitive fields. What makes this tricky is that I have two collections that are connected with each other via an “id” field, which also contains sensitive data.

I can’t remove the “id” field because it is needed to combine the two collections with each other.

My current idea is to use something like md5_hash to deterministically mask the “id” field. As I understand it, this would need to be a custom function in the aggregation pipeline which would be quite slow (if possible at all).

Is there a better approach to deterministic masking in a view?

Hi @Erik_Weitnauer,
I’ll preface this by saying that I’ve never used it, but this looks like it might be useful to you from a quick reading:

Regards

How about redact?

Actually, do you want a 1:1 mapping that masks the data with something you can tie back later?

Hi Fabio, this looks very interesting! I don’t think it quite applies to my situation though, since I’d only want to use the encryption inside the view, which doesn’t seem possible.

Yes, that’s it exactly! I want a 1:1 mapping that hides the personal information, but still allows me to tie things together.

For example, I might use student IDs in several collections and I want to mask them, but still need them to be consistent (i.e., the same ID should lead to the same masked value, and different IDs should lead to different masked values). Applying a hashing function would work for that.
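A minimal sketch of that property in Python, run outside the database rather than inside a view. It swaps plain MD5 for a keyed hash (HMAC-SHA256) so the masked values cannot be reproduced without the key. `SECRET_KEY`, `mask_id`, and the sample documents are hypothetical names and data made up for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; assumed to be stored somewhere the third party cannot read.
SECRET_KEY = b"replace-with-a-real-secret"


def mask_id(student_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed hash of student_id as a hex string."""
    return hmac.new(key, student_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up sample documents from two collections that share the same ID.
students = [{"id": "S1001", "name": "Ada"}, {"id": "S1002", "name": "Grace"}]
grades = [{"student_id": "S1001", "grade": "A"}, {"student_id": "S1002", "grade": "B"}]

# Mask the ID in both collections and drop the other sensitive field (name).
masked_students = [{"id": mask_id(s["id"])} for s in students]
masked_grades = [
    {"student_id": mask_id(g["student_id"]), "grade": g["grade"]} for g in grades
]
```

Because the same key and the same input always give the same output, the masked `id` in one collection still matches the masked `student_id` in the other, so the two can be joined without exposing the original value.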

My question is whether there is an efficient way to do that as part of a view that I define.

Why do you need to mask the StudentID? Is it their actual student ID? Could you use an alternative value to link the collections, for example by generating a new ID? Then it would not matter if it’s visible.
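One way the "generate a new ID" idea could look, sketched in Python. This assumes the mapping from real IDs to surrogate IDs is built once and kept private; `build_surrogate_map` and the sample documents are hypothetical names and data made up for illustration.

```python
import uuid


def build_surrogate_map(real_ids):
    """Assign each distinct real ID a random surrogate ID (kept private)."""
    mapping = {}
    for real_id in real_ids:
        if real_id not in mapping:
            mapping[real_id] = uuid.uuid4().hex
    return mapping


# Made-up sample documents from two collections that share the same ID.
students = [{"id": "S1001"}, {"id": "S1002"}]
grades = [{"student_id": "S1001", "grade": "A"}, {"student_id": "S1002", "grade": "B"}]

surrogates = build_surrogate_map(s["id"] for s in students)

# Store the surrogate next to the real ID; only the surrogate would be exposed.
for s in students:
    s["public_id"] = surrogates[s["id"]]
for g in grades:
    g["public_student_id"] = surrogates[g["student_id"]]
```

Unlike a hash, a random surrogate carries no information about the original ID, but the mapping has to be stored and applied to every collection that uses the ID.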

We did try to do something like this recently: we wanted to mask something like locations while keeping a 1:1 mapping. In the end we just pulled that data from the output completely instead of writing a mapping routine, so that is not much help for you, I’m afraid!

I know we have other systems in-house that do this on the IBM systems when refreshing development environments, so that only key people can actually see the information, but that’s done via logic in stored procedures or COBOL.

An MD5 hash would leave you open to the possibility of a hash collision anyway, I believe (granted, it’s a small chance!). Could you add a new field to the system that does not contain private information?
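To put a rough number on "small chance": the standard birthday approximation for accidental collisions, assuming MD5 outputs behave like uniformly random 128-bit values and taking n = 10^6 IDs purely as an illustrative figure, gives:

```latex
p_{\text{collision}} \approx \frac{n(n-1)}{2 \cdot 2^{128}}
\approx \frac{(10^{6})^{2}}{2^{129}}
\approx 1.5 \times 10^{-27}
```

This only covers accidental collisions between ordinary IDs. Deliberately constructed MD5 collisions are a separate, known weakness of the algorithm.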

Sorry I can’t be of more help. There is a similar thread here: