Hi, I'm from an Australian company looking to license its English-language text data to a third party for research purposes, but both we and they want to ensure it has been thoroughly checked for any PII (personally identifiable information).
After some initial processing of the raw data, there are about 1.2M distinct "words", of which 100k exactly match dictionary words; these words are used about 400M times across about 20M messages.
Analysis so far shows that the most common 1% of words cover 95% of word uses, but we need to get to a point where, reading through the redacted messages, we rarely see redactions, and when we look up what was redacted it is clear it was redacted for a sensible reason (or at least that it contains no useful information).
That means coverage in the high 99.9s, which means a lot of words need to be processed, and it isn't going to be done by wetware.
I am going to apply a spell checker for a couple of languages and break apart hyphenated words, etc., to get as many words as possible checked off.
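Roughly the kind of pre-filter I have in mind, as a Python sketch (the tiny word list and function name are just illustrative; the real run would load a full dictionary):

```python
# Sketch: a "word" is checked off if every hyphen-separated part of it
# appears in the dictionary word list.
DICTIONARY = {"well", "known", "phone", "street"}  # stand-in for a real word list

def is_checked_off(word: str) -> bool:
    """True if each hyphen-separated part is a known dictionary word."""
    parts = word.lower().split("-")
    return all(part in DICTIONARY for part in parts)

print(is_checked_off("well-known"))  # both parts match the dictionary
print(is_checked_off("xq7zzt"))     # unknown word, stays in the review queue
```

Anything that survives this pass goes into the human categorization round described below.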
Each of the "words" is linked to:
- Individual messages
- User groups
The idea is that we get a couple of people from 2-3 freelance sites and give each a random, partly overlapping subset of the word set, plus a set of categories to organize the words by, for example:
- English dictionary
- English typo / abbreviation
- Word relating to PII (phone, wife, street)
- Word commonly found in public datasets
- Verb / Noun / Adjective
- Possible individual name
- Possible website / e-mail / phone number fragment
- Likely username / user handle
- Country / large location name
- Street / suburb / small location name
- Long garbled text (no information)
- Long garbled text (suspicious)
- Short garbled text (no information)
- Short garbled text (suspicious)
- Word fragment
- Foreign text
You can give these confidence scores from 0-100 rather than a firm category, and decide how you would like to approach the problem most efficiently.
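For concreteness, one possible shape for the results (the record format and category keys here are just a suggestion, not a fixed spec):

```python
# Sketch: one record per word, with a 0-100 confidence per category
# rather than a single hard label.
def label(word: str, scores: dict) -> dict:
    """Bundle a word with its per-category confidence scores (0-100)."""
    assert all(0 <= s <= 100 for s in scores.values()), "scores must be 0-100"
    return {"word": word, "scores": scores}

row = label("smithfield", {
    "english_dictionary": 10,
    "street_suburb_small_location": 80,
    "possible_individual_name": 40,
})
```

If you'd rather deliver a spreadsheet with one column per category, that works too.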
We'll sprinkle in some known words that we expect to be categorized a certain way, and cross-check the overlapping results we get back against other freelancers'.
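A sketch of how the overlap cross-check could work on our side (the tolerance and function are illustrative):

```python
# Sketch: on words two freelancers both scored, count how often their
# per-category confidence scores land within a tolerance of each other.
def agreement(scores_a: dict, scores_b: dict, tol: int = 20) -> float:
    """Fraction of shared categories where the two scores differ by <= tol."""
    shared = scores_a.keys() & scores_b.keys()
    if not shared:
        return 0.0
    close = sum(abs(scores_a[c] - scores_b[c]) <= tol for c in shared)
    return close / len(shared)

# Two raters agree on "possible_individual_name" but not "english_dictionary".
score = agreement({"possible_individual_name": 90, "english_dictionary": 10},
                  {"possible_individual_name": 80, "english_dictionary": 60})
```

The same check runs against the seeded known words, where the expected scores are fixed in advance.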
It should be pretty clear from the results whether you know what you're doing, and we can start a full job to process the bulk of the data with confidence.
After this we should be able to start removing clearly personal / suspect words and whitelisting clearly non-personal ones.
This will let us safely release semi-redacted full messages, to which we can apply another round of analysis. Once we have a good feel for message categorization, we can go through the message data grouped by (anonymized) users and user groups, allowing a final stage of analysis looking for outliers within users and groups.
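The whitelist-driven redaction pass could look something like this minimal sketch (whitelist, marker, and tokenization are placeholders):

```python
import re

# Sketch: replace any word-like token not on the whitelist with a marker,
# leaving spacing and punctuation intact so messages stay readable.
WHITELIST = {"meet", "me", "at", "the", "park"}

def redact(message: str) -> str:
    def repl(match):
        token = match.group(0)
        return token if token.lower() in WHITELIST else "[REDACTED]"
    return re.sub(r"[A-Za-z0-9']+", repl, message)

redacted = redact("Meet me at Smithfield park")  # only the suburb is masked
```

Reviewing samples of output like this is how we'll judge whether the "rarely see redactions" target from above has been hit.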
At this point the data will be sanitized for our client, and we will be able to have you use the toolsets developed to redact the ongoing monthly updates we will be providing.
Ideally you will have no problem signing an NDA and will be from a country that generally respects IP law.
This isn't any kind of confidential data, but we want our users to be reassured that their already-anonymous data was analyzed for PII at length by select data scientists.
As a first step I'm hoping for ~$300 for you to take a crack at processing ~300k words. I know there is usually more time spent when you start, and the next 900k words should take about as long as the first 300k, but the next stage will increase in volume and complexity if you're the right candidate.