
Five Ways AI Can Help Secure Personal Information in Your Data

Updated: Jun 3, 2023



Major Data Breaches are on the Rise

A recent ABC News headline reported that data breaches affecting millions of Australians are on the rise. Indeed, according to the 2022 Notifiable Data Breaches Report published by the Office of the Australian Information Commissioner (OAIC), there was a 67% increase in large data breaches between the first and second half of the year. Securing personal information is no longer just the job of IT security managers; it now appears on the board meeting agendas of top Australian enterprises.


A good reference for organisations planning personal data security is the guide to securing personal information published by the OAIC, which outlines a comprehensive yet practical data protection framework consistent with the Australian Privacy Principles (APPs). Central to the OAIC guide is the concept of the personal information lifecycle, which defines protective measures spanning the life of personal data from collection through to destruction.

Implementing each phase of the personal information lifecycle requires an organisation to command a good knowledge of the personal information embedded in its data. For example, to assess the impact of a data breach or to de-identify personal data, an organisation needs to know all the instances and locations of government IDs such as TFNs and Medicare numbers across its data assets, such as databases and repositories. Large organisations typically run thousands of applications that rely on tens of thousands of tables, hundreds of thousands of columns and hundreds of millions of rows of data. Uncovering personal information at this scale is a very challenging task.


Clearly, we need an automated discovery process to systematically discover and catalogue privacy information across the entire set of data assets managed by an organisation. Traditional techniques such as regular expressions (regex, the standard for textual pattern matching) are very limited, because personal information exhibits far too many variations to be covered by a handful of rules and patterns. Nor can rule- or pattern-based approaches make good use of contextual information in decision making. AI-assisted models trained over large amounts of data (e.g. tens of millions of records) promise the right solution because they can discover privacy information in large datasets efficiently and at scale, with an accuracy and versatility comparable to human data privacy specialists.
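
To see why pattern matching alone falls short, here is a minimal illustration in Python (the TFN pattern and the example strings are illustrative simplifications):

```python
import re

# A naive pattern for Australian TFNs: nine digits, optionally spaced.
# This format is an illustrative simplification.
TFN_PATTERN = re.compile(r"\b\d{3}\s?\d{3}\s?\d{3}\b")

texts = [
    "Employee TFN: 123 456 782",            # a genuine TFN mention
    "Invoice number 123 456 782 was paid",  # same digits, not a TFN
]

for text in texts:
    print(TFN_PATTERN.findall(text))

# The pattern fires on both lines. Without using the context ("TFN:" versus
# "Invoice number"), a regex cannot tell a tax file number from an invoice
# number, which is exactly where trained models have the advantage.
```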


Here, then, are the top five ways an AI-assisted privacy discovery tool can help secure the personal information in your data.


Detecting a wide range of PII and PSI classes

The fundamental ability of AI-assisted privacy discovery tools is to detect the classes of Personally Identifiable Information (PII) and Personal Sensitive Information (PSI) defined by the OAIC. Such tools should detect common PII instances such as names, addresses and emails, as well as well-known government IDs such as Medicare numbers, TFNs/TINs and SSNs. Additionally, the tool should be able to detect personal sensitive information such as age, health and credit information, and criminal history. For instance:

[credit information]
David has a home equity loan from Challenger Bank Limited with available credit of more than the house value in his name and is looking to refinance it into another mortgage product that will help him pay off debts faster without any additional monthly payments.

[criminal record]
Tom Jones was charged with 2 counts of dangerous driving with malicious intent and auto theft. He is due to face court next week.

While many data protection platforms (e.g. Amazon Macie and Microsoft Priva) can detect a long list of PII classes out of the box, detecting sensitive information typically requires more sophistication: the tools must understand the context of privacy information and draw upon numerous language patterns when assigning a sensitivity class. For example, a date becomes sensitive when it is a birth date or a diagnosis date. To decide whether a text note mentions a criminal activity, the tool needs to know all types of crimes. Without understanding context and the meaning of text, the tools will report many false positive privacy instances, compromising the accuracy of detection.
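
As a minimal sketch of how context changes a classification, the toy heuristic below labels a date as sensitive only when cue words appear nearby; a real model would learn such cues from training data rather than hard-code them:

```python
import re

DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
# Illustrative context cues; a trained model would learn these, not hard-code them.
BIRTH_CUES = ("born", "dob", "date of birth")
DIAGNOSIS_CUES = ("diagnosed", "diagnosis")

def classify_dates(text):
    """Label each date according to cue words in the 20 characters before it."""
    results = []
    for match in DATE.finditer(text):
        window = text[max(0, match.start() - 20):match.start()].lower()
        if any(cue in window for cue in BIRTH_CUES):
            label = "SENSITIVE: date of birth"
        elif any(cue in window for cue in DIAGNOSIS_CUES):
            label = "SENSITIVE: diagnosis date"
        else:
            label = "non-sensitive date"
        results.append((match.group(), label))
    return results

print(classify_dates(
    "Tom was born on 2/5/1980 and diagnosed on 12/3/2021; review on 1/4/2022."
))
```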


Scanning a variety of data objects

Privacy information resides in a variety of data objects, and cyber attacks can target any type of data object. Many tools simply offer the capability of detecting PII in text, leaving application developers to decide and program the logic for scanning each kind of data object before applying the algorithm. Advanced discovery tools can scan various types of data to uncover the privacy information unique to each data object. This process, known as "deep discovery," yields more valuable and comprehensive information to enhance data protection.


Example of discovering PI in tables

Given a contrived HR table containing, for example, employee names, TFNs, emails and salaries,

the AI-assisted tool can perform the following discovery (a code sketch follows the list):

  • The table as a whole is classified as “high risk”, as it contains highly identifying personal information such as TFNs and sensitive salary information. The table-level classification helps data governance teams tighten access rights to the table.

  • A table topic can be calculated and assigned to the table. The topic indicates the business intent or personal information content of the table and can be used for matching and clustering tables with similar topics.

  • Columns containing personal information are detected and classified. Column-level classification enables dynamic de-identification or redaction of data retrieved from specific columns.

  • Table rows, or note fields within rows, are scanned to detect PII and PSI, enabling fine-grained de-identification and redaction. For example, a customer support record containing sensitive information should be de-identified when it is exported.
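
A minimal sketch of the column-level step, using pandas and simple regex detectors as stand-ins for a trained model (the patterns, threshold and sample table are illustrative):

```python
import pandas as pd

# Toy value patterns standing in for a trained AI model; illustrative only.
DETECTORS = {
    "TFN": r"\d{3} \d{3} \d{3}",
    "EMAIL": r"[^@\s]+@[^@\s]+\.[^@\s]+",
}

def classify_column(values, threshold=0.8):
    """Assign a PI class if most sampled values match one detector."""
    sample = values.dropna().astype(str).head(100)
    for label, pattern in DETECTORS.items():
        if len(sample) and sample.str.fullmatch(pattern).mean() >= threshold:
            return label
    return None

hr = pd.DataFrame({
    "name": ["Alice Wu", "Bob Singh"],
    "tfn": ["123 456 782", "876 543 210"],
    "email": ["alice@example.com", "bob@example.com"],
    "salary": [90000, 105000],
})

print({col: classify_column(hr[col]) for col in hr.columns})
# {'name': None, 'tfn': 'TFN', 'email': 'EMAIL', 'salary': None}
# A table-level risk rating can then be rolled up from the column classes;
# a real tool would also catch the name column via NER rather than patterns.
```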

Example of discovering PI in APIs

AI-assisted tools can analyse samples of API payloads to detect data paths and values containing privacy information. APIs often aggregate personal information from multiple data sources and therefore it is important to evaluate the privacy information exposed through APIs.
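
A minimal sketch of scanning a JSON payload for data paths that carry privacy information; the value patterns and the payload are illustrative:

```python
import re

# Illustrative value patterns; a production tool would use trained models.
PII_PATTERNS = {
    "MEDICARE": r"\d{4} \d{5} \d",
    "EMAIL": r"[^@\s]+@[^@\s]+\.[^@\s]+",
}

def scan_payload(node, path="$"):
    """Recursively walk a JSON payload, yielding (data path, PI class) pairs."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from scan_payload(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from scan_payload(value, f"{path}[{index}]")
    elif isinstance(node, str):
        for label, pattern in PII_PATTERNS.items():
            if re.fullmatch(pattern, node):
                yield path, label

payload = {
    "customer": {"email": "alice@example.com", "medicare": "2123 45670 1"},
    "orders": [{"id": "A-1"}],
}
print(list(scan_payload(payload)))
# [('$.customer.email', 'EMAIL'), ('$.customer.medicare', 'MEDICARE')]
```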


Example of discovering PI in scanned images

AI-assisted tools are also able to scan unstructured data objects, such as images and PDF documents. For example, given a scanned identity document, the tool can detect:

  • It is an image of a driver’s licence.

  • It is a Queensland driver’s licence.

  • The personal information in the licence, e.g. name, address, DoB, licence number.

  • The facial image of a person.

AI tools utilise computer vision to understand layouts, OCR to extract text, NLP to detect privacy information, and facial recognition to detect or verify a customer.
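
As a minimal sketch of the OCR step, assuming the Tesseract engine plus the pytesseract and Pillow packages are installed (licence.png is a hypothetical scan, and the extraction patterns are illustrative):

```python
import re
from PIL import Image
import pytesseract

# OCR the scanned document; "licence.png" is a hypothetical file.
text = pytesseract.image_to_string(Image.open("licence.png"))

# Toy extraction standing in for an NLP model; the licence-number format
# (8-9 digits) is an illustrative assumption.
licence_no = re.search(r"\b\d{8,9}\b", text)
dob = re.search(r"\b\d{2}/\d{2}/\d{4}\b", text)
print("licence number:", licence_no.group() if licence_no else "not found")
print("date of birth:", dob.group() if dob else "not found")
# Layout analysis and face detection would be separate models in a full pipeline.
```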


Cataloguing and analysing privacy information in data

AI tools can work in a stateless discovery mode, in which the discovered information is handed over to other data governance tools and the discovery tool itself does not store or manage it. More sophisticated tools can capture and maintain a catalogue of the discovered information in relation to the containing data objects. Whether the discovered privacy information is maintained by a data governance tool or by the AI discovery tool, the links between discovered information and actual data objects form the basis for analysing and securing personal information in data.


For example, by storing the discovered privacy information as a knowledge graph over data objects, we can query a graph database to return all table columns that contain Medicare numbers. The result can be visualised as a graph in which orange nodes indicate tables, blue nodes indicate columns and grey nodes represent the classified PI classes.

To support the privacy impact assessment (PIA) recommended by the OAIC, we can apply aggregation functions to count the government IDs stored in a dataset, so that we understand how many customers would be affected if the dataset were breached. To carry out privacy-by-design activities, we can apply more sophisticated analytics, such as identifying the centroid of privacy information (i.e. the tables containing the most privacy information) in a schema or in the set of data assets involved in a project. When a dataset is exported or accessed, the privacy information associated with it can advise the middleware whether data in certain columns should be de-identified.
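
A sketch of both queries using the official Neo4j Python driver; the node labels, relationship types and properties (Table, Column, PIClass, HAS_COLUMN, CLASSIFIED_AS, row_count) are a hypothetical schema, not a specific product's model:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# All table columns classified as containing Medicare numbers.
FIND_MEDICARE_COLUMNS = """
MATCH (t:Table)-[:HAS_COLUMN]->(c:Column)-[:CLASSIFIED_AS]->(:PIClass {name: 'MEDICARE'})
RETURN t.name AS table, c.name AS column
"""

# Aggregate counts of government IDs, to estimate breach impact.
COUNT_GOV_IDS = """
MATCH (c:Column)-[:CLASSIFIED_AS]->(p:PIClass)
WHERE p.name IN ['MEDICARE', 'TFN', 'PASSPORT']
RETURN p.name AS id_type, sum(c.row_count) AS instances
"""

with driver.session() as session:
    for record in session.run(FIND_MEDICARE_COLUMNS):
        print(record["table"], record["column"])
    for record in session.run(COUNT_GOV_IDS):
        print(record["id_type"], record["instances"])
driver.close()
```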


Predicting privacy risks, vulnerabilities and retention/disposal

When the discovered privacy information is put into the context of its “data landscape”, made up of metadata about upstream and downstream applications, information flows, the users consuming the data and so on, AI tools can also play a role in assessing risks, predicting data vulnerabilities and recommending data retention and disposal.


For example, we can pull together information such as:

  • privacy information in data objects,

  • their exposures to Internet facing applications,

  • downstream and upstream applications,

  • usage metrics,

  • dependent business processes and applications,

  • dependent regulatory reports.

We can then train a model to predict the risk level of each data object.
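
A minimal sketch of such a risk model with scikit-learn; the features, values and labels below are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features per data object: [n_pii_columns, internet_facing (0/1),
#                            n_downstream_apps, monthly_queries]
X = np.array([
    [12, 1, 8, 50_000],
    [0, 0, 1, 200],
    [3, 1, 2, 5_000],
    [7, 0, 5, 12_000],
])
y = np.array([2, 0, 1, 1])  # risk level: 0 = low, 1 = medium, 2 = high

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[10, 1, 6, 30_000]]))  # predicted risk for a new object
```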


If a data breach is reported, either internally or externally, our immediate reaction is “could this happen to us?” or, more precisely, “do we have similar data assets that could be compromised like this incident?”. AI tools can help detect vulnerable datasets and configurations using similarity-matching algorithms. For example, if a breach incident is described using the VERIS incident description framework, AI tools can search the overarching data landscape, including its privacy information, for patterns matching the breach description.
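
A sketch of this similarity matching using TF-IDF cosine similarity in scikit-learn; the breach and landscape descriptions are illustrative, not real VERIS records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

breach = "web application SQL injection exposing customer emails and medicare numbers"
landscape = [
    "internet-facing web app backed by customer table with emails and medicare numbers",
    "internal batch job aggregating anonymised sales figures",
]

# Rank internal data-landscape descriptions by similarity to the incident.
vectoriser = TfidfVectorizer()
matrix = vectoriser.fit_transform([breach] + landscape)
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for description, score in sorted(zip(landscape, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {description}")
```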


Similarly, by incorporating business processes, dependency information, risk classes and record retention/disposal policies, AI can recommend retention or disposal for specific data records, either using an ML model or a rule engine supporting logical deduction.

Data risk assessment, vulnerability detection and data retention and disposal are often the responsibilities of separate teams and business units in a large organisation. Understandably, pulling information together from different data custodians to train models for risk assessment and vulnerability detection is itself a tremendous hurdle. The bottom line here is that ML models trained over privacy information and other metadata can deliver powerful predictive capabilities for securing personal information.


Customising and fine-tuning underlying ML models

AI tools often ship with multiple out-of-the-box PII detection models that organisations can use straight away. Ready-to-use models greatly reduce the effort organisations would otherwise spend preparing and labelling training datasets, implementing ML algorithms and setting up training infrastructure.


However, data can differ considerably across businesses and industries. A commercial off-the-shelf model may not work well for an organisation, either because it misses some critical privacy classes or because it does not deliver the expected accuracy. In such cases, the AI discovery tool should permit organisations to customise its internal models so that the discovery process achieves the expected accuracy against the organisation’s datasets.


Moreover, an organisation’s datasets may contain unseen data that the AI models do not anticipate even after customisation. When human users identify incorrectly discovered instances, the AI model should incorporate the user-provided corrections to continuously fine-tune itself, so that it won’t make the same mistakes the next time it sees the same data.
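
A minimal sketch of this feedback loop using incremental learning in scikit-learn, where partial_fit stands in for fine-tuning a production model; the example texts and labels are synthetic:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectoriser = HashingVectorizer(n_features=2**16)
model = SGDClassifier(loss="log_loss", random_state=0)
classes = ["PII", "NOT_PII"]

# Initial training batch (synthetic examples).
texts = ["TFN 123 456 782", "invoice 998877", "alice@example.com"]
labels = ["PII", "NOT_PII", "PII"]
model.partial_fit(vectoriser.transform(texts), labels, classes=classes)

# A user flags a false positive; feed the correction back into the model.
correction_text, correction_label = ["order ref 123 456 789"], ["NOT_PII"]
model.partial_fit(vectoriser.transform(correction_text), correction_label)

print(model.predict(vectoriser.transform(["order ref 222 333 444"])))
```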


Key Take-aways

  1. Data breaches affecting millions of Australians are on the rise. We expect more companies and government departments to proactively implement measures to secure personal information in their data assets.

  2. The OAIC outlines a guiding framework for securing personal information, which centres on the personal information lifecycle. Each key activity in the lifecycle should be informed by knowledge of the personal information in an organisation’s data assets.

  3. That knowledge can be automatically discovered and organised by AI-assisted privacy discovery tools, which provide five essential functions for securing personal information:

    1. Detecting a wide range of PII and PSI classes

    2. Scanning a variety of data objects

    3. Cataloguing and analysing privacy information

    4. Predicting privacy risks, vulnerabilities and retention/disposal

    5. Customising and fine-tuning underlying ML models

If you are interested in learning more about AI-assisted privacy information discovery and Meaningware's Privacy Information Notes (PIN), you can download our whitepaper here.




