This article explains in more detail how we eliminate false positives.
We use four main methods to eliminate false positives:
- MOD10 verification
This is the most basic check. A random sequence of digits has only about a 1 in 10 chance of passing the MOD10 (Luhn) checksum, so this check alone eliminates a large number of false positives, such as random sequences of digits that closely resemble a payment card number.
To our knowledge, the only PANs that do not conform to MOD10 are older PANs that are no longer in active circulation.
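As a rough illustration of the MOD10 check, the sketch below implements the standard Luhn algorithm in Python. The function name `luhn_valid` is hypothetical and is not taken from our scanning engine; it only shows the arithmetic involved.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the MOD10 (Luhn) check.

    Hypothetical helper for illustration; separators such as spaces or
    dashes are ignored.
    """
    digits = [int(ch) for ch in candidate if ch.isascii() and ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                # Equivalent to adding the two digits of the product.
                digit -= 9
        total += digit
    return total % 10 == 0
```

For instance, the well-known test number 4111 1111 1111 1111 passes this check, while changing its last digit to 2 makes it fail.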
- Length/Prefix checks
Our scanning engine applies a series of common-sense length and prefix checks, using the published card scheme prefixes such as '4' for VISA, '34' and '37' for American Express, and so on.
We expect this check to eliminate 4 out of 5 candidate matches.
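A length and prefix check of this kind could be sketched as follows. The rule table `SCHEME_RULES` and the function `identify_scheme` are hypothetical, and the table is deliberately incomplete: it lists only a few widely published prefixes and lengths, not the full set a real scanner would carry.

```python
import re
from typing import Optional

# Illustrative, incomplete rule table: (scheme name, prefix pattern, valid lengths).
SCHEME_RULES = [
    ("VISA", re.compile(r"^4"), {13, 16, 19}),
    ("American Express", re.compile(r"^3[47]"), {15}),
    ("Mastercard", re.compile(r"^5[1-5]"), {16}),
]


def identify_scheme(pan: str) -> Optional[str]:
    """Return the first scheme whose prefix and length rules match, else None."""
    if not (pan.isascii() and pan.isdigit()):
        return None
    for name, prefix, lengths in SCHEME_RULES:
        if len(pan) in lengths and prefix.match(pan):
            return name
    return None
```

A candidate that returns `None` here would be discarded before any further analysis, which is why a cheap check like this can remove so many matches early.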
- Native format decoding
Our decoding engine natively understands a vast number of common file formats in which sensitive data is found, such as documents, emails, and structured and unstructured data (XML, CSV, etc.).
This also includes complex file types such as email storage formats (e.g. Microsoft Outlook PST and OST) and database formats (e.g. Microsoft Access, Microsoft SQL Server MDF/LDF).
By decoding each file format and understanding its native underlying data structure, the engine identifies sensitive information in its clear, unobstructed form, which dramatically reduces the likelihood of false positives.
The decoding engine never skips a file, even if the format is unrecognized.
It still attempts to scan an unrecognized file accurately by using a fall-back decoding method that we refer to as generic binary decoding.
This method is only ever used as a last resort. It enables the decoding engine to strip out the binary data that normally causes high false positive levels and to scan only the remaining clear text.
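This article does not describe how generic binary decoding is implemented. One simple way to approximate the idea, assumed here purely for illustration, is to keep only runs of printable ASCII characters above a minimum length, much like the Unix `strings` utility. The function name `extract_text_runs` and the default `min_run` value are hypothetical.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_run: int = 6) -> List[str]:
    """Return runs of at least `min_run` printable ASCII bytes found in `data`.

    Assumed approach for illustration only; shorter runs and all
    non-printable bytes are discarded.
    """
    # Build a bytes pattern such as [\x20-\x7e]{6,}
    pattern = re.compile(
        rb"[\x20-\x7e]{" + str(min_run).encode("ascii") + rb",}"
    )
    return [match.group().decode("ascii") for match in pattern.finditer(data)]
```

Only the returned text runs would then be passed on to the matching stage, so digit-like byte patterns inside binary payloads never reach it.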
- Contextual data & statistical analysis
This is a key differentiator that enables the scanning engine to refine its results further and provides a final layer of false positive scrubbing.
In the initial development stages of the underlying scanning engine, our engineering team spent considerable time analyzing sets of both genuine matches and false positives in order to determine the characteristics of each. They found that the majority of false positive data falls into very specific contextual patterns.
For example, a common false positive scenario is uncompressed bitmap data (images, sound, icons, etc.). Even if the file type cannot be identified, bitmap data can be identified with near certainty by examining approximately 300 bytes of data before and after the match and applying a series of algorithms to determine its true context.
Similar methods can be used for other data types, for example to determine whether a 16-digit string is really a PAN or a website cookie token.
Analysis of the overall findings for a given file also provides further verification of the likelihood of any individual finding.
Lastly, if our scanning engine is ever in doubt, it reports the match as a PAN and lets the user determine the accuracy of the match.
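The algorithms used for contextual analysis are not disclosed in this article. The sketch below shows one assumed heuristic only: it takes roughly 300 bytes on each side of a match and measures how much of that window is non-printable, treating a high ratio as a sign of binary context. The names `context_window`, `non_printable_ratio`, and `looks_like_binary_context`, and the `threshold` value, are all hypothetical.

```python
def context_window(data: bytes, start: int, end: int, radius: int = 300) -> bytes:
    """Return up to `radius` bytes before and after the match at data[start:end]."""
    return data[max(0, start - radius):start] + data[end:end + radius]


def non_printable_ratio(window: bytes) -> float:
    """Fraction of bytes that are neither printable ASCII nor tab/newline/CR."""
    if not window:
        return 0.0
    non_printable = sum(
        1 for byte in window
        if not (32 <= byte <= 126 or byte in (9, 10, 13))
    )
    return non_printable / len(window)


def looks_like_binary_context(
    data: bytes, start: int, end: int, radius: int = 300, threshold: float = 0.3
) -> bool:
    """Assumed heuristic: flag a match whose surroundings are mostly binary."""
    window = context_window(data, start, end, radius)
    return non_printable_ratio(window) > threshold
```

In keeping with the rule above, a heuristic like this would only suppress a match when the evidence is strong; a borderline result would still be reported as a PAN.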
All information in this article is accurate and true as of the last edited date.