Fraud in the Health Insurance Market

A brief overview of fraud in health insurance and the potential for data science to help.

According to Allied Market Research, the global health insurance market was valued at $1.98 trillion in 2020, and claims cost health insurers upwards of $30 billion that same year.

Furthermore, in a recent survey conducted by the Property Casualty Insurers Association of America and FICO, insurers stated that fraud constitutes between 5 and 10% of claims costs. Applied to the $30 billion figure above, this means that roughly $1.5 billion to $3 billion was spent by health insurers on fraudulent claims in 2020.
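The fraud estimate above is simple arithmetic, and a short sketch makes the range explicit. The function name `fraud_cost_range` is an illustrative assumption; the inputs are the figures quoted in this article.

```python
def fraud_cost_range(claims_cost, low_share, high_share):
    """Return the (low, high) estimate of fraudulent claims cost."""
    return claims_cost * low_share, claims_cost * high_share


# Figures quoted above: $30 billion in claims, 5% to 10% of it fraudulent.
low, high = fraud_cost_range(30e9, 0.05, 0.10)
print(f"${low / 1e9:.1f} billion to ${high / 1e9:.1f} billion")
# prints "$1.5 billion to $3.0 billion"
```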

According to Johns Hopkins HealthCare (JHHC), in the health insurance context, fraud is defined as any deliberate and dishonest act committed with the knowledge that it could result in an unauthorized benefit to the person committing the act or someone else who is similarly not entitled to the benefit. Some examples of healthcare fraud that can be committed by healthcare service providers are:

  • Misrepresentation of the type or level of service provided
  • Misrepresentation of the individual rendering service
  • Billing for items and services that have not been rendered
  • Billing for items and services that are not medically necessary
  • Seeking increased payment or reimbursement for services that are correctly billed at a lower rate

Coding of Healthcare Data

The World Health Organisation (WHO) developed and maintains a standardised diagnosis coding system called the International Classification of Diseases (ICD). These codes identify a patient’s health condition or diagnosis, and a series of these codes is documented by the healthcare service provider whenever a patient visits a physician or is admitted to hospital.

Additionally, the Current Procedural Terminology (CPT) coding system is used to describe tests, surgeries, evaluations, and any other medical procedure performed by a healthcare provider on a patient. This coding system is published and maintained by the American Medical Association (AMA).

CPT codes are an integral part of the healthcare provider’s billing process. CPT codes tell the health insurer what procedures were provided by the healthcare provider to the patient, and hence indicate what procedures the provider would like to be reimbursed for. CPT codes work together with ICD codes to create a full picture of the medical process for the insurer. Using a combination of ICD and CPT codes, the insurer can understand the symptoms that the patient arrived with (as represented by the ICD code) and the subsequent procedures performed (represented by the CPT code).

These codes were developed to make sure that there is a consistent and reliable way for health insurance companies to process claims from healthcare providers and pay for health services. That is, ICD codes are often used in combination with CPT codes to make sure that the health condition and the services match. For example, if the diagnosis is bronchitis and the doctor orders an ankle x-ray, it is likely that the x-ray will not be paid for as it is not related to bronchitis. These coding systems can therefore go a long way in preventing fraudulent claims.
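The bronchitis and ankle x-ray example can be sketched as a simple lookup check. This is a minimal illustration, not how any particular insurer processes claims: the `PLAUSIBLE_PROCEDURES` table and the function `unsupported_procedures` are hypothetical, and the specific codes are included for illustration only and should be verified against the official ICD and CPT code sets.

```python
# Hypothetical lookup table: for each diagnosis (ICD) code, the procedure
# (CPT) codes an insurer might treat as plausibly related. The codes are
# illustrative and should be checked against the official code sets.
PLAUSIBLE_PROCEDURES = {
    "J40": {"71046"},       # bronchitis -> chest x-ray
    "S93.401A": {"73600"},  # ankle sprain -> ankle x-ray
}


def unsupported_procedures(diagnosis_codes, procedure_codes,
                           plausible=PLAUSIBLE_PROCEDURES):
    """Return the procedure codes not supported by any diagnosis on the claim."""
    supported = set()
    for dx in diagnosis_codes:
        supported |= plausible.get(dx, set())
    return [cpt for cpt in procedure_codes if cpt not in supported]


# A claim with a bronchitis diagnosis, a chest x-ray and an ankle x-ray.
claim = {"diagnoses": ["J40"], "procedures": ["71046", "73600"]}
flagged = unsupported_procedures(claim["diagnoses"], claim["procedures"])
print(flagged)  # ['73600'] -> the ankle x-ray does not match the diagnosis
```

A flagged procedure is not proof of fraud; in a sketch like this it would only mark the claim for closer review.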

Using Computer Assisted Coding (CAC) to Assist with ICD and CPT Codes

Many countries around the world have legislated that all healthcare providers submit a comprehensive list of diagnosis and procedure codes to the health insurer whenever a patient is treated. There are over 70 000 diagnosis codes and over 69 000 procedure codes in existence. A huge challenge is that valuable time that healthcare practitioners could use to treat patients is being used to locate and document all the ICD and CPT codes. Providers have started to turn to using computer assisted coding (CAC) to supplement their coding work.

In this context, CAC is a piece of software that draws information from clinical documentation and assigns relevant ICD or CPT codes to that data. The program uses natural language processing (NLP) to analyse the plain-text documentation and determine whether a particular medical reference requires a particular ICD code, a series of ICD codes or none at all.

For example, a patient’s chart containing the word “hypertension” may require an ICD code if the health provider is making a diagnosis or providing services based on that condition. However, “family history of hypertension” does not. The NLP algorithm can determine which is which, then select a particular code, or series of codes, that may be applicable.
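To make the hypertension example concrete, here is a deliberately simplified, rule-based sketch. It is a stand-in for the idea only: real CAC software relies on far more capable NLP models, and the names `TERM_TO_ICD`, `EXCLUSION_CUES` and `suggest_codes` are assumptions made for this illustration. The ICD code shown is illustrative and should be verified against the official code set.

```python
import re

# Hypothetical keyword-to-code table (illustrative only).
TERM_TO_ICD = {"hypertension": "I10"}

# Phrases that suggest the term does not describe the patient's own
# current condition. This list is an assumption, not a clinical standard.
EXCLUSION_CUES = ("family history of", "no evidence of", "denies")


def suggest_codes(note):
    """Suggest ICD codes for terms that are not preceded by an exclusion cue."""
    codes = []
    # Check each sentence separately so a cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for term, code in TERM_TO_ICD.items():
            match = re.search(rf"\b{re.escape(term)}\b", sentence)
            if match is None:
                continue
            preceding = sentence[:match.start()]
            if any(cue in preceding for cue in EXCLUSION_CUES):
                continue
            if code not in codes:
                codes.append(code)
    return codes


print(suggest_codes("Patient presents with hypertension."))  # ['I10']
print(suggest_codes("Family history of hypertension."))      # []
```

Simple cue lists like this miss many phrasings, which is one reason human review of suggested codes remains important.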

For a beginner’s guide to NLP, see my previous article on the topic.


Because NLP algorithms may still be considered in their infancy, a degree of human intervention may still be required. However, as these algorithms become increasingly reliable and better at allowing for subtle nuances, health providers can start to spend less time scanning complex medical documents to identify codes, and more time assisting their patients.
