Machine learning offers more efficient tools to smoke out fraudsters, as students from the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) learned during the recent ComputeFest 2018 Student Data Challenge, organized by the Institute for Applied Computational Science (IACS).

During the nine-hour hackathon, student teams put their computational skills to the test in a race against the clock, using machine learning techniques to detect fraudulent insurance claims. Presented with data from more than 18,000 health care providers, the open-ended problem challenged students to devise, test, and then refine the best algorithmic technique to identify fraud.

The winning team, consisting of Xuefeng Peng, M.E., a computational science and engineering student, and T.H. Chan School of Public Health master’s students Yi Ding and Linying Zhang, was able to find fraud with 95.7 percent accuracy. They used an autoencoder, a type of neural network, which learned common patterns in the dataset to decode genuine data points. Since the autoencoder was unable to decode anomalies, it was sensitive to fraudulent claims, Peng explained.
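
To make the idea concrete, here is a minimal sketch of anomaly detection by reconstruction error. It is not the winning team's model: it swaps the neural network for a tiny linear autoencoder with a single hidden unit, written in plain Python, and every name, data point, and threshold in it is a hypothetical choice made for illustration. The model is trained only on "genuine" points that follow a common pattern, so a point that breaks the pattern reconstructs poorly and stands out.

```python
# Toy linear autoencoder with tied weights (an illustrative stand-in,
# not the neural network the team used). All data here is synthetic.

def reconstruct(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))   # encode: project onto w
    return [z * wi for wi in w]                # decode: map back to inputs

def reconstruction_error(w, x):
    x_hat = reconstruct(w, x)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

def train(data, w, lr=0.01, epochs=200):
    """Stochastic gradient descent on the squared reconstruction error."""
    w = list(w)
    for _ in range(epochs):
        for x in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            r = [xi - z * wi for wi, xi in zip(w, x)]    # residual x - x_hat
            rw = sum(ri * wi for ri, wi in zip(r, w))
            # Gradient of ||x - z*w||^2 w.r.t. w is -2 * (rw * x + z * r).
            w = [wi + lr * 2 * (rw * xi + z * ri)
                 for wi, xi, ri in zip(w, x, r)]
    return w

# Hypothetical "genuine" records: the second feature is roughly twice
# the first, with a small alternating perturbation.
genuine = [(-1 + 0.05 * i, 2 * (-1 + 0.05 * i) + 0.01 * (-1) ** i)
           for i in range(41)]

w = train(genuine, w=[0.5, 0.5])

# Assumed rule: flag anything that reconstructs much worse than the
# worst genuine training point.
threshold = 3 * max(reconstruction_error(w, x) for x in genuine)

anomaly = (2.0, -1.0)   # breaks the learned pattern
for point in [(0.3, 0.6), anomaly]:
    err = reconstruction_error(w, point)
    print(point, "flagged" if err > threshold else "ok")
```

In practice the threshold would be tuned against whatever feedback is available; the factor of 3 above is arbitrary.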

Teammates Amil Merchant, A.B., an applied math concentrator, and Kate Zhou, a first-year mechanical engineering Ph.D. candidate, scoured the web for examples of medical fraud, such as overbilling or prescribing too many drugs.

Another team, composed of visiting graduate student Christoph Kurz and Chan School graduate students Hannah James and Anna Zink, tried two approaches in parallel – a linear regression model and a random forest algorithm – to study patterns and distinguish outliers. The students were surprised to find that the simplest technique, linear regression modeling, yielded the best results.

The massive data set included 86 features, such as the percentage of a provider’s patients who suffer from depression or diabetes. Representing those features in a model through linear and nonlinear combinations was a challenge, said Alexander Munoz, A.B., an applied math concentrator.

Each team was able to submit an answer three times per hour, but only received feedback on how accurate their results were collectively. Using their most recent feedback, Munoz and teammates Eshan Tewari, A.B., and statistics Ph.D. candidate Niloy Biswas considered which features to include in the next iteration of their model.

The challenge was designed to teach students some fundamental machine learning techniques, while emphasizing their practical applications, said competition architect Marouan Belhaj, an IACS Fellow.

Many students had never encountered an unsupervised problem before, but those are the types of situations fraud detection agents often face, where billions of dollars and millions of lives are at stake. With so many ways to trick the system, machine learning is an ideal method to detect fraud quickly and precisely, while there is still time to intervene, he said.

“In real-world fraud detection, you rarely get feedback from inside the company about how your model is performing,” he said. “To improve the model is very difficult. You really need to think like a hacker or someone trying to defraud the system to understand which techniques you might use to trick the system and then try, through the modeling, to see if your results actually confirm your ideas.”