The latest version of ChatGPT was able to create an entirely fake dataset — one that showed better results for one ophthalmic procedure over another, a research letter in JAMA Ophthalmology showed.
As prompted, GPT-4 with its “Advanced Data Analysis” technology made up the data and showed a significantly better post-operative best spectacle-corrected visual acuity (BSCVA) and topographic cylinder for deep anterior lamellar keratoplasty (DALK) compared with penetrating keratoplasty (PK) (P<0.001), according to Giuseppe Giannaccare, MD, PhD, of the University Magna Graecia of Catanzaro and the University of Cagliari in Italy, and colleagues.
“GPT-4 created a fake dataset of hundreds of patients in a matter of minutes and the data accuracy vastly exceeded our expectations,” Giannaccare told MedPage Today in an email. “To be honest, this was a surprising, yet frightening experience!”
“The aim of this research was to shed light on the dark side of AI, by demonstrating how easy it is to create and manipulate data to purposely achieve biased results and generate false medical evidence,” he added. “A Pandora’s box is opened, and we do not know yet how the scientific community is going to react to the potential misuses and threats connected to AI.”
Giannaccare noted that while some experts have raised concerns about the use of generative AI in manuscript texts, “few authors have addressed the threat of malicious data manipulation with AI in the medical setting.”
“Data manipulation is a very well-known issue in academia; however, AI may dramatically increase its risk, and academics are not paying enough attention to this issue,” he added.
The capabilities of GPT-4 have recently been expanded with Advanced Data Analysis, which uses the programming language Python to enable statistical analysis and data visualization, the researchers explained.
To assess whether it could indeed create a fake dataset with skewed results, the researchers prompted it to fabricate data for 300 eyes belonging to 250 patients with keratoconus who underwent either DALK or PK. Giannaccare said the team submitted “very complex” prompts to GPT-4, which contained a “large set of rules for creating the desired cohort population.”
“The required data included sex distribution, birthdate, date and type of surgery, preoperative and postoperative best spectacle-corrected visual acuity, topographic cylinder, intraoperative and postoperative complications,” he said. They also prompted it to generate “significantly better visual and topographic results” for DALK over PK, he added.
Overall, the researchers found that “almost all” the criteria were met in the fake dataset “and it is hard to find a difference between a genuine dataset and the one [created] by AI,” Giannaccare told MedPage Today. The model was also capable of producing results that favored one procedure over another.
They did note, however, that the data ranges of continuous variables were not always accurate. Nonetheless, Giannaccare said, it would be possible “to submit more consecutive prompts … fine-tuning the statistical properties of the fake dataset by including additional data columns, fixing mistakes, and obtaining more desirable statistical outcomes. Besides, we asked GPT-4.0 to fabricate data based only on ranges and means; however, it is theoretically possible to ask for specific target standard deviation, confidence interval values, and adjust the shape of data distribution.”
“The possibilities are endless, and increasing the quality of the prompts may lead to even more detailed and realistic datasets compared to the one we fabricated,” he said.
Data manipulation has long been a challenge in academia, and the problem may now only get harder to address, he cautioned.
“It may be possible to scan datasets to check for suspicious patterns of data. For instance, real-world data typically contains outliers, which might not appear in an AI-generated dataset with fixed ranges set by the user,” he said. “However, well-designed prompts may include more specific rules to fix this and other possible flaws. In the future, we will witness an ongoing tug-of-war between fraudulent attempts to use AI and AI detection systems.”
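As a rough illustration of the kind of scan Giannaccare describes, the Python sketch below counts values falling outside Tukey's fences (1.5 times the interquartile range beyond the quartiles) and lists the columns that contain no such outliers. This is not the researchers' method: the function names, the choice of Tukey's rule, and the sample columns are all assumptions made for illustration. A column without outliers is at most a weak hint, since small or tightly controlled real cohorts can also lack them and, as Giannaccare notes, a careful prompt could add them.

```python
from statistics import quantiles


def tukey_outlier_count(values, k=1.5):
    """Count values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


def columns_without_outliers(dataset, k=1.5):
    """Return the names of columns that contain no Tukey outliers.

    `dataset` is assumed to map column names to lists of numbers.
    """
    return [
        name
        for name, values in dataset.items()
        if tukey_outlier_count(values, k) == 0
    ]


if __name__ == "__main__":
    # Hypothetical columns, not taken from the study.
    example = {
        "bounded": list(range(10, 30)),          # evenly spread, fixed range
        "with_outlier": list(range(10, 30)) + [200],
    }
    print(columns_without_outliers(example))
```

In practice such a check would be one signal among several and would need to be combined with other checks before drawing any conclusion about a dataset.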
Despite those threats, Giannaccare said, “an appropriate use of AI can be highly beneficial to scientific research, and our ability to regulate this valuable tool is going to make a substantial difference on the future of academic integrity.”
The authors had no conflicts of interest.
Source Reference: Taloni A, et al “Large language model advanced data analysis abuse to create a fake data set in medical research” JAMA Ophthalmol 2023; DOI: 10.1001/jamaophthalmol.2023.5162.