ChatGPT Can Pass Medical Exams but Fails Heart Risk Assessments

Cryptopolitan · 2024-05-06T07:33:03.000Z

ChatGPT can reportedly pass medical exams, but new research suggests it would be unwise to rely on it for some serious health assessments, such as deciding whether a patient with chest pain needs to be hospitalized.

ChatGPT is clever but fails at heart assessment

In research published in the journal PLOS ONE, ChatGPT returned inconsistent heart risk levels for the same patient data in a study involving thousands of simulated chest pain cases.

Dr. Thomas Heston, a researcher at Washington State University’s Elson S. Floyd College of Medicine and lead author of the study, said, “ChatGPT was not acting in a consistent manner; given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally it would go as far as giving a high risk.” Source: WSU.

According to the researchers, the issue is probably due to the degree of randomness built into the current version of the software, ChatGPT-4, which helps it vary its answers to mimic natural language. But Heston says this same randomness does not suit healthcare use cases, which demand a single, consistent answer, and that it can be dangerous there.

Chest pain is an everyday complaint in hospital emergency rooms, so doctors need to evaluate the urgency of a patient’s condition quickly. Some very serious cases can be identified easily from their symptoms, Heston said, but the lower-risk ones are trickier, especially when doctors must decide whether a patient’s risk is low enough to be sent home with outpatient care or whether the patient should be admitted.

Other systems prove more reliable

An AI neural network like ChatGPT, which has a very large number of parameters and is trained on huge datasets, can assess billions of variables in seconds, which in principle lets it analyze a complex scenario faster and in more detail.

Heston says medical professionals mostly use two scales for heart risk assessment, HEART and TIMI. He likened them to calculators: each uses a handful of variables, including age, health history, and symptoms, far fewer than ChatGPT can take into account.

For the study, Heston and his colleague, Dr. Lawrence Lewis of Washington University in St. Louis, used three datasets of 10,000 randomly simulated cases each. One dataset had the five variables of the HEART scale; another had the seven variables of TIMI; and the third had 44 randomly selected variables.

For the first two datasets, ChatGPT’s risk assessment of individual simulated cases differed from the fixed TIMI or HEART score 45% to 48% of the time. For the third dataset, the researchers ran the cases multiple times, and ChatGPT returned different results for the same cases.

Despite the study’s unsatisfactory findings, Heston thinks GenAI has great potential in healthcare as the technology advances. According to him, medical records could be uploaded to such systems, and in an emergency doctors could ask ChatGPT for the most important facts about the patient. It could also be asked to generate several possible diagnoses with the reasoning for each, which would help doctors think through a problem.
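
To make the consistency comparison in the study concrete, here is a minimal Python sketch. It is not the researchers' code. It assumes the standard HEART layout of five components scored 0 to 2 each, with totals of 0-3, 4-6, and 7-10 mapped to low, intermediate, and high risk. The `noisy_rater` function, its `flip_prob` value, the number of repeated runs, and the 1,000 simulated cases are hypothetical stand-ins chosen only for illustration; they do not model ChatGPT or reproduce the study's numbers.

```python
import random

LABELS = ("low", "intermediate", "high")

def heart_risk(history, ecg, age, risk_factors, troponin):
    """Rule-based HEART-style score: five components, each scored 0-2."""
    total = history + ecg + age + risk_factors + troponin
    if total <= 3:
        return "low"
    if total <= 6:
        return "intermediate"
    return "high"

def noisy_rater(case, rng, flip_prob=0.3):
    """Hypothetical rater with built-in randomness (an assumption, not ChatGPT)."""
    label = heart_risk(*case)
    if rng.random() < flip_prob:
        # Occasionally return one of the other two risk levels.
        return rng.choice([other for other in LABELS if other != label])
    return label

def disagreement_rate(cases, rater, runs=3):
    """Fraction of cases whose repeated ratings are not all identical."""
    inconsistent = sum(
        1 for case in cases if len({rater(case) for _ in range(runs)}) > 1
    )
    return inconsistent / len(cases)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
cases = [tuple(rng.randint(0, 2) for _ in range(5)) for _ in range(1000)]

fixed = disagreement_rate(cases, lambda c: heart_risk(*c))
noisy = disagreement_rate(cases, lambda c: noisy_rater(c, rng))
print(f"rule-based score: {fixed:.0%} of cases rated inconsistently")
print(f"noisy rater:      {noisy:.0%} of cases rated inconsistently")
```

The rule-based score always maps the same inputs to the same risk level, so its disagreement rate is zero by construction. Any rater with randomness in its output will disagree with itself on some share of cases, which is the kind of behavior the study measured.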

