Benchmarking Large Language Models for Scientific Writing: A Mixed‑methods Evaluation of ChatGPT, DeepSeek, and Human Authors | ||
| Journal of Advances in Medical Education & Professionalism | ||
| Volume 14, Issue 3, Mehr 2026, Pages 281-290 | ||
| Article type: Original Article | ||
| Digital Object Identifier (DOI): 10.30476/jamp.2026.109761.2366 | ||
| Authors | ||
| KULYASH R. ZHILISBAYEVA 1; NASRIN SHOKRPOUR* 2, 3; MOHAMMAD MEHDI PARVIZI 4, 5 | ||
| 1 Department of Languages, West Kazakhstan Marat Ospanov Medical University, Aktobe, Kazakhstan | ||
| 2 English Department, Faculty of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran | ||
| 3 Clinical Education Research Center, Shiraz University of Medical Sciences, Shiraz, Iran | ||
| 4 Molecular Dermatology Research Center, Shiraz University of Medical Sciences, Shiraz, Iran | ||
| 5 Department of Medical Journalism, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran | ||
| Abstract | ||
| Introduction: Modern large language models (LLMs) such as ChatGPT (based on the GPT-4 architecture) and DeepSeek offer unprecedented capabilities for generating scientific text. However, their performance in replicating structured, high-quality scientific writing, especially compared with human-authored abstracts, remains insufficiently evaluated. This study aimed to compare the quality of abstracts produced by human authors, ChatGPT (GPT-4), and DeepSeek across six evaluation criteria (Clarity, Coherence, Conciseness, Accuracy, IMRaD Structure, and Language Quality) using blinded expert ratings and non-parametric statistical methods, specifically the Kruskal–Wallis test followed by pairwise Wilcoxon rank-sum tests with false discovery rate correction. Methods: We selected 23 medical and health-related research topics, each yielding three abstracts (human, ChatGPT, DeepSeek), for a total of 69 abstracts. Three raters scored each abstract. Kruskal–Wallis tests assessed group differences; Cliff's delta (δ), a non-parametric effect size suitable for ordinal data, was calculated for each comparison. Results: ChatGPT and DeepSeek significantly outperformed human authors in Clarity, Coherence, IMRaD Structure, and Language Quality. In contrast, Conciseness and Accuracy showed negligible effect sizes (|δ| < 0.10), suggesting parity across all three sources. Conclusions: ChatGPT and DeepSeek achieved significantly higher scores than human authors in clarity, coherence, structure, and language quality, while performing comparably in conciseness and accuracy. These findings complement recent evaluations showing competitive medical and reasoning performance of DeepSeek models compared with proprietary LLMs. Although this evaluation covered only short-form abstracts, and expert oversight and domain expertise remain critical, the results suggest that LLMs, particularly GPT-4 and DeepSeek, can serve as effective tools for drafting scientific abstracts. | ||
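The abstract names Cliff's delta and a false discovery rate correction without showing how either is computed. The following is a minimal standard-library sketch of both, not the authors' analysis code (the abstract does not state which software was used). The function names and the ratings are hypothetical, and the Benjamini-Hochberg procedure is assumed as the FDR method, since the abstract does not specify one.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical ratings on a 1-5 scale (illustrative only, not the study's data).
human = [3, 3, 4, 2, 3]
llm = [4, 5, 4, 4, 5]
delta = cliffs_delta(llm, human)

# Hypothetical raw p-values from three pairwise comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Cliff's delta ranges from -1 to 1; values near 0, such as the |δ| < 0.10 reported for Conciseness and Accuracy, indicate that ratings in one group are about as likely to be higher as lower than those in the other.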
| Keywords | ||
| Artificial intelligence; Natural language processing; Deep learning; Abstracting and indexing; Medical writing | ||