Benchmarking Large Language Models for Scientific Writing: A Mixed‑methods Evaluation of ChatGPT, DeepSeek, and Human Authors

ZHILISBAYEVA, KULYASH R.; SHOKRPOUR, NASRIN; PARVIZI, MOHAMMAD MAHDI

doi:10.30476/jamp.2026.109761.2366

Journal of Advances in Medical Education & Professionalism
Volume 14, Issue 3, Mehr 2026, Pages 281-290
Article Type: Original Article
DOI: 10.30476/jamp.2026.109761.2366
Authors
KULYASH R. ZHILISBAYEVA¹; NASRIN SHOKRPOUR*²,³; MOHAMMAD MAHDI PARVIZI⁴,⁵
¹Department of Languages, West Kazakhstan Marat Ospanov Medical University, Aktobe, Kazakhstan
²English Department, Faculty of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
³Clinical Education Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
⁴Molecular Dermatology Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
⁵Department of Medical Journalism, School of Paramedical Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
Abstract
Introduction: Modern large language models (LLMs) such as ChatGPT (based on the GPT-4 architecture) and DeepSeek offer unprecedented capabilities for generating scientific text. However, their ability to replicate structured, high-quality scientific writing, especially in comparison with human-authored abstracts, remains insufficiently evaluated. This study compared the quality of abstracts produced by human authors, ChatGPT (GPT-4), and DeepSeek across six evaluation criteria (Clarity, Coherence, Conciseness, Accuracy, IMRaD Structure, and Language Quality) using blinded expert ratings and non-parametric statistical methods, specifically the Kruskal–Wallis test followed by pairwise Wilcoxon rank-sum tests with false discovery rate correction.
Methods: We selected 23 medical and health-related research topics, each yielding three abstracts (human, ChatGPT, DeepSeek), for a total of 69 abstracts. Three raters scored each abstract. Kruskal–Wallis tests assessed group differences; Cliff's delta (δ), a non-parametric effect size suitable for ordinal data, was calculated for each comparison.
Results: ChatGPT and DeepSeek significantly outperformed human authors in Clarity, Coherence, IMRaD Structure, and Language Quality. In contrast, Conciseness and Accuracy showed negligible effect sizes (|δ| < 0.10), suggesting parity across all three sources.
Conclusions: ChatGPT and DeepSeek achieved significantly higher scores in clarity, coherence, structure, and language quality, while performing comparably to human authors in conciseness and accuracy. These findings complement recent evaluations showing competitive medical and reasoning performance of DeepSeek models compared to proprietary LLMs. Although expert oversight and domain expertise remain critical, the results suggest that LLMs, particularly GPT-4 and DeepSeek, can serve as effective tools for drafting short-form scientific abstracts.
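The statistical workflow named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' analysis code: the ratings below are hypothetical, the 1 to 5 rating scale is an assumption (the abstract does not state the scale or how the three raters' scores were combined), and the function names are invented for illustration. The sketch computes a tie-corrected Kruskal–Wallis H for three groups (with two degrees of freedom the chi-square tail probability is exactly exp(-H/2)), Cliff's delta for one pair of groups, and a Benjamini–Hochberg false discovery rate adjustment. The pairwise Wilcoxon rank-sum test itself is omitted; the p-values passed to the adjustment are placeholders.

```python
from math import exp


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(a, b, c):
    """Tie-corrected H statistic and p-value for exactly three groups (df = 2)."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every rating identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, exp(-h / 2)  # chi-square survival function for df = 2


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical Clarity ratings on an assumed 1-5 scale (not data from the study).
human = [3, 3, 4, 2, 3, 4]
chatgpt = [4, 5, 4, 5, 4, 5]
deepseek = [4, 4, 5, 5, 4, 4]

if __name__ == "__main__":
    h_stat, p_value = kruskal_wallis_3(human, chatgpt, deepseek)
    print("Kruskal-Wallis H:", round(h_stat, 3), "p:", round(p_value, 4))
    print("Cliff's delta (ChatGPT vs human):", round(cliffs_delta(chatgpt, human), 3))
    # Placeholder p-values standing in for three pairwise rank-sum tests.
    print("BH-adjusted p-values:", benjamini_hochberg([0.01, 0.04, 0.03]))
```

Cliff's delta ranges from -1 to 1, and values near 0 indicate that neither group tends to receive higher ratings, which is how the abstract's |δ| < 0.10 threshold for Conciseness and Accuracy can be read.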
Keywords
Artificial intelligence; Natural language processing; Deep learning; Abstracting and indexing; Medical writing