Introduction
The duty of the medical sciences is to treat ill health and promote good health in the community. Achieving this goal requires knowing the condition of the body, its responses to external and internal stimuli, and how those responses in turn affect the internal and external factors of the body's system. A deeper understanding of the human body makes it possible to determine and understand the actions and interactions of its organs. This understanding, however, is limited to the range that experiments can cover. Experimental results reveal how different organs (subsystems) respond to internal and external inputs. As a change becomes stronger or wider, the number of subsystems involved in producing a response increases. Environmental influences seldom act in isolation, and the factors that shape their interaction with each subsystem lead the body to produce more complex responses. The situation becomes more complicated still when underlying diseases, which are present in the normal population, alter the body. This leads to different responses and makes the body's changes harder to predict.
Simple analysis helped scientists in this field understand medical information and draw conclusions from it, grasp the basic behavior of the system as a whole, and identify the contributing factors. Once it became possible to collect patient information at scale and big data emerged in several health-related areas, statistical tests were no longer sufficient to analyze the situation and identify the main factors [1]. As mentioned above, the body is a complex system whose interactions are often accompanied by multiple external factors. Statistical approaches are therefore ineffective for understanding and reliably predicting conditions, especially when there are many time-dependent variables and parameters [2]. The need for a more capable analysis system coincided with the emergence of data mining, the process of knowledge discovery, which combines machine learning, expert systems, statistics, and related fields. Such systems had already shown, in economic and military applications, that they could improve the understanding of complex systems and the prediction of their future performance. In particular, data mining can reveal the underlying patterns of a system along with how each subsystem behaves in the face of change [2]. The main goal of data mining is to extract hidden knowledge from very large data sets, knowledge that cannot be observed with simple statistical analysis [3]. The data mining process lets owners of big data better understand the dependencies among the attributes of the samples in a large data set, interpret subsystem processes, and derive rules for and predictions of the corresponding subsystem behavior [4].
Data Mining in Emergency Medicine
Emergency medicine is the front line of hospital medical services; it is the department where people seek medical care immediately after an emergency. Data mining is a relatively new technique that has grown out of recent advances in artificial intelligence and database technology. It focuses on re-analyzing databases with the aim of discovering valuable, previously unknown information and determining patterns in the data [1]. Data mining has been used in medical research to explore how to reduce patient complaints that arise from insufficient or improper treatment. In this way, data mining can improve the quality of care and reduce the waste of medical resources. Shi et al. [5] showed that applying data mining analysis to emergency triage and physician shift scheduling reduces classification noise and helps determine triage levels. Data mining also increases the consistency of triage classification in emergency medicine; one study used three data mining techniques for this purpose [6]. Computer systems can be used to generate calls for reservation. Mining data on patients' treatments has also been found to inform thinking about the nature of work in emergency departments, and process-based thinking was used to derive a simple model of emergency department operation [7].
The ways in which data mining helps medical sciences
In general, the areas of medical sciences that require data mining analysis can be categorized into the following items:
- Identifying the complex mechanisms of different body subsystems and their interactions with each other [8,9];
- Identifying people who are at risk for diseases caused by genetic predisposition or environmental factors [10];
- Identifying disease mechanisms and their interactions with the problems of the body [9];
- Determining disease prognoses and managing facilities [11];
- Establishing decision support systems to make the best decision, especially when the disease is multi-factorial, when more factors are involved in determining the course of the disease, in emergencies, or in acute phases of a disease [10,11];
- Evaluating diagnostic and treatment tasks and relationships and identifying shortcomings and capabilities [12];
- Finding the best screening methods for diseases and injuries, particularly for patients in critical conditions [13].
Data mining applies algorithms implemented in software to meet the needs of medical science in each of these areas through the construction of analytical models, categorization, prediction (prognosis), and presentation of information. There are many data mining techniques, but the following are used most often in analytical or predictive medicine: classification, regression, clustering, discovery of interpretable dependency rules, and sequence analysis (Table 1) [14].
Table 1. Categories of data mining methods and algorithms used in Medical Sciences
| No. | Medical Field Research | Data Mining Method | Data Mining Algorithm | Description |
|---|---|---|---|---|
| 1 | Disease prediction | Categories, Clustering, Rules of dependence | K-means, Apriori | Determine the factors affecting cancer types |
| 2 | Determining the best type of treatment | Categories, Clustering, Rules of dependence | K-means, Apriori | Determine the best type of cancer treatment among 3 methods: surgery, chemotherapy, radiotherapy |
| 3 | Experimental data evaluation | Clustering | SAMBA | Study the behavior of genes with DNA strands to predict genetic disorders and fatal diseases |
| 4 | Prediction in emergency patients | Clustering | Combination of K-means and SOM^a neural network | Take proper and timely treatment decisions and reduce hospital costs |
| 5 | Length of stay prognosis | Categories, Clustering | Decision Tree, Neural Network, Naïve Bayes, C4.5 | Predict the duration of hospitalization of digestive patients who need short-term care, to reduce hospital costs |
| 6 | Identification and prediction of disease symptoms | Clustering | Combination of Genetic Algorithm and K-means | Identify and predict heart attacks |
| 7 | Traditional Chinese Medicine | Clustering | SVM^b, Decision Tree, Bayesian Network | Discover the different syndromes |
| 8 | Traditional Chinese Medicine | Rules of dependence | CNA^c | Find good acupuncture points and patterns of medicinal plants |
| 9 | Traditional Chinese Medicine | Clustering | SVM | Predict the onset of diabetic neuropathy |
| 10 | Traditional Chinese Medicine | Clustering | Neural Network | Treatment of rheumatoid arthritis |

^a SOM: Self-organizing Map; ^b SVM: Support Vector Machine; ^c CNA: Complex Network Analysis
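To make the "rules of dependence" category more concrete, the following minimal Python sketch computes the two measures on which association rule mining (for example with Apriori) is built: support and confidence. The patient records, item names, and function names are hypothetical and chosen only for illustration; they are not taken from any of the cited studies. The pair search is a brute-force count, whereas Apriori would prune candidate itemsets level by level.

```python
from itertools import combinations

# Hypothetical toy records: each set lists the findings noted for one patient.
records = [
    {"smoking", "hypertension", "heart_attack"},
    {"smoking", "heart_attack"},
    {"hypertension", "diabetes"},
    {"smoking", "hypertension", "heart_attack"},
    {"diabetes"},
]

def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A).

    Assumes the antecedent occurs in at least one record.
    """
    joint = support(set(antecedent) | set(consequent), records)
    return joint / support(antecedent, records)

def frequent_pairs(records, min_support):
    """Brute-force search for item pairs whose support reaches `min_support`."""
    items = sorted(set().union(*records))
    return {
        pair: support(pair, records)
        for pair in combinations(items, 2)
        if support(pair, records) >= min_support
    }

if __name__ == "__main__":
    print(support({"smoking", "heart_attack"}, records))
    print(confidence({"smoking"}, {"heart_attack"}, records))
    print(frequent_pairs(records, 0.4))
```

A rule such as "smoking implies heart_attack" is usually reported only when both its support and its confidence exceed thresholds chosen by the analyst.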
The History of Data Mining in Medical Sciences
Data mining was first deployed in the field of health in the early 1990s, nearly 10 years after it emerged in trade, communication management, and crime analysis. It appears to have been used to identify trends in the income and expenses of treatments, and its capabilities for monitoring and understanding clinical data were also studied [7, 15].
As data mining progressed in computer science and statistics progressed in the medical sciences, evidence-based medicine emerged and drew the attention of medical teams to generating usable data, which produced large amounts of information. Knowledge then had to be extracted from this information. Because linear statistics could not turn this information into knowledge, data mining came to be considered a tool for knowledge discovery [16-18].
Between 1997 and 1999, Prather [17] and Babic [18] showed the importance of data mining in monitoring medical information, with an emphasis on large volumes of data. From about the time these articles were published and the role of data mining in investigating medical information became clear, two-way communication developed between data mining specialists and those involved in the health sector. Health professionals gained the ability to explore relationships among data and predict complex processes, while data mining experts found a field rich in big data, where data were studied regularly under different conditions and with different parameters and variables. Their cooperation has so far led to impressive, operational achievements [19,20].
Over time, data mining has largely found its place in the medical sciences. Bellazzi released a guideline for data mining in the medical sciences with the aim of changing attitudes toward the task of predicting patients' conditions in various fields [21]. In 2003, Hripcsak [22] showed that the use of data mining can increase patient safety and prevent medical errors. Lynch [15] stated the importance of data mining and its ability to solve medical problems, given the capabilities of these techniques.
Discussion
In recent years, several studies have been carried out using data mining schemes in different medical fields. Engineers have evaluated the adequacy of data mining algorithms and models in different areas of health. Based on the different aspects mentioned at the beginning of this article, some studies are discussed below:
- Many data mining research projects have addressed the identification of the body's complex processes, especially at the molecular level. Experts took up this topic early on, particularly once new technologies gave access to genetic information. Researchers have gathered a great deal of information about different gene sequences, which can be analyzed with data mining techniques to gain new knowledge about how a system performs given its genetic makeup [16].
In 2012, Tilton used data mining on genetic data to produce software that could model and predict morphology from the genetic formula by analyzing mRNA-related data. This software is currently in use and is being upgraded [23].
- Regarding the identification of people at risk, diseases such as cancer are among the most popular application areas of data mining in medicine, from the aspects of both genetic predisposition and environmental factors. Data mining was considered from the earliest days because it makes it possible to analyze many aspects of the situation of people with cancer and because identifying the conditions that endanger people is important. In the early years, mostly the clinical and radiological information of a patient was assessed using the cognitive development index and genetic information; data mining is now mostly used to monitor the massive amounts of data obtained from genetic analyses. In 2001, Kuo and Chang reviewed and categorized sonographic findings in patients with breast cancer using a decision tree [24]. By analyzing the data they were able to identify patients with breast cancer, and they built a forecasting system based on this tree, the findings of ultrasound examination, and imaging precision [25]. They then designed a software system to predict malignant breast masses from ultrasound findings [24,25]. Asadi et al. [10] used data mining techniques to study the factors leading to the development of cancer and detected associations among these factors in the information recorded in the cancer registry of Nemazee hospital in Shiraz, Iran.
The latest achievements of data mining in the field of cancer focus more on genomic data and on nucleic acids such as DNA and RNA.
In 2015, Moore identified patterns with a model that could predict a person's risk of cancer from mRNA analysis. He was also able to find the initial prediction model using the data mining process [16].
In 2016, Milioli classified breast cancer by applying data mining to a genetic data bank. This work resulted in a better understanding of the pathology and diagnosis of cancer [26].
- Disease mechanisms and how the body interacts with the problem have been the focus of several studies. Dehghani et al. [27] combined K-means clustering with genetic algorithms to identify patients with heart attack and to predict heart attacks. In another study, Apitus predicted the incidence of neurodegenerative diseases in non-affected persons through a mathematical model optimized by a genetic algorithm [28]. In 2016, Dipnall [29] used data mining to identify significant laboratory data in patients with depression, a relationship that conventional analysis had not discovered. Using a hybrid system that mixed linear analysis with data mining techniques, Dipnall found a significant relationship between laboratory findings and the risk of depression. Paydar et al. [30] used modeling to predict malignancy in thyroid nodules.
- Determining disease prognoses and managing facilities, which includes the operational applications of data mining in medicine, is another topic. Its importance grows when care facilities have limited time, the cost of care is high, or the probability of response to treatment is low. In 2011, Lin et al. [11] defined the conditions and outcomes that lead to death regardless of the cost of care. The study examined the data of patients admitted to the emergency room, together with follow-up costs, by clustering patients and using neural networks.
Delen et al. [12] used machine learning to model the prognosis of patients after lung transplantation from demographic data, operating conditions, and paraclinical findings before, during, and after transplantation. The purpose of the modeling was to predict postoperative outcomes in lung transplant candidates; through modeling, they were able to predict the likelihood of death in lung transplant patients. This can inform subsequent decisions about whether to perform these demanding and costly operations.
Progress in applying data mining to data analysis and to the interpretation of human genetic data raised challenges in predicting a person's risk of disease, especially where predictive power was low or methods were not chosen to suit the category of problem. In 2006, Moore's study identified the need to improve the accuracy of the models and to ensure a high degree of confidence [12, 31].
- A decision support system that helps make better decisions is important for researchers and decision-makers in the field of health. With a decision support system, it is likely that fewer errors will occur, better decisions will be made, and better results will be achieved. In 2009, Canlas described the work of Thungarel and Gorunescu, who used the K-means clustering algorithm for diagnosing and treating cervical cancer [32].
The decision support systems for breast cancer diagnosis proposed in 2001 by Kuo et al. [25] were discussed above. Much has also been done to advance the use of decision support systems in emergencies, including the systems that McKenna and Chen [33] used in 2008 to quickly identify trauma patients suffering from blood loss or shock. These researchers provided a system that can accurately identify patients at risk for shock by using an ensemble classifier. Because regular monitoring is inadequate for the effective early detection of patients at risk for shock [35-37], they used data mining to offer an acceptable model [26, 34].
From another angle, data mining techniques still have limitations that prevent them from achieving excellent results in some cases. For instance, DBSCAN is a powerful clustering algorithm that is applicable to large data sets such as medical databases, but it is not well suited to high-dimensional data such as bioinformatics data. K-means is a good clustering algorithm, but its performance depends heavily on its distance function. Support vector machines (SVM) and neural networks have strong classification capabilities, but each has its own shortcomings. SVM offers statistical generalization capability and handles high-dimensional samples without difficulty, but its training cost grows rapidly with the number of samples, so it scales poorly to large data sets. The main weakness of neural networks is that they can over-fit the training data, so that training and test results differ widely in some cases. Fuzzy classifiers and clustering techniques work well in uncertain environments, provide interpretable rules, and can work even with an insufficient number of samples, but they lack statistical generalization support, and their training and test results may differ drastically in some applications. The same holds for regression methods: adaptive boosting regression (AdaBoost.R) and support vector regression (SVR) cope well with a small number of samples but fail to handle big data. Polynomial regression methods optimized by the least-squares criterion perform well in some applications, but they do not consider a margin around the regression curve.
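The dependence of K-means on its distance function can be illustrated with the assignment step alone. The sketch below is a minimal example under stated assumptions: the points, the two fixed centroids, and the function names are hypothetical, and only the assignment step is shown, not the full algorithm with centroid updates.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def assign(points, centroids, distance):
    """K-means assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
        for p in points
    ]

# Hypothetical 2-D samples and two fixed centroids, for illustration only.
points = [(0.0, 0.0), (4.0, 3.0), (5.0, 1.0)]
centroids = [(3.0, 3.0), (5.0, 0.0)]

if __name__ == "__main__":
    print(assign(points, centroids, euclidean))  # [0, 0, 1]
    print(assign(points, centroids, manhattan))  # [1, 0, 1]
```

The first point changes cluster when the metric changes, although the data and the centroids are identical. In a full run, such differences feed into the centroid updates and can lead to different final clusters.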
In conclusion, no single method can handle the wide variety of data sets with different specifications, for example data sets with a large number of samples, high-dimensional samples, a high proportion of noisy samples, two-class or multi-class situations, or uni-modal or multi-modal distributions. Instead, there is growing interest in improving the existing methods by combining and fusing them so that each benefits from the strengths of the others.
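One simple way of combining methods is majority voting over several classifiers. The sketch below is only an illustration of that idea: the three rule-based classifiers, the feature names, and the thresholds are hypothetical placeholders, not clinical criteria and not the models used in the cited studies.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the label predicted by most classifiers.

    With an odd number of binary classifiers there are no ties; otherwise
    the label that reached the top count first is returned.
    """
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical weak rules; the thresholds are placeholders, not clinical criteria.
def rule_heart_rate(sample):
    return "at_risk" if sample["heart_rate"] > 120 else "stable"

def rule_blood_pressure(sample):
    return "at_risk" if sample["systolic_bp"] < 90 else "stable"

def rule_resp_rate(sample):
    return "at_risk" if sample["resp_rate"] > 24 else "stable"

ensemble = [rule_heart_rate, rule_blood_pressure, rule_resp_rate]

if __name__ == "__main__":
    patient_a = {"heart_rate": 130, "systolic_bp": 85, "resp_rate": 18}
    patient_b = {"heart_rate": 80, "systolic_bp": 120, "resp_rate": 26}
    print(majority_vote(ensemble, patient_a))  # at_risk (two of three votes)
    print(majority_vote(ensemble, patient_b))  # stable (two of three votes)
```

In practice the members of such an ensemble would be trained models of different types, so that the weaknesses of one are offset by the others.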
Conflict of Interest: None declared.