Evaluation of liquefaction potential based on CPT results using C4.5 decision tree

The prediction of the liquefaction potential of soil due to an earthquake is an essential task in civil engineering. A decision tree has a structure consisting of internal and terminal nodes, which process the data to ultimately yield a classification. C4.5 is a well-known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to address the problem of over-fitting. This article examines the capability of the C4.5 decision tree for predicting the seismic liquefaction potential of soil based on Cone Penetration Test (CPT) data. The database contains information about cone resistance (qc), total vertical stress (σ0), effective vertical stress (σ0′), mean grain size (D50), normalized peak horizontal acceleration at ground surface (amax), cyclic stress ratio (τ/σ0′) and earthquake magnitude (Mw). The overall classification success rate for the entire data set is 98%. The results of the C4.5 decision tree have been compared with the available artificial neural network (ANN) and relevance vector machine (RVM) models. The developed C4.5 decision tree provides a viable tool for civil engineers to determine the liquefaction potential of soil.


Introduction
Seismically induced liquefaction in saturated soils is a phenomenon in which the soil loses much of its strength or stiffness because of rising pore water pressure, generally for a short period of time but nevertheless long enough to cause ground failure. Liquefaction of saturated sandy soils during earthquakes causes building settlement or tipping, sand blows, lateral spreading, ground cracks, landslides, dam and high embankment failures, and many other hazards. Liquefaction occurrence depends on the mechanical characteristics of the soil layers at the site, the depth of the water table, the intensity and duration of the ground shaking, the distance from the source of the earthquake, and the seismic attenuation properties of the in situ soil [1]. Determining the liquefaction potential of soil due to an earthquake is therefore an important step in earthquake hazard mitigation. Because a large number of factors affect the occurrence of liquefaction during an earthquake, determining liquefaction potential is a complex geotechnical engineering problem that has attracted considerable attention from geotechnical researchers over the past three decades. Several methods have been proposed to predict the occurrence of liquefaction. Many of these methods are based on correlations between in situ test measurements and observed field performance data, and are extensions of the "simplified procedure" pioneered by Seed and Idriss [2]. Among in situ tests, many researchers have adopted Cone Penetration Test (CPT) results as the basis for evaluating liquefaction potential [3,4]. A primary advantage of the CPT is the nearly continuous information it provides along the depth of the explored soil strata. The CPT is also considered more consistent and repeatable than other in situ test methods.
Artificial intelligence (AI) techniques such as artificial neural network (ANN) [4-6], support vector machine (SVM) [6-8] and relevance vector machine (RVM) [9,10] have been used to develop liquefaction prediction models based on in situ test databases.
These models can operate on large quantities of data and learn complex model functions from examples, i.e., by training on sets of input and output data. The greatest advantage of AI techniques over traditional modeling techniques is their ability to capture non-linear and complex interactions between the variables of a system without having to assume the form of the relationship between input and output variables. In the context of liquefaction, these methods can be trained to learn the relationship between the soil and earthquake characteristics and the liquefaction potential, requiring no prior knowledge of the form of that relationship. Even though most of these AI techniques have been successfully applied to CPT data, they do have shortcomings. For example, in the ANN approach, the optimum structure (e.g., number of inputs, hidden layers, and transfer functions) must be identified a priori. This is usually done through a trial-and-error procedure. The other major shortcoming is the black-box nature of the ANN model: the relationship between the input and output parameters of the system is described in terms of a weight matrix and biases that are not accessible to the user [11].
Decision tree algorithms are quite transparent and do not need optimization of the model and internal parameters. A decision tree partitions the input space of a data set into mutually exclusive regions, each of which is assigned a label (classification tree) or a value (regression tree) to characterize its data points. The decision tree has a structure consisting of internal and external nodes connected by branches. Each internal node is associated with a decision function that determines which node to visit next. Each external node, known as a terminal node or leaf node, indicates the output for a given input vector. Figure 1 shows a partition of the input space into four non-overlapping rectangular regions, each of which is assigned a class label 'Ci'. C4.5, introduced by Quinlan [12], is a well-known algorithm widely used to design decision trees. This paper investigates the capability of the C4.5 decision tree for predicting the liquefaction potential of soil based on CPT data.
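The node structure described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class names `Leaf` and `Node`, the function `classify`, and the thresholds in `toy_tree` are hypothetical and are not taken from the paper or from Figure 1. Each internal node applies a simple decision function (a threshold test on one input), and each leaf returns a class label, so two inputs and three internal nodes give four rectangular regions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned to every point that reaches this leaf


@dataclass
class Node:
    feature: int      # index of the input variable tested at this node
    threshold: float  # decision function: is x[feature] <= threshold?
    left: "Tree"      # branch followed when the test is true
    right: "Tree"     # branch followed when the test is false


Tree = Union[Leaf, Node]


def classify(tree: Tree, x: list) -> str:
    """Route x from the root down to a leaf and return that leaf's label."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.label


# Hypothetical tree (thresholds are placeholders): four leaves C1..C4,
# i.e. four non-overlapping rectangles in the (x[0], x[1]) plane.
toy_tree = Node(0, 5.0,
                Node(1, 2.0, Leaf("C1"), Leaf("C2")),
                Node(1, 7.0, Leaf("C3"), Leaf("C4")))

print(classify(toy_tree, [1.0, 3.0]))
```

Because every test compares a single input with a constant, the resulting regions are axis-aligned rectangles, which is what makes the fitted model readable as a set of rules.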

Decision trees
Decision trees are fast and easy to use. The rules generated by decision trees are simple and accurate for most problems. Therefore, decision trees are very popular and powerful tools in data mining [14]. In general, a decision tree is a tree in which each branch node represents a choice between a number of alternatives and each leaf node represents a classification or decision [15]. An unknown (or test) instance is routed down the tree according to the values of the attributes tested in the successive nodes. When the instance reaches a leaf, it is classified according to the label assigned to the corresponding leaf.
In the first stage of model construction, a decision-tree induction algorithm is used to build the tree. Many algorithms for decision tree induction exist. Iterative Dichotomiser 3 (ID3) and C4.5 [13,16] are the most widely used, along with the classification and regression tree (CART) algorithm [17]. The C4.5 algorithm is an extension of the ID3 algorithm and its divide-and-conquer approach [12]; its main improvements include the pruning methodology and the handling of numeric attributes, missing values and noisy data. The construction phase begins at the root node, where each attribute is evaluated using a statistical test to determine how well it alone classifies the training samples. The best attribute is chosen as the test at the root node of the tree. A descendant of the root node is then created for each possible value of this attribute if it is discrete-valued, or for each discretized interval of this attribute if it is continuous-valued. Next, the training samples are sorted to the appropriate descendant node. The process is repeated using the training samples associated with each descendant node to select the best attribute for testing at that point in the tree. This forms a greedy search for a decision tree, in which the algorithm never backtracks to reconsider earlier node choices. Although it is possible to add new nodes to the tree until all samples assigned to a node belong to the same class, the tree is not allowed to grow to its maximum depth. A node is introduced to the tree only when a sufficient number of samples is left from sorting. After the complete tree is constructed, tree pruning is usually carried out to avoid over-fitting the data.
The statistical test used in C4.5 for assigning an attribute to each node in the tree employs an entropy-based measure. The assigned attribute is the one with the highest information gain ratio among the attributes available at that point of tree construction. The information gain ratio GainRatio(A, S) of an attribute A relative to the sample set S is defined as
$$\mathrm{GainRatio}(A,S)=\frac{\mathrm{Gain}(A,S)}{\mathrm{SplitInformation}(A,S)}$$
where
$$\mathrm{Gain}(A,S)=\mathrm{Entropy}(S)-\sum_{a\in A}\frac{|S_a|}{|S|}\,\mathrm{Entropy}(S_a),\qquad \mathrm{SplitInformation}(A,S)=-\sum_{a\in A}\frac{|S_a|}{|S|}\log_2\frac{|S_a|}{|S|}$$
and S_a is the subset of S for which the attribute A has the value a. The information gain ratio can be calculated directly for discrete-valued attributes. In contrast, continuous-valued attributes need to be discretised before the information gain ratio is calculated.
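The attribute-selection criterion named above can be sketched in a few lines of Python. This is a minimal illustration, not Quinlan's implementation: the function names `entropy`, `gain_ratio` and `best_threshold` are hypothetical, the entropy is the usual Shannon entropy in bits, and the continuous case is handled by trying a binary split at each midpoint between sorted distinct values, which is one common way to discretise and is assumed here rather than taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of a discrete attribute.

    values[i] is the attribute value of sample i, labels[i] is its class.
    """
    n = len(labels)
    subsets = {}  # attribute value a -> class labels of the subset S_a
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets.values())
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0


def best_threshold(xs, labels):
    """Assumed discretisation: test x <= t at midpoints of sorted values."""
    distinct = sorted(set(xs))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        ratio = gain_ratio([x <= t for x in xs], labels)
        if ratio > best[1]:
            best = (t, ratio)
    return best


# Toy data, for illustration only (not from the paper's database).
print(best_threshold([1.0, 2.0, 8.0, 9.0], ["yes", "yes", "no", "no"]))
```

Dividing the gain by the split information penalises attributes that fragment the samples into many small subsets, which plain information gain would otherwise favour.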

Database
The database [5] used in this study consists of 109 cases in total, of which 74 are liquefied cases and 35 are non-liquefied cases. The database contains cone resistance (qc), total vertical stress (σ0), effective vertical stress (σ0′), mean grain size (D50), normalized peak horizontal acceleration at ground surface (amax), cyclic stress ratio (τ/σ0′) and earthquake magnitude (Mw). The range of values associated with each input variable is shown in Table 1. Generally, in pattern recognition procedures (e.g., ANN, SVM or GP), the model is constructed by adaptive learning over a number of cases, and the performance of the constructed model is then evaluated using an independent validation data set. Therefore, in the present study, 74 cases are used as the training dataset and the remaining 35 cases as the testing dataset. The training and testing datasets are the same as those used by Goh [5] and Samui [9].
[5] and Samui [9], respectively (see Table 4). The ANN uses many parameters, such as the number of hidden layers, number of hidden nodes, learning rate, momentum term, number of training epochs, transfer functions, and weight initialization methods. Although the RVM has fewer parameters than the ANN, it first requires the selection of a suitable kernel function and then the setting of its specific parameters, and these processes are time consuming. Moreover, these techniques do not produce an explicit relationship between the variables, so the developed model provides very little insight into the basic mechanism of the problem. Decision tree algorithms are quite transparent and do not need optimization of the model and internal parameters. The developed C4.5 decision tree (Figure 2) can be used by geotechnical engineering professionals, with the help of a spreadsheet, to evaluate the liquefaction potential of soil for a future seismic event without going into the complexities of model development, whereas the available ANN and RVM models do not provide any explicit equations for professionals. In addition, the C4.5 approach does not require normalization or scaling of the data, which is an advantage over the ANN and RVM approaches.
The limitations of C4.5 decision trees need to be mentioned as well. Similar to other artificial intelligence techniques, decision trees have a limited domain of applicability and are mostly case dependent. Therefore, their generalization is limited and they are only applicable within the range of the training data. However, the C4.5 model can always be updated to yield better results as new data become available.
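Since the model is only applicable within the range of the training data, a simple guard can flag new cases that fall outside it before the tree is applied. The sketch below is one possible way to do this and is not part of the paper: the function names `training_ranges` and `out_of_range` are hypothetical, and the numbers in `train` are illustrative placeholders rather than values from the database or from Table 1.

```python
def training_ranges(cases):
    """Return {variable: (min, max)} over a list of training cases (dicts)."""
    names = cases[0].keys()
    return {k: (min(c[k] for c in cases), max(c[k] for c in cases))
            for k in names}


def out_of_range(case, ranges):
    """List the variables of a new case lying outside the training range."""
    return [k for k, (lo, hi) in ranges.items() if not lo <= case[k] <= hi]


# Illustrative placeholder values only, NOT taken from the paper's database.
train = [
    {"qc": 2.0, "Mw": 6.5, "amax": 0.10},
    {"qc": 12.0, "Mw": 7.8, "amax": 0.40},
]
ranges = training_ranges(train)

new_case = {"qc": 20.0, "Mw": 7.0, "amax": 0.05}
print(out_of_range(new_case, ranges))  # variables that would be extrapolated
```

A non-empty result means the tree would be extrapolating for that case, so its prediction should be treated with caution until the model is updated with data covering that range.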

Conclusions
Liquefaction in soil is one of the major causes of concern in geotechnical engineering. The cone penetration test has proven to be an effective tool for characterizing subsurface conditions and analyzing different aspects of soil behavior, including estimating the potential for liquefaction at a specific site. In this paper, the C4.5 decision tree is used to predict the liquefaction potential of soil based on CPT data. The C4.5 model was trained and validated using a database of 109 liquefaction and non-liquefaction field case histories for sandy soils based on CPT results. The overall classification success rate for the entire data set is 96.3%, which is comparable with the rates reported in the literature for ANN and RVM models. Unlike the available ANN and RVM models, the proposed model provides an easily interpretable tree structure that geotechnical engineering professionals can use, with the help of a spreadsheet, to predict the liquefaction potential of soil for a future seismic event without going into the complexities of model development. The C4.5 decision tree can also be adopted for modeling other problems in the geosciences.

Figure 1. An example of a decision tree for classification: (a) binary decision tree; (b) feature space partitioning [13].

Figure 2. Decision tree generated by the C4.5 algorithm. a: Number of cases in this partition. b: Number of cases misclassified.