Virtual Metrology Modeling for CVD Film Thickness

The semiconductor industry is continuously facing four main challenges in film characterization techniques: accuracy, speed, throughput and flexibility. Virtual Metrology (VM), defined as the prediction of metrology variables using process and wafer state information, is able to successfully address these four challenges. VM is understood as definition and application of predictive and corrective mathematical models to specify metrology outputs (physical measurements). These statistical models are based on metrology data and equipment parameters. The objective of this study is to develop a model predicting the CVD oxide thickness (average) for an IMD (Inter Metal Dielectric) deposition process using FDC data (Fault Detection and Classification) and metrology data. In this paper, two VM models are studied: one based on Partial Least Squares Regression (PLS) and one based on Tree ensembles. We will demonstrate that both models show good predictive strength. Finally, we will highlight that model update is key for ensuring a good model robustness over time and that an indicator of confidence of the predicted values is necessary too if the VM model has to be use on-line in a production environment.


Introduction
The semiconductor manufacturing industry has a largevolume multistage manufacturing system. To ensure high stability and high production yield, reliable and accurate process monitoring is required [1]. Advanced Process Control (APC) is currently deployed for factory-wide control of wafer processing in semiconductor manufacturing. The APC tools are considered to be the main drivers to guarantee a continuous process improvement [2].
However, most APC tools strongly depend on the physical measurement provided by metrology tools [3]. Critical wafer parameters are measured, such as, for example, the thickness and/or the uniformity of thin films. If a wafer is misprocessed in an early stage but detected at the wafer acceptance test, unnecessary resource consumption is unavoidable. Measuring every wafer's quality after each process step could avoid late wafer scraps but it is too expensive and time consuming. Therefore, metrology, as it is employed for product quality monitoring today, can only cover a small fraction of sampled wafers. Virtual metrology (VM) in contrast enables prediction of every wafer's metrology measurement based on production equipment data and previous metrology results [4][5][6][7]27]. This is achieved by defining and applying predictive models for metrology outputs (physical measurements) as a function of metrology and equipment data of current and previous steps of fabrica tion [8][9][10][28][29][30][31].Of course it is necessary to collect data from equipment sensors to characterize physical and chemical reactions in the process chamber. Sensor data will constitute the basis for the statistical models that will be developed. A typical Fault Detection and Classification (FDC) system collects on-line sensor data from the processing equipment by sensors for every wafer or batch. They are called process variables or FDC data. Reliable and accurate FDC data are essential in VM model [11]. The objective of a VM module is to develop a robust prediction that can provide estimation of metrology and which is able to handle process drifts whether they are induced by preventive maintenance actions or not. This paper deals with the prediction of PECVD (Plasma Enhanced Chemical Vapor Deposition) oxide thickness for an Inter Metal Dielectric (IMD) layers using FDC and metrology data. Two types of mathematical models are studied to build VM modules for PECVD processes. Partial Least Squares Regression (PLS) [12][13] and a non-linear approach based on Tree ensembles [14][15][16] are considered. The technical challenge and innovation are to build a single robust model, either with PLS or Tree ensembles, which is valid for several products, different layers and two different chucks. The alternative would be to make a model per layer, chuck and product, but we strongly believe that the maintenance of many single models, in our case 12 different models, is not compatible with the constraints of the industry.
Section II deals with fabrication process. In section III we present the mathematical background to build VM models. Results are described in sections IV. Some perspectives about what the next steps of this work will be are given in section V. Finally, section V concludes this paper with a summary.

Fabrication Process
The film layer under investigation for thickness modeling is part of the IMD used in the Back-End of Line (BEOL) of a 0.35µm technology process. This oxide layer is used three times during the production of a four metal layer device. PECVD USG (Undoped Silicon Glass) films are commonly used to fill the gaps between metal lines due to their conformal step coverage characteristics. However, as the device geometry is shrinking, the gap fill capability of USG films is no longer sufficient. State of the art technique is the combination of HDP (High Density Plasma) and USG films to provide a high-productivity and low-cost solution. HDP is used to fill the gap just enough to cover the top of the metal line and then the USG is used as a cap layer on top of the HDP oxide film [17].  Figure 1 shows the layer structure for one inter metal layer just before the Chemical Mechanical Polishing (CMP) step. The process steps are identical for all three stages. We use identical equipment production recipes, identical metrology setup and identical cleaning procedures for all three stages in the process flow. After metal deposition and structuring (lithography and etch) the HDP oxide is deposited. The HDP oxide film thickness is then measured by ellipsometry, using a 9-site template recipe. FDC data for VM modeling are collected during the USG deposition right after the HDP process. The full oxide stack is measured by the same ellipsometer tool also using a 9-site template recipe.
A schematic drawing of the process flow can be seen in figure 2. To guarantee the collection of a proper set of data within a few weeks, ten wafers per lot are measured before and after the USG process. The objective of the VM model is to use the predictive results as an input parameter for the following CMP process step. The CMP tool uses this input data for calculating the polishing time of each wafer and therefore the integrated layer thickness measurement could be skipped. The benefit of this VM implementation is an increased throughput and cost reduction. Wafers are processed in a twin-chamber of a PECVD tool. The same deposition recipe is used for the deposition of different inter-metal layers and several products. During wafer processing, the relevant process parameters that characterize the PECVD process, such as gas flows, pressure, temperature plasma parameters, etc., are gathered. These temporal data are then consolidated with statistical methods. The temporal data (sensors) are collected at a sampling rate of 0.5 second. If too many samples are missing during the data collection, the data are discarded and the wafer is not used in the VM modeling. The temporal data are then transformed into the so-called FDC Indicators. A FDC Indicator is the summarization of temporal data into a single point, based on a given algorithm (mean, range, maximum, minimum, slope, etc). A data set consists of data from production equipment (input data X for VM modeling) and metrology equipment (output data Y for VM modeling). To assure the quality and effectiveness of VM models it is necessary to do preliminary quality studies of process and metrology equipments like variance analysis or repeatability and reliability studies (R&R studies). In addition, context information like layer, product and chuck is essential as categorical input for VM modeling.
Input data X consists of 24 indicators and three contextual variables. The output variable Y represents the average of the PECVD oxide thickness of each wafer.

Notation
The following notation conventions are used in this paper: scalars are designated using lowercase italics. Vectors are generally interpreted as column vectors and are designated using bold italic lowercase (i.e. x). Matrices are shown in bold italic uppercase (i.e. X), where x ij , with (i=1,…, I) and (j=1,…, J), is the ij th element of X(I×J). Let X of p ℜ be an input data set and Y of m ℜ be arranged in the following way:

Virtual Metrology
. The characters I, J, N, p, q, m and n are reserved for indicating the dimension of vectors and matrices of data.

VM Modeling
There are some important points when designing the mathematical models and a methodology that should be considered. In this section we propose two successive stages to deploy mathematical models in order to build a VM Module for an individual process:

Data partitioning: Training set and Test Set
Let X (I×p) and Y (I×m) be the available data set (cleaned and normalized) respectively from production and metrology process. The data set partitioning consist in the extraction of two units: a unit of 70% of the data set for the training-validation (training and cross validation) and a unit of 30% of the data set for the test. Let X N (N×p) and Y N (N×m) be the training-validation data set, and let X N (n×p) and Y N (n×m) be the test data set with N+n=I. It is possible to split the available data set in a temporal way (chronological selection) without loss of representativeness. In this case study we have chosen this type of data partitioning before the application of the three mathematical models.
Alternatively, the Kennard-Stone method [15] can be used to perform the data set partitioning. The inputs variables domain X of p ℜ , ( ) is considered for the Kennard-Stone method.. It is a sequential method to select a training set uniformly which covers the entire X variable's space. The selection criteria use the Euclidean distance.

Mathematical Modeling
A linear regression model of a given process can be written as: is the matrix of regression coefficients and E is the matrix of errors whose elements are independently distributed with mean zero and variance σ 2 [18][19]. Linear or non linear regression methods can be applied to the matrices X and Y to compute the coefficient matrix B. The regression model is built in two levels: the Training-validation level with the training-validation data set and the test level with the test data set. The training of models that are linear with respect to their parameters (such as linear regressions, polynomials models) can be performed easily with the traditional least-squares method, whereas the training of models that are nonlinear with respect to their parameters (such as neural networks) requires more complex methods. More details about the training of mathematical models can be found in [16]. Global approaches to model selection in the training-validation level are Cross-validation [20] and Leave-One -Out, methods for estimating generalization error based on resampling [21]. It is obvious to perform the model selection on the basis of the Validation Root Mean Square Error on the Training-validation data set (VRMSE). The VRMSE is given by equation (2): where yi is the measured output value, ŷi is the estimated output value from the model, and N is the size of the training data set. The VRMSE is often used for comparing various models. In n-fold Cross-Validation the data are divided into n subsets of (approximately) equal size. The net is trained n times, where one of the training subsets is left out. Only the omitted subset is used to compute the error criterion of interest. If n is equal to the sample size it is called leave-one-out cross-validation. The prediction performance of the selected model is estimated using the test data set. The performance indicator is the Test Root Mean Square Error of Prediction (TRMSE) computed on the test data set: where y i is the measured output value, ŷ i is the estimated output value from the model, and n is the size of test data set.

PLS Models
Consider a set of historical process data consisting of an (I × p) matrix of process variable measurements (FDC data) X and a corresponding (I × m) matrix of metrology data Y. Projection to Latent Structures or Partial Least Squares (PLS) can be applied to the matrices X and Y to estimate the coefficient matrix B in (1).
Y is the PLS estimate of the process output Y. PLS modeling consists of simultaneous projections of both the X and Y spaces on low dimensional hyper planes of the latent components. This is achieved by simultaneously reducing the dimensions of X and Y, by seeking q (< p) latent variables which mainly explains covariance between X and Y. Therefore this method is useful to obtain a group of latent variables which explain the variability of both, Y and X. The latent variable models for linear spaces are given by Equations (5) and (6) [12]: (6) where E and F are error terms, T is an (I × A) matrix of latent variable scores, and P (p × A) and Q (m × A) are loading matrices that show how the latent variables are related to the original X and Y variables. The sample covariance matrix is X T YY T X. The first PLS latent variable t 1 = Xw 1 is the linear combination of the X-variables that maximizes the covariance between t 1 and the Y space. The first PLS weight vector w 1 is the first eigenvector of the sample covariance matrix X T YY T X. After the scores for the first component have been computed, the columns of X are regressed on t 1 to give a regression vector: In NIPALS (Nonlinear estimation by iterative Partial Least Squares) algorithm [13] the second latent variable t 2 , orthogonal to t 1 , is calculated from the new matrix of covariance X 2 T Y 2 Y 2 T X 2 , where X 2 and Y 2 are calculated by the equations (8) and (9) 1 is obtained by regression of the columns of Y in t 1 , i.e.: The second latent variable is computed by the equation t 2 =Xw 2 , where w 2 is the first eigenvector of the sample covariance matrix X 2 T Y 2 Y 2 T X 2, and so on. The new latent vectors or scores (t 1 , t 2 ,…) and the weight vectors (w 1 , w 2, …) are orthogonal. The final models for X and Y are given by Equations (5) and (6).
Latent variable models assume that both the process and metrology data spaces are observed with error and that both are effectively of very low dimension (i.e. non-full rank). The dimension A of the latent variable space is often quite small compared with the dimension of the process data space, and it is determined by cross-validation or some other procedure. Effectively, these models reduce the dimension of the problem through a projection of the high-dimensional X and Y spaces onto the low-dimensional latent variable space T, which contains most of the important information [12].

Tree Ensemble Models
It has been shown by Breiman et al. [22], in the classification case, that under reasonable assumptions, an ensemble procedure allows getting accurate models. Indeed, if the base model has a low-bias and high variance under some random perturbation of the learning conditions, then aggregating a large family of such models give birth to a low-bias, low variance aggregated model, that is more accurate than the individuals models [15].
To allow such results to hold, it is critical that the individual models are as independently built as possible, while maintaining low bias. Tree base learners, either based on algorithms such as CART [22] or C4.5 [23], are known to have a low bias when fully learned (no pruning) [24]. In order to be able to build families of trees that have a low correlation to one another, from a finite dataset, several methods have been proposed: Bootstrapping the learning set (also known as bagging methods), Random splits, Injecting random noise in the response or building random artificial features as (linear) combinations of the existing ones. All these ideas aim at learning trees that are as uncorrelated as possi-ble.
Following Breiman [22], we use here a combination of bagging from the base learning set, and random splits as our main ensemble method. Base learners are regression trees, following a modified CART algorithm for tree learning. Given X, a set of (I × p) FDC data, and a corresponding Y (I × m) metrology, and 2 parameters q (random selection among features at the individual split level) and nTrees (number of trees grown and aggregated), the algorithm is as follows: 1. Iterating over the m responses: 2. Looping 1 -> nTrees: a) Build a bootstrap sample (I × p) X b and corresponding Y b (I × 1) response b) Build a fully-grown tree τ, following modified CART algorithm i. randomly selecting q<m candidates for a given split inside a node ii. Select the best split among the q candidates as the one that reduces most the residual variances over the 2 children nodes iii. Recursively until stopping criterion is reached, i.e. node is pure (internal variance equal 0) 3. Average the predictions, i.e.
Bagging allows the calculation of the so-called out-of-bootstrap prediction, which is very similar to cross-validation or leave-one-out, since predictions on the learning set are derived by averaging, for each individual, the set of trees in which this individual is not in the bootstrap sample. Hence bagged ensembles have an internal estimation of their generalization error. A well-known property of trees is their inability to model linear effects, which is why, when a strong linear effect from one or several parameters is discovered in the data, we build the tree ensemble on the residual from the main linear effect.
Finally, tree ensembles have internal estimations of the importance of each of the feature that are calculated by averaging their out-of-bootstrap contribution in the prediction for each tree. More precisely, one can estimate the increase in out-of-boostrap error that scrambling one parameter would produce, over several repetition of the scrambling procedure. This allows dropping low importance features from the model by comparing features importance to probes (random features). In the end, the model will be: where lin and nlin are disjoint subsets of the initial set of indicators. R 2 is defined by equation (18) (13) where y i is the measured output value, ŷ i is the estimate output value by model, N is the size of training data set and S y 2 is the variance of y. This metric is estimated from the out-of-bootstrap predictions.

Results
In this section we present the results of two different models (PLS and tree ensembles) for prediction of the PECVD oxide film thickness.

PLS Models with 3 Qualitative Variables
A PLS model is built without input variable selection. The model is calibrated with 306 wafers of the training validation data set. The 168 wafers of the test data set are used to validate the model. The input variables of the PLS model are the 24 quantitative FDC indicators and the three qualitative variables which are chuck, layer and product. The PLS model with four principal components is selected by cross-validation. Q 2 (cum) is increasing for the first four principal components and decreasing for the fifth principal component. Actually, using the first four principal components, 46.5% of the X variability (quantitative and qualitative variables) can explain 89.7% of the output Y (average thickness) variability (see table I). Therefore the best statistical model is achieved by using only four principal components. Analyses have been done on each parameter to quantify its importance. In table II the five most important variables can be found. The trained model is applied to the test data set. In figure 3 the measured and the predicted average oxide film thickness for PLS model are shown. The VRMSE and TRMSE are around 0.53% and 0.58% of the average thickness, respectively. Figure 3 shows the predicted average oxide film thickness using the results of PLS model versus the measured average thickness.

Tree Ensembles Model
Modeling is done using the learning set of 306 wafers. After one iteration of the algorithm, one indicator (X10) is selected for the prelinear part; the remaining 26 are left for tree ensemble modeling, including the three qualitative parameters chuck, product and layer. The second iteration of the algorithm provides a model that selected five parameters: X10 in the linear part of the model, and four in the tree ensemble model (X9, X8, X24 and X25). X24 and 25 are two qualitative parameters (see table III). R 2 is estimated to be 0.84, being defined as equation (13). This metric is estimated from the out-of-bootstrap predictions. Other model quality metric include VRMSE, estimated also from the out-of-boostrap predictions, measured at 0.69% of the average thickness for this model.
The model is then used to predict average oxide film thicknesses for the test set. Figure 4 shows the result of the tree ensembles model. The TRMSE is comparable to the VRMSE and is equal to 0.59% (see table IV). R 2 is calculated to be 0.84 also, for the test set.

Perspectives
In this paper we have been comparing two different types of mathematical modeling (PLS vs. Tree Ensembles). This academic study can be considered as the first step for virtual metrology, and demonstrates that here, with reasonable parameter selection; different algorithms yield similar results in terms of prediction capacity. So modeling capacity will contribute a low part to the algorithmic choices to be made in virtual metrology, we think. However, two main challenges remain to be addressed before using online virtual metrology prediction in Fab environment. The first one is to ensure VM models will be robust enough with time; this has to deal with model update approaches. The second one is to guarantee that predicted values can be trustfully used; this will be addressed by developing an indicator of confidence for each predicted value. The ultimate goal is to provide a reliable prediction that can be used for the CMP step.
The model robustness over time is a key topic that must not be neglected. Many factors can impact the model validity such as the chamber aging, a sudden chamber malfunction, as well as the unscheduled and scheduled preventive maintenances [25]. All these events might lead to a change with time of some collected variables, used as model input on the form of FDC indicators. It is difficult assessing ahead the impact that such changes might have on the model validity. For all these reasons, a static model, built on a leaning dataset with no further updates, doesn't seem to be a sustainable solution. One should prefer an approach based on dynamic models. In that case, many possibilities exist such as a regular update, a data-driven update (based on estimated quality of the model) and a chamber-driven update (based on maintenance events). The second challenge is to provide a predictive quality index in order to guarantee the accuracy of the model prediction. This quality index could be compared to the GOF (goodness-of-fit) available for each measure done in a metrology tool. The quality Index will be a combination of two metrics. In the case of Tree Ensembles modeling, a pre-selection of all available indicators is done; only indicators having a major contribution in predicting the metrology output are kept in the model and use for calculated the predicted metrology values. The others with no or less influence on the metrology output are left aside. However, if one of them changes drastically suddenly, as it might happen when a chamber malfunction is detected, the VM predicted value might be questionable. Indeed, this parameter might have a strong influence on the metrology though this parameter was not kept in the model due to a constant value over time, for instance. Should this happen, the corresponding quality metrics should reflect it. The second metric is related to the quality of the model prediction itself and can roughly be describes as an R-squared (R2). The combination of these 2 metrics should give an indication precise enough to determine whether or not the predicted value can be used in the feed-forward control loop for CMP removal step [26]. If the confidence in the predicted value is too low, thus the wafer should go with no doubt through a real measurement in the metrology tool.

Conclusions
This paper presents two mathematical models that have been used to develop virtual metrology for predicting the average oxide film thickness deposited during a PECVD process. The two models have good predictive strength. PLS and tree ensemble show equivalent performance on the test set, but PLS shows slightly better results on the learning set. This could be explained by the elimination of an outlier value (point with highest measure thickness) in learning set of PLS model. PLS uses four principal components, which are based on all the variables. The tree ensemble model uses five variables only. Three out of the top five most important variables of PLS are used in the tree ensemble model.
The predictive results are in excellent agreement with the measured data. In addition, we have shown as a novelty in virtual metrology that it is possible to create a single model for different layer, different products and for one chamber with two different chucks.
The first results we have had on model update techniques, as well as on building a predictive quality index are very encouraging to reach our final goal which is to have Virtual Metrology running online in the CMP step.

ACKNOWLEDGMENTS
This work was supported by ENIAC project IMPROVE (Implementing Manufacturing science solutions to increase equipment pROductiVity and fab pErformance). Funding by the EU, the FFG and the MINEFI is gratefully acknowledged.