## Statistics PhD Alumni 2007

### Wei Liu (2007)

TITLE: Statistical Network Comparison

ABSTRACT: The study of dynamical random networks (graphs) has attracted a lot of attention in recent years. Statistics is challenging in this context, because in general only a very small number of observed networks is available. An important statistical problem, considered in this research, is to assess topological dissimilarities between networks.
The proposed approach assesses topological dissimilarities between networks indirectly. The structure of the given networks is destroyed by adding noise (this process is called "scrambling"). The amount of noise necessary in order to make the topologies of the scrambled networks statistically indistinguishable is used as a dissimilarity measure.
To follow this approach one has to decide on its basic ingredients, such as the way to introduce noise, the way to measure the amount of noise, and the test statistic for comparing topologies of the scrambled networks. Three scrambling methods are proposed that, to a certain extent, allow control of the level of scrambling imposed on a network. Topologies of networks are compared via the spectral distributions of their (standardized) adjacency matrices. In fact, moments of these spectral distributions are utilized for testing purposes. This is motivated by a recent result of Bai and Yao (1), who derive a functional central limit theorem for an empirical spectral process based on Wigner matrices indexed by analytic functions. We have extended their results slightly to allow for constant diagonal elements in these matrices. This then allows the application of this result to (standardized) adjacency matrices of networks (graphs) without self-edges.
The proposed methodology is evaluated via simulation studies using model-based networks and is further applied to some protein-protein networks.

Reference: (1) Bai, Z. and Yao, J. On the convergence of the spectral empirical process of Wigner matrices. Bernoulli 11, 1059-1092 (2005).
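
The spectral-moment comparison described in this abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the standardization by the empirical edge density, and the particular scrambling rule (redrawing a fraction of node pairs at random) are assumptions made for illustration, and no test statistic is computed.

```python
import numpy as np

def standardize_adjacency(adj):
    """Center and scale a symmetric 0/1 adjacency matrix without self-edges.

    Assumption: off-diagonal entries are standardized with the empirical
    edge density p; the diagonal is kept constant (zero).
    """
    n = adj.shape[0]
    off = ~np.eye(n, dtype=bool)
    p = adj[off].mean()
    std = np.zeros((n, n))
    std[off] = (adj[off] - p) / np.sqrt(p * (1.0 - p))
    return std

def spectral_moments(adj, max_order=4):
    """Moments of the empirical spectral distribution of the scaled matrix."""
    n = adj.shape[0]
    eigvals = np.linalg.eigvalsh(standardize_adjacency(adj) / np.sqrt(n))
    return np.array([np.mean(eigvals ** k) for k in range(1, max_order + 1)])

def scramble(adj, fraction, rng):
    """Hypothetical scrambling rule: redraw roughly the given fraction of node
    pairs as independent coin flips with the network's own edge density."""
    n = adj.shape[0]
    iu = np.triu_indices(n, k=1)
    upper = adj[iu].copy()
    p = upper.mean()
    redraw = rng.random(upper.size) < fraction
    upper[redraw] = rng.random(redraw.sum()) < p
    out = np.zeros_like(adj)
    out[iu] = upper
    return out + out.T

# Usage: a simulated random graph and a scrambled copy of it.
rng = np.random.default_rng(0)
n = 200
upper = np.triu(rng.random((n, n)) < 0.3, k=1).astype(float)
adj = upper + upper.T
print(spectral_moments(adj))
print(spectral_moments(scramble(adj, 0.5, rng)))
```

In this sketch the scrambling level is controlled by `fraction`; the abstract's three scrambling methods and its measure of the amount of noise are not specified here, so this rule is only a stand-in.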

### Candace Metoyer (2007)

TITLE: Estimation Methods for Linear, Nonlinear, and Multidimensional Time Series: Applications of State-Space Modeling

ABSTRACT: Burman and Shumway (2004) use penalized least-squares to generate estimates for the trend-only linear time series model, Y(t) = T(t) + e(t), where T(t) is called the trend and e(t) is random error. We extend their approach and apply it to the trend plus seasonal linear time series model, Y(t) = T(t) + S(t) + e(t), where S(t) is called the seasonal. We assume that the d-th order trend differences are iid random variables and we assume that the p-th order seasonal sums are iid random variables. Using penalized least-squares, we obtain closed-form expressions for the trend and seasonal estimators. Next, we generalize this method further and consider the class of time series where the distribution of the observation is a member of the exponential family of distributions. We focus on Poisson and Bernoulli time series problems and present an estimation procedure based on the penalized log-likelihood. Last, we consider the class of time series where the observation is a column vector of length M. In this scenario, our first task is dimension reduction. Using a principal components analysis, we reduce the effective dimension from M to m < M, which gives rise to a type of co-integration model. We provide heuristic asymptotic results for all of the estimators and we present applications to real data.
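
A closed-form penalized least-squares trend estimate of the kind mentioned above can be sketched as follows. The abstract does not state the objective, so this assumes the common form in which the squared d-th order differences of the trend are penalized with a weight `lam`; the function names and the parameter `lam` are illustrative, and the seasonal component is omitted.

```python
import numpy as np

def difference_matrix(n, d):
    """(n - d) x n matrix mapping a series to its d-th order differences."""
    D = np.eye(n)
    for _ in range(d):
        D = np.diff(D, axis=0)
    return D

def penalized_trend(y, d=2, lam=10.0):
    """Minimize ||y - T||^2 + lam * ||D_d T||^2 over T.

    Setting the gradient to zero gives T = (I + lam * D'D)^{-1} y.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = difference_matrix(n, d)
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Usage: smooth a noisy linear trend.
rng = np.random.default_rng(1)
t = np.arange(50)
y = 2.0 + 0.3 * t + rng.normal(scale=1.0, size=t.size)
trend = penalized_trend(y, d=2, lam=50.0)
```

Larger values of `lam` push the estimate toward a polynomial of degree d - 1, while `lam = 0` returns the data unchanged.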

### Lu Wang (2007)

TITLE: Penalization and Rank Reduction

ABSTRACT: The Penalized Total Least Squares estimator builds on two well-known least squares estimators for an unknown response surface with additive noise: the Penalized Least Squares estimator and the Total Least Squares estimator. We begin by formulating the estimation problem as the rank-constrained minimization of a penalized least squares problem, which yields the Penalized Total Least Squares estimator and leads us to consider further classes of candidate estimators for the unknown means in order to achieve lower risk. Adaptation selects the estimator within a candidate class that minimizes the estimated risk, which is an unbiased estimator of the risk function. Under the model assumption, such adaptive estimators minimize risk asymptotically over the class of candidate estimators as the number of rows of the matrix tends to infinity. The Penalized Total Least Squares estimator is applied to both simulated and real data; in both cases it outperforms the traditional method in the sense of minimizing risk.
The $\theta$-separable estimator generalizes the idea of the penalized least squares estimator to a broader class. This section deals with the following approach for estimating the mean $m$ of an $n$-dimensional random vector $x$. First, a family $\{A(\theta): \theta \in \Theta\}$ of $n \times n$ matrices is defined. These so-called $\theta$-separable matrices depend on a $p \times 1$ unknown parameter vector $\theta$ and have a special structure in their eigenvalues and eigenvectors. Examples of such estimators include the ANOVA model, ridge regression, and the multiple shrinkage estimator. Then, James-Stein estimation is introduced as minimization of the risk function: an element $A(\tilde{\theta})$, $\tilde{\theta} \in \Theta$, is selected by minimizing the $L_2$ risk function. Because the risk function involves the unknown parameter $m$ and the noise variance, $A(\hat{\theta})$, $\hat{\theta} \in \Theta$, is instead selected by minimizing the estimated risk function, which is a uniformly consistent estimator of the risk function. Selecting estimators by minimizing estimated risk is also known as the Mallows $C_L$ procedure. Generalized cross-validation methods are also introduced.
The two methods are compared both asymptotically and by numerical experiments.
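
Selection by minimizing estimated risk can be sketched for one member of the family mentioned above, ridge regression. This is only an illustration under stated assumptions: the design matrix `X`, the grid `thetas`, and a known noise variance `sigma2` are hypothetical inputs (the abstract treats the variance as unknown), and the function names are not from the dissertation.

```python
import numpy as np

def ridge_hat_matrix(X, theta):
    """A(theta) = X (X'X + theta I)^{-1} X', a linear smoother of x."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + theta * np.eye(p), X.T)

def estimated_risk(x, A, sigma2):
    """Mallows C_L: ||x - A x||^2 + 2 sigma2 tr(A) - n sigma2.

    For a fixed matrix A and noise with variance sigma2, this is unbiased
    for the L2 risk E||A x - m||^2.
    """
    n = len(x)
    resid = x - A @ x
    return resid @ resid + 2.0 * sigma2 * np.trace(A) - n * sigma2

def select_theta(x, X, thetas, sigma2):
    """Pick the grid value of theta with the smallest estimated risk."""
    risks = [estimated_risk(x, ridge_hat_matrix(X, t), sigma2) for t in thetas]
    return thetas[int(np.argmin(risks))], risks

# Usage on simulated data with a known noise variance.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
m = X @ rng.normal(size=10)
x = m + rng.normal(size=100)
theta_hat, risks = select_theta(x, X, [0.0, 0.1, 1.0, 10.0, 100.0], sigma2=1.0)
```

In practice the noise variance would have to be estimated, which is one reason the abstract also considers generalized cross-validation.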

### Jingjing Ye (2007)

TITLE: Preprocessing and Biomarker Detection Analysis for Biological Mass Spectrometry Data

ABSTRACT: Biomarker detection using mass spectrometry has been billed as having high potential to improve public health. It also presents considerable challenges for statistical analysis: high-dimensional data, massive file sizes, noise, and complexity. In this dissertation, I propose methods of preprocessing the spectral data to overcome these difficulties and extract the valuable information contained in mass spectrometry data.
In this talk, we propose a five-step preprocessing algorithm developed for mass spectrometry M/I data. The algorithm consists of imputation of missing intensities, normalization, integration of fractions, transformation, and selection of potential biomarkers. The five-step preprocessing of the M/I spectra is carried out on mass spectrometry glycomics data, an emerging research area for detecting biomarkers. The proposed imputation retains information similar to that in the raw spectrum, and the selection of biomarkers based on statistical models is explored. The algorithm is applied to glycomics prostate and ovarian cancer data, with the selection of biomarkers incorporated in cross-validation for evaluation. With low misclassification error rates, good precision, and visually and clinically confirmed oligosaccharides detected in the process, we conclude that the five-step M/I spectrum algorithm is a good choice for preprocessing and conducting differential expression analysis on mass spectrometry data.
Moreover, methods that linearly combine the selected potential biomarkers to achieve better classification are proposed. We investigate a non-parametric approach that maximizes the area under the curve with constrained threshold gradient direct regularization (TGDR-AUC) on the mass spectrometry ovarian glycomics data. Simulations of the method are conducted, and an asymptotic approximation of the parameters is proved. In the ovarian cancer application, TGDR-AUC is shown to have superior classification performance in both the few-biomarker, large-sample and the many-biomarker, small-sample scenarios. The method can detect clinical biomarkers, which are confirmed to be oligosaccharides, and provides the flexibility of a built-in dimension reduction technique.
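
The quantity that such a method maximizes, the area under the curve of a linear combination of biomarkers, can be computed empirically as in the sketch below. This assumes the curve is the ROC curve and shows only the objective, not the TGDR-AUC procedure itself; the function names, the coefficient vector `beta`, and the toy data are hypothetical.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties count as one half (the Mann-Whitney form of the empirical AUC).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    cases, controls = scores[labels], scores[~labels]
    diff = cases[:, None] - controls[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def auc_of_combination(X, labels, beta):
    """Empirical AUC of the linear score X @ beta."""
    return empirical_auc(np.asarray(X, float) @ np.asarray(beta, float), labels)

# Usage: two hypothetical biomarkers measured on four subjects.
X = np.array([[0.1, 1.0], [0.4, 0.2], [0.35, 0.9], [0.8, 0.5]])
labels = np.array([0, 0, 1, 1])
print(auc_of_combination(X, labels, beta=[1.0, 0.0]))
```

Because the empirical AUC is a step function of `beta`, gradient-based methods typically work with a smooth surrogate of it rather than with this function directly.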
The talk shows mass spectrometry users, step by step, the preprocessing procedures and biomarker detection methods based on the data. Our proposed methods serve the purpose of preprocessing M/I spectra and performing differential expression analysis on disease outcomes. Thus, the methods are competitive in the analysis of biological mass spectrometry data, and if implemented in software, they will be available for mass spectrometry users to conduct their own analyses.