Two heuristic approaches to describe periodicities in genomic microarrays

In the first part we discuss the filtering of panels of time series based on singular value decomposition. The discussion is based on an approach where this filtering is used to normalize microarray data. We point out effects on the periodicity and phases for time series panels. In the second part we investigate time dependent periodic panels with different phases. We align the time series in the panel and discuss the periodogram of the aligned time series with the purpose of describing the periodic structure of the panel. The method is quite powerful assuming known phases in the model, but it deteriorates rapidly for noisy data.


Introduction
Time dependent processes have a wide range of applications in most disciplines of life science such as courses of diseases, concentrations of toxic substances or genetic activity in cells.The investigation of such processes is often done in parallel for several patients, tissues or genes, so that several time series are produced.Considering the same experiment for different patients or genes one can expect common structures for all or at least groups of the time series contained in the panel.Those structures can for example be trends or periodicities.
With the introduction of the microarray technique for monitoring the genetic activity in cells, large data panels are produced both in biology (Spellman et al., 1998;Alter et al., 2000) and medical research (Cooper, 2001).There are both time dependent and time independent studiess.For time dependent experiments one gets panels of time series that typically contain a much larger number of time series than time points.If there are common structures in the panel the isolated investigation of single time series can lead to loss of information because the knowledge of the common structures is not used.In this paper we will discuss two heuristic approaches to benefit from common structures in data panels.Motivated by the investigation of cell cycles, we study periodicities in a genetic microarray sample.
One of the most used benchmark data sets in the analysis of microarray data is the monitoring of the cell cycles for the yeast Saccharomyces cerevisiae by Spellman et al. (1998).Even if this data set is quite old and not studied with a medical problem in mind it is well suited for illustrating and comparing techniques in the analysis of microarray data.Alter et al. (2000) use these data to develop a normalization method of the data set by filtering with respect to a singular value decomposition (SVD).One can get the impression that this is an attempt to extract periodically expressed genes, even though they are not periodic.In the first part we investigate, using simulated data, how the normalization influences the data.We show that the interpretation of SVD-normalized data is not as simple as it may look.So we hope that our work can contribute to reduce the number of pitfalls in the interpretation of microarray data.
On the other hand the discussion whether there are periodicities raises the question how such periodicities should be detected and estimated in a data set.Little has been done in this field and in the description of time dependent microarrays, generally.There are tests to detect periodicities in each single time series like that of Fisher (1929) with applications to panel data sets like microarrays, see e.g.Wichert et al. (2004).These tests are based on the periodogram, i.e. the Fourier analysis which generally deals with periodicities in time series (Brillinger, 1981).They are mostly concerned with the description of one function.There are many cases where it is reasonable to assume that there are more than one process going on in the same basic cycle or with the same base frequency, like the cell cycle in gene expressions or many phenomena connected to daily or annual cycles in biology or geophysics.It is therefore desirable to develop methods to take advantage of the similarities.Such a process often contains several frequencies and in such a case the periodogram calculated from the data gives important information.Diggle and al Wasel (1997) discuss the approach of average periodograms in a panel of biological time series.The problem is that time series in microarray analysis are often very short such that the number of frequencies which can be investigated is very limited.We will discuss an approach to avoid this problem for panels of periodic time series where the frequency structure is the same for all of the time series, but the phase may vary in the panel.By aligning the time series so that the aligned series have (approximately) the same phase, we produce a data vector with very dense observations so that we also can resolve higher frequencies.
2 The singular value decomposition approach Alter et al. (2000) discuss the structure of microarray data sets using the SVD approach.They detect a cyclic behavior in SVD-normalized gene expressions of the yeast Saccharomyces cerevisiae monitored by Spellman et al. (1998).After a brief introduction to the mathematical procedure they use, we will discuss the interpretation of the results using several simulated data sets.

Normalization of gene expression microarrays using SVD
The SVD is a multivariate technique to transform a matrix into a representation in an orthogonal system.Given a matrix Y = {Y ij } i=1,..,N, j=1,..,M containing the expression of N genes at M time points and defining into an N × L eigengene-matrix U, a diagonal L × L eigenexpression-matrix D and an M × L eigenarraymatrix V.The columns of U and V are eigengenes and eigenarrays, respectively.This decomposition is also known as the Karhunen-Loéve decomposition or the Principal Component Analysis (PCA) if the matrix Y is quadratic and positive semi-definite, i.e. a covariance matrix.The diagonal values of the eigenexpression-matrix determine how much of the behavior of Y can be explained by the corresponding eigencomponents in the orthogonal system.Manipulating the eigenexpressions, for example, by removing components, we can filter the matrix Y with respect to the properties described by the filtered eigencomponents.This idea is used in Alter et al. (2000) to normalize microarray data sets by removing undesirable components.We will now briefly introduce the procedure with its main steps to enable us to discuss the results in that paper and extend it to simulated examples.We will not describe the procedure comprehensively and ignore some aspects (pattern inference, degenerated subspace rotation).There is a very detailed description of the entire procedure in Alter et al. (2000).
Let us consider the panel represented in an experiment with realizations y = .,N, j=1,..,M is a zero mean noise term.
1. Normalization of the trend.Let y be a realization of model (2.2).We compute the SVD (2.1) y = udv T .
Motivated by the shape of most microarray data sets we assume M < N such that we have L = M .We will denote the columns {u ij } i=1,...,N and {v ij } i=1,...,M by u j and v j , respectively, and the jth diagonal element of d by d j .To normalize the data we remove the first eigencomponent from the data set, This first eigenexpression represents the eigencomponent explaining the largest proportion of the variance in the data set.In the investigations of Alter et al. (2000) it is interpreted as a constant trend in the data set.
2. Normalization of the variation.Using the trend-normalized data panel y C , we introduce We normalize this scaling parameter as above by computing the SVD of y LV removing its first eigencomponent which represents the largest variation in the data set, and transforming it back, where sgn(y C,ij ) denotes the sign of y C,ij .The panel y N = {y N,ij } is now normalized both with respect to the trend (2.3) and the variation (2.4).
3. Data Sorting.After the normalization we sort the genes with respect to the relative correlation ψ i between their normalized expression y N,i and the first two eigenarrays v N,1 and v N,2 of the normalized panel, i.e. we compute the SVD of the normalized data and then the relative correlation ψ i given by Looking at the two correlations in (2.5) as Cartesian coordinates the relative correlation ψ i is the corresponding angle in polar coordinates.Note that the pair always will be inside the unit circle, as shown in Alter et al. (2000).
We next explore the properties of the technique used in Alter et al. (2000) by applying it to simulated data sets.
2.2 Application of the SVD approach to different data sets Alter et al. (2000) apply the SVD-normalization to the gene expression data of the yeast Saccharomyces cerevisiae monitored by Spellman et al. (1998).This data set is often used as a benchmark in microarray analysis.Different synchronization techniques are used.We will concentrate here on one of them, the elutriation synchronization.
The data set contains the gene expression for N = 5981 genes measured at 30min intervals in approximately one cell cycle with the period length T ≈ 390min, i.e. we have M = 14 time points.Spellman et al. (1998) classify 784 genes as cell cycle regulated.The behavior of these genes with respect to the normalization and sorting procedure is comprehensively discussed by Alter et al. (2000).Very interesting are the different periodicities appearing in the results.Even if not all genes are classified as cell cycle regulated the normalized and sorted data set is periodic both in time and over genes as we can see in the upper left plot of figure 1.In addition there is a discussion of how well the elutriation synchronization is able to extract the periodic properties of gene expression for example in Shedden and Cooper (2002).The arguments are supported by empirical studies in Wichert et al. (2004) and Aßmus (2006).This leads to the question of how to interpret the different periodicities in the normalized and sorted data.
We investigate this model by examining several functions f i and noise terms ε: (2.7) with independent standard normal noise ε added • cos1: The cosine function (2.7) with correlated noise generated by transforming the columns ε j of an independent standard normal matrix ε ε j = Aε j , j = 1, ..., M , and using the vectors ε j as columns of the noise matrix ε.The elements of A are random, uniformly distributed in the interval [−1, 1].
• cos2: A sum of cosine functions with different frequencies with independent standard normal noise ε added with independent standard normal noise ε added • randn: Independent standard normal noise ε, i.e. f i = 0.
We investigated the cosine functions for different choices of the parameters α, β, γ and φ in model (2.6).While the parameters β and γ are more or less motivated by controlling possible artefacts, we will pay more attention to the choice of α and φ.
We simulated each of the functions cos0, cos1, cos2 and cos500 with each of the parameter sets given in table 1.For the noise model randn we will only use a, c, d and e because all other choices are either already included or do not make sense.
Note that we treat neither β, γ or φ as random variables.These are arbitrary fixed values which are randomly chosen.Only for φ we will consider the distribution of the chosen values.
The singular value decomposition is a linear method that rotates a matrix into a system of orthogonal bases, which are sorted with respect to how much of the values in the data set is explained by the several base vectors.Generally, it is a linear method and not designed to take care of time dependencies, i.e. ordered index sets, or nonlinear functions.Nevertheless, assuming a time dependent nonlinear function in the rows we will find a decomposition into a function base in the columns of v and the columns of u.Removing the first component both with respect to trend and variation in (2.3) and (2.4), we remove the component explaining the major part of the trend and the variation in the data set.
The constant α.Alter et al. (2000) interpret the removed first component as a constant trend.Since the normalization removes the component explaining the largest values, we can expect (ignoring the effect of β and γ) that α is removed when it is larger than the values of f i (t j + φ i ).And indeed we obtain very different patterns in the normalized and sorted data assuming α = 0 and α = 0.01 (fig.2:cos0b) on one hand and α = 5 (fig.2:cos0c) on the other.While we get uniformly distributed phases of y N for α = 5 we roughly observe only two different values of the phases of y N for the small values of α.In the latter case the largest part of the variation of y is explained by the cosine oscillation and since the cosine function is decomposed into (2.8) ) is filtered out by the normalization.The first two eigenarrays v N,1 and v N,2 confirm this conclusion (fig.3).In figure 3:cos0c we find for α = 5, that this constant is taken out by the normalization, leaving a clear cosine for v N,1 (red) and a clear sine for v N,2 (blue), while α = 0.01 in figure 3:cos0b produces a sine for the first eigenarray v N,1 (which was the second eigenarray in the other case) and a second eigenarray v N,2 , whose shape is not interpretable as a periodic function.The cosine part is not expressed anymore.In the remaining data the phases φ i are only contained in the coefficients C i and filtered out by the scale filtering (2.4), except their sign.That is why we find two phases with a difference of a half period length for small values of α = 0 and α = 0.01.
The gen specific effect β and the array specific effect γ.A vertical effect β did not affect the results of our experiments.We do not present the corresponding plots, since they are essentially equal to figure 2:cos0c for y N and 3:cos0c for the eigenarrays v N,1 and v N,2 .Obviously, the absolute values of vertical effects are filtered out by the scale normalization (2.4) since the sign is kept as we saw above for C i in (2.8).
In contrast to β we find γ clearly expressed as a horizontal structure in y N (fig.2:cos0d) and as the deviation from a sine function in figure 3:cos0d.
The phases φ.Considering the example cos0c assuming uniformly distributed phases φ one can obtain a relation between the phases φ i , the relative correlation ψ i and the phases in the normalized data y N since all are uniformly distributed: We fitted a Beta(r,s) distribution to ψ i and estimated both parameters very close to 1, i.e. a uniform distribution.For comparison we did the same for φ i , which we know are uniformly distributed (see table 2).Furthermore we find a clear linear relation between φ i and ψ i as it is shown in the upper right plot of figure 4. Translating the cluster Ω = {ψ i > 0, φ i < 0.45} on the right hand side by one period length, we consider the transformed phase φ i , we find the linear regression equation: Table 2: Estimated parameters of the Beta(r,s) distributions r ŝ cos0c φ i 0.9739 0.9805 cos0c ψ i 1.0059 1.0037 cos0f ψ i 1.0100 1.0303 Nevertheless, we can not conclude from the knowledge of ψ i to the properties of φ i .Assuming for example a very skewed distribution of φ i like we did using a downscaled χ 2 1 distribution (see the histogram of φ i in figure 4:cos0f on the left hand side) the distribution of ψ i is uniform, as the estimated parameters of a Beta(r,s) distribution indicate (see table 2) Assuming now a constant phase, all periodic structure is filtered out in the normalized data y N as seen in figure 2:cos0i.Here, the coefficient C i in (2.8) vanishes such that the second component disappears and the first is filtered out.The relative correlations ψ i are still uniformly distributed.
Obviously, the normalization changes the phase structure of a cosine function.We did not investigate other functions here.Possibly this is an effect of the addition theorem allowing the decomposition (2.8) and so a property of the trigonometric functions.
The different models.Comparing the two cosine models cos0 and cos500 the behavior of the normalized and sorted data y N is very similar if we have a strong constant term α (fig.2:cos0c and 1:cos500c), although the pattern looks somewhat clearer for cos0, where the eigenarrays are much closer to cosine/sine functions than those for cos500 (fig.3:cos0c and  cos500c).If not all genes follow a cosine, the first component of (2.8) can not explain the behavior of all genes, and we can therefore expect that a common trend will be filtered out first.For cos500 there are 90% non trigonometrically explainable genes and indeed none of the two trigonometric components are removed (fig.1:cos500b) as it happens for cos0 (recall fig.2:cos0b).We note for all experiments using cos500 that even if only a small part of the genes are cyclic (10%) the normalized expressions behave cyclic for most of the genes.The periodicity "leaks out" to the entire data set.Considering only one arbitrary noncyclic gene it may seem like the normalization creates a periodicity.
The differences between cos0 and cos500 get clearer when we consider the correlations between the normalized data and the first two eigenarrays v N,1 and v N,2 .Considering cos0 both the first two eigenarrays and the normalized genes are cyclic, such that we find a large number of high correlations (fig.6:cos0c) and very few correlations around zero.For cos500 the eigenarrays v N,1 , v N,2 and the normalized data are cyclic as well but more disturbed, so that the correlations are much lower (fig.6:cos500c).For model cos0 with a small constant effect α such that the first cyclic component is filtered out, all rows of y N are still cyclic but the second eigenarray is not, such that we get high correlations with respect to v N,1 but not with respect to v N,2 producing the non radial figure 6:cos0b.
Assuming now the mixed model of two cosine functions (cos2) we get four cyclic eigenarrays since generalizing to K cosine functions we have Assuming now a small constant effect α and filtering out the first cyclic component, the second eigenarray becomes the first component with a higher frequency as shown in figure 3:cos2b, leading to a clearly visible high frequency structure in the normalized data (fig.1:cos2b).Assuming a large α the pattern of the sorted normalized expressions y N is similar to the other two discussed models but the high and the low value regions are narrower (fig.1:cos2c).The high frequency is not visible because it is less expressed in the data.So the correlations shown in figure 6:cos2c are only slightly lower than for cos0 even though the first two eigenarrays show the same clear sine/cosine as in figures 3:cos0c and 3:cos0f.In this case it is very difficult to distinguish between the models cos0 and cos2 only considering the plots of one data set for each model.
The cosine models with independent noise lead to clear patterns.In contrast to that we can not find a similar behavior by adding correlated noise.Except an oscillation at the first two time points of the normalized data y N (fig.2:cos1c), which is difficult to interpret, we can not see any structures, neither in the normalized data nor in the eigenarrays (fig.3:cos1c).
The normalized data or the eigenarrays of the pure Gaussian noise model did not show any clear structures so we did not present them here.
We observed only one effect common for all considered data sets: The first two components of the ordered eigengenes of the normalized data set show an oscillation with a phase difference of N/4 even for the independent noise data set (fig. 5) as well as the observed data of Alter et al. (2000).Probably, this is not a prop-erty of the data set.On the other hand, all data sets we considered were very regular and they all contained periodic structures if we consider the constant function f i (t j ) = 0 as a degenerated periodic function.Maybe we have to use data sets with more irregularities like strongly correlated clusters or nonperiodic functions to change this.
The elutriation data.For the yeast Saccharomyces cerevisiae we consider the entire data set containing the elutriation synchronized genes which was investigated by Alter et al. (2000) (denoted in the plots by elutriation) and the sub-sample of cell cycle regulated genes given in Spellman et al. (1998) (elutriation CCR).Assuming that the CCR genes follow the cell cycle, i.e. are periodically expressed and the entire data set is a mixture of periodically and nonperiodically expressed genes we expect that the normalized elutriation and cell cycle regulated elutriation data sets behave similar to cos500 and cos0 respectively.The plots of y N in the figures 1:elutriation and 1:elutriation CCR indicate that.In both cases we obtain the typical structures of a time dependent effect which is described and interpreted by Alter et al. (2000) and observed in our experiments (fig.2:cos0d).On the other hand the behavior of the correlations differ a lot. Figure 6:elutriation shows that the correlations of the normalized data with respect to the first two eigenarrays fill the unit circle.They reach values closer to 1 than we observed in any simulation, including the pure cosine models cos0 and cos2.At the same time we do not have the empty space around zero as in figure 6:cos0c.Obviously, there must be genes, whose normalized expression follow a linear combination of the first two eigenarrays very tightly, and genes which do not behave like that.The assumption that the CCR-genes could be the first ones is not supported by their correlation plot in figure 6:elutriation CCR.We can not observe special properties like for example a higher frequency of high correlations.It seems to be a subsample of the entire data set.
Considering the phases, we note that we should not be led to conclude from the plots of y N , where the phases are uniformly distributed, that the phase distribution in the raw data set is uniform.In Aßmus (2006) a Beta distribution is fitted to the data with the result that the phases of the periodically expressed genes are not uniformly distributed.A uniform distribution would also have been difficult to interpret from a biological point of view.
Finally, to summarize we conclude that the SVDnormalization as it is investigated in Alter et al. (2000) is a powerful method to remove unwanted effects from a data set if we are able to locate them in the eigenarrays.It should, however, be used cautiously to avoid the removing of effects, that are important for the description of the data and even more because the interpretation of the normalized data is not as obvious as it may appear.

A Discrete Fourier Approach
In the previous section we discussed the singular value decomposition approach to describe microarray panels.As we saw it is difficult to interpret the result since the time dependent structure is changed by the method.Now we will introduce a heuristic method to investigate periodicities in microarray panels.

Basics
Let us first consider only one gene which is periodically expressed and assume that it follows different harmonic cycles with the corresponding frequencies ω k and amplitudes A k .In addition we assume a non frequency dependent phase φ such that we get at a time point t the expression Y , where ε t denotes a zero mean noise term which is independent for different time points.
Considering measurements at M equidistant time points t j , j = 1, ..., M and the specific frequencies ω k = k, a very common and useful approach is the spectral analysis using the discrete Fourier transformation (DFT) where we transform the data from time domain into frequency domain, where ∆t = t j − t j−1 denotes the length of a time interval and i the complex imaginary unit.The values of the Fourier transformed expression Ỹ (ω l ) are complex where the real part corresponds to B k , and the imaginary part to C k in model (3.1).To find how much of the variation in the function is explained by the frequencies ω l we use the periodogram I(ω l ) where Ỹ * (ω l ) is the conjugated complex Fourier transformed expression at ω l and L the largest integer smaller than M/2.In model (3.1) the phase φ is contained in B k and C k .This complex structure is eliminated by the periodogram, i.e. we decompose the time series into the frequency domain and need no knowledge about the phase φ.
The weakness of this approach is that the number of frequencies which can be used is strongly limited especially if the time series is very short as is the case in microarray analysis.But in most microarray experiments we have many short time series of similar shape, each of them containing information about the frequency structure.Furthermore, if we have, for example, uniformly distributed phases as in the normalized elutriation data in the previous section, the entire panel contains more information than each single gene expression.We will use this information by aligning the gene expressions to achieve more dense observations.

The Aligned DFT Approach
Let us now consider a panel Y = {Y ij }, i = 1, ..., N, j = 1, ..., M of time dependent gene expressions for N genes at M time points.Furthermore we assume for all genes common amplitudes A k , a common frequency ω 0 and gene specific phases φ i .Using (3.1) with ω k = kω 0 for each gene we get at each of the time points t = [t 1 , ..., t M ] the model with the realizations y = {y ij }, i = 1, ..., N, j = 1, ..., M .The expression functions of the genes differ only in the phase φ i , i.e. assuming known φ i we have in fact N M realizations of the same function.Diggle and al Wasel (1997) use that the periodogram is phase independent, compute the periodogram for each row corresponding to the genes in microarrays, and investigate the average of the periodograms.This method would be able to improve on the periodogram.On the other hand the number of frequencies we can estimate is strongly restricted in the microarray case since we usually have quite few time points.Assuming M = 14 time points as in the elutriation data we can not consider more than M/2 = 7 frequencies as seen in (3.2).
In Aßmus (2006) the maximum likelihood estimation and a mixed model to estimate parameters in the model are investigated.But since we have to solve complicated equations numerically we are quite restricted in the number of parameters.The maximum likelihood approach gives a possibility to estimate the phases φ i .Once knowing the phases we can align the genes by relocating the time points such that all genes have different time points but the same phase.Now we can put them in one vector such that we have N M data points in an interval which is maximally twice as large as the original time window if we restrict the phases φ to one period φ ∈ [− π ω0 , π ω0 ].Since the time points of the aligned vector are not necessarily equidistant, we have interpolated them to an equidistant lattice τ with M τ points.The interpolated vector can be analyzed by Fourier analysis.
Assuming now a panel y = {y ij }, i = 1, ..., N, j = 1, ..., M as realizations of the model (3.3), as it is shown for one frequency and N = 2 in figure 7a, we suggest the algorithm 1. Estimate the phases φ i for all genes for example using the maximum likelihood estimation denoting the estimates by φi .Aliasing and maximal resolution.Assuming Y (t) in model (3.1) known for all t ∈ R the frequencies ω k are uniquely determined, because for two cosines with fixed amplitude A, a fixed phase φ 0 and frequencies ω and ω it follows from A cos(ω(t + φ 0 )) = A cos(ω (t + φ 0 )) for all t ∈ R that ω = ω .
On the other hand if we consider a data set where we do not know the function for its entire range but only for the measured discrete time points t = t 1 , ..., t M , then the frequencies are not uniquely determined.We have to restrict the frequency range.So the DFT is only computed for a number of discrete frequencies.Assuming one function measured at equidistant time points t j the data do not contain information about how often the function oscillates between two time points so there is an upper bound of the resolution we can achieve.From the Nyquist-Shannon Sampling Theorem which is found in most literature about signal analysis or DFT, e.g.Brillinger (1981), p.179, we know that the highest frequency we can resolve is the so called Nyquist-frequency for the model (3.1), where ∆t = t j − t j−1 .Considering non equidistant time scales, for example assuming missing data, the usual approach is to interpolate the function to an equidistant time scale.We use the Nadaraya-Watsonestimation (3.12) in our experiments to estimate (3.7).
There is a wide range of alternatives such as polynomials, splines, Newton or Lagrange methods (Schwarz, 1988).The maximal resolution is now the Nyquistfrequency of the interpolated time scale.
In our approach the interval length of the time scale for one gene is ∆t = (t M − t 1 )/(M − 1) such that the Nyquist-frequency is After aligning and interpolating the Nyquist-frequency is analogously Since the time intervals [t 1 , t M ] and [τ 1 , τ Mτ ] are of similar size but M τ is much larger than M the resolution of the aligned data set is much higher than for each gene.
We note that the maximal resolution is only dependent on the time scale.This means that assuming an appropriate distribution of phases we can reach a resolution as high as we want for every fixed M even in the extreme case also for M = 1.The only thing we have to do is to increase the number of genes N .We will not discuss here, what properties an appropriate phase distribution must have.The more uniform the distribution of time points in the aligned time vector (3.5) the more appropriate will the distribution be.A single or two point distribution, i.e. one or two fixed phases, will for example not be appropriate.
The Nyquist-frequency does not contain information of which frequencies we actually can find in the aligned data because this depends not only on the time scale but on the distribution of the phases φ i , the estimated phases φ, the interpolation method and the noise level Varε ij Phase estimation.Since the estimated phases are subtracted from the time points in the aligned data [ t, ỹ], the quality of the phase estimation is a strongly limiting factor of the entire procedure.To take the estimation error into account we have to consider each time point in t, tj = tj + η j as a sum of a non random tj and a noise term η j leading to the model It is to be expected that even small variations of φi can destroy a lot of the high frequency structure of the data set.This is a serious problem since the estimation of φ i can be difficult.
In Aßmus (2006) the maximum likelihood estimation of the phases and the amplitudes simultaneously is investigated, assuming Beta(r,s) distributed phases.We fit a Beta distribution to the estimated phases and observe a bias of the estimated parameters r and ŝ for small values of M .Especially for larger true values of r and s there is a strong bias while the distribution of the estimated phases is for r = s = 1 quite close to a uniform distribution.
Definition of the lattice τ .Since the elements of the aligned time scale are random the definition of the lattice τ deserves some attention.
Naively one could assume that it is a natural choice to split the entire interval covered by the aligned time scale into N M −1 intervals such that we get the lattice τ : If not all phases are equal, the aligned time scale t covers a larger range than the original time scale t.Furthermore, assuming uniformly distributed phases φ i there are less observations at the ends of the interval [min( t), max( t)] (fig.8a).If we nevertheless use the naive lattice (3.9) we have at the ends of the covered interval much more time points than observations.Since it makes no sense to interpolate too finely, it is convenient to reduce the range of the lattice to ensure that we have enough observations.One possibility is to define the lattice within the interval [t 1 , t M ].
Using a reduced range for the lattice makes it necessary to reduce M τ as well.In the example seen in figure 8a we can see, that the density of the aligned time points is not necessarily constant on the interval [t 1 , t M ] = [0, 1].Again, it makes no sense to interpolate too finely.We suggest using the area with least number of observations to adjust M τ .One approach is to compute the histogram of the aligned time points in the lattice range, find the interval ∆ l (one bar of the histogram) with the lowest histogram value and choose (3.10) We suggest only an upper bound for M τ because a meaningful choice of M τ depends also on the distribution of the phases.To illustrate this we will only mention two extreme examples, where we use true (not estimated) phases to align the time points.
If the phases are chosen from {k∆t, k = 0, ..., M } the aligned time points {t 1 , ..., t i } are found on the lattice t 1 + j∆t, j = 1, ..., 2M , such that they are equidistant with many observations at each time point (fig.9a).In this case it makes no sense to interpolate at all.We can only use the mean of the aligned expressions for each time point such that we only get more time intervals of the same length, i.e. no increased Nyquistfrequency.The result will be similar to the average periodogram.This case is not very realistic but it il-lustrates the problem of the dependence of the lattice on the distribution of the phases.
In the other extreme case, namely phases uniformly distributed at one time interval ∆t, there is no overlap of the t j − φ i for the different values of j such that we have a uniform distribution as shown in figure 8b for a larger data set.Nevertheless the aligned time scale is not equidistant (fig.9b) such that we have to interpolate.On the other hand we can use the naive lattice (3.9) or a lattice very close to that because of the uniformly distributed values of t.
Assuming the phases uniformly distributed in [− π ω , π ω ] (equivalent to fig.9c) as we do in our experiments motivated by the elutriation data set in the previous section, we can locate the interval ∆ l in (3.10) at the ends of the interval [t 1 , t M ].Because of the special triangle shape of the density (fig.8a) this is a value close to 1 2 N M such that we use this as an upper bound for M τ , leading analogously with (3.9) to the lattice To interpolate on a chosen lattice we have to specify the estimator for the conditional expectation in (3.7).There is a wide range of non-or semi-parametric smoothers.We chose the Nadaraya-Watson-estimator: where K denotes a kernel function and b τ the bandwidth whose choice depends on τ .We will not investigate here the influence of these parameters on the estimate.The Nadaraya-Watson-estimation is a common method which is widely discussed in the literature, e.g.Györfi et al. (1989).

Application to data sets
The focus.In this section we will use simulated data to investigate the effects discussed in the previous section.As we saw we have many possibilities to vary the procedure.There are roughly three main points: -choice of the estimation procedure in step 1, -choice of the lattice τ (3.6), -choice of the interpolation method and its parameters.
We will focus on the influence of the estimation of the phases φ i and not discuss the influence of the interpolation method and the choice of the lattice τ .The data set.We generated data sets containing N genes at M time points using the model for N = 500, 1000, 5000, M = 4, 15, the time points t = [0, 1/(M − 1), ..., 1] and the frequencies ω = [2π, 20π, 40π, 100π].All amplitudes are 1.The noise in all experiments consists of independent standard normal variables and the phase is uniform on [0, 1].In all experiments we use the lattice (3.11), All Nyquist frequencies corresponding to the different lattices defined by N and M are much larger than the highest frequency in the data (table 3).As interpolation method we use the Nadaraya-Watson-estimation (3.12) with a Gaussian kernel and the bandwidth b τ = 5∆τ .
The experiments.For these data sets we will do two experiments, one where we estimate the phases and one where we use the known phases but simulate an estimation error (see (3.8)) such that we can control the variance of these pseudo-estimates of phases, i.e. we start in the second step of the algorithm and use in (3.4) the pertubated true phases, instead of the estimated phases.
Assuming that we know the phases exactly, i.e. σ η = 0, all frequencies in the data are clearly resolved in the periodograms of the aligned data (upper plots of fig.11 and 12.Even if we only have 4 time points we resolved the highest frequency ω 4 = 100π. Increasing the standard deviation of the σ η , the higher frequencies disappear quite fast.Figure 10 shows, how the behavior of the periodogram value corresponding to the frequencies in the data set depends on σ η .The noise levels at which the frequencies are not observable anymore (periodogram value reach zero level) are lowest for the highest frequencies.It matches our anticipation that the highest frequencies disappear first.Already for the low value of σ η = 0.007 the highest frequency ω 4 = 100π is not detectable anymore, while ω 3 = 40π vanishes at σ η = 0.02 and ω 3 = 20π at σ η = 0.035.In the third and fourth row of the figures 11 and 12 we see that ω 2 , ω 3 and ω 4 really are almost not detectable in the periodogram anymore.
Even if the periodogram value for σ η = 0 is much lower for small than for the large sample experiments, the curves reach the zero level approximately for the same σ η independent of the sample size, i.e. increasing the sample size in the investigated ranges gives no essential improvement of the histograms.The sample size determines only the level of the curves and the smoothness such that we have very similar curves for N = 1000, M = 15 (M τ = 7500) and N = 5000, M = 4 (M τ = 10000).Obviously it is the variance of η i which determines if a frequency is resolved or not.Intuitively it is clear that the high frequency effects which determine fine scaled structures are not robust with respect to noise in the arguments.A frequency like ω 4 = 100π has a period length of T = 0.02 and assuming σ η = 0.08 the probability that the error of an aligned time point is more than a half period is Considering now the experiments using the estimated phases in (3.4) we need some additional remarks before starting.The idea of the alignment procedure is to detect frequencies in the data set, but for a maximum likelihood model we must know or at least estimate them to get an estimation of the phases.Ignoring that and only using a one frequency model with a known leading frequency of one gene leads to a dilemma.A strong expression of the high frequencies in the data will lead to bad estimates because we can expect a large model error.If the higher frequencies are weakly expressed in the data they are weakly expressed in the periodogram as well and will be easier dominated by noise effects.
We tried to solve this problem by smoothing the data set to remove the higher frequencies before estimating the phases.The estimates of the phases have standard deviations between 0.22 (N = 5000, M = 15) and 0.38 (N = 500, M = 4), i.e. much larger than the standard deviations of η i where the peaks in the periodogram vanished.And indeed we do not obtain the higher frequencies in the periodograms using estimated phases as shown in the lowest row of the figures 11 and 12.The phase estimates are too bad to be used in a procedure like we discussed here.To solve this problem one has to know more about the phases or improve the estimator dramatically.Since we did not know anything about the phases of the data sets we have available we abstained from an investigation of real data using this method.
These results inspired us to leave this heuristic method and investigate the maximum likelihood estimation of parameters in a cosine model in Aßmus (2006).
, and the clearly visible relation we obtained in the other case disappears as seen in lower right plot of figure 4:cos0f.The phases of the normalized data y N are uniformly distributed as well (fig.2:cos0f).The first two eigenarrays in figure 3:cos0f) do not differ from the corresponding eigenarrays in figure 3:cos0c.

Figure 1 :Figure 2 :
Figure 1: Plots of the normalized and sorted gene expressions y N of the elutriation data and different simulated data sets (title codings are the different functions introduced in Section 2.2 and the parameter sets in table 1.

Figure 3 :Figure 4 :Figure 5 :Figure 6 :
Figure3: The first two eigenarrays v N,1 (solid) and v N,2 (dash) of the normalized and sorted gene expressions of the elutriation data and different simulated data sets (title codings are the different functions introduced in Section 2.2 and the used parameter sets in table 1.

Figure 7 :Figure 8 :Figure 9 :Figure 10 :Figure 11 :
Figure 7: Algorithm for the Fourier analysis for two gene expression function (a) illustrating the alignment (b), the interpolation (c) and the DFT of the interpolated vector (d)

Table 3 :
Nyquist frequencies for the different generated data sets