Overview

In this post we present a new estimation procedure for functional linear regression, useful when the regression surface – or curve – is assumed to be exactly zero on specific regions of its domain. Our approach relies on regularization, combining a B-spline representation of the unknown coefficient function with a tailored overlap group lasso penalty. The methodology is illustrated on the well-known Swedish mortality dataset and can be employed by \({\tt R}\) users through the package \({\tt fdaSP}\).

Introduction

We consider the framework in which a functional response \(y_i(s)\), with \(s \in \mathcal{S}\), is observed together with a functional covariate \(x_i(t)\), with \(t \in \mathcal{T}\), for \(i=1, \dots, n\). The simplest modeling strategy for function-on-function regression is the concurrent model, which assumes that \(\mathcal{T} = \mathcal{S}\) and that the covariate \(x(\cdot)\) influences \(y(s)\) only through its value \(x(s)\) at the domain point \(s \in \mathcal{S}\). When both variables are centered, the relation is given by

\[ y_i(s) = x_i(s)\psi(s) + e_i(s)\]

where \(\psi(s)\) is the regression function and \(e_i(s)\) is a zero-mean functional random error. A more general approach, the nonconcurrent functional linear model, allows \(y_i(s)\) to depend on the entire functional regressor in the following way

\[ y_i(s) = \int_{\mathcal{T}} x_i(t) \psi(t, s) dt + e_i(s) . \] This model allows \(\mathcal{T} \neq \mathcal{S}\), and the bivariate function \(\psi(t, s)\) represents the impact of \(x(\cdot)\) evaluated at \(t \in \mathcal{T}\) on \(y_i(s)\); it is usually a “dense” function. Our goal is to introduce a nonconcurrent functional linear model that allows for local sparsity patterns. Specifically, we want \(\psi(t, s) = 0\) for \((t, s) \in \mathcal{D}_0\), with \(\mathcal{D}_0\) a suitable subset of the domain \(\mathcal{D} = \mathcal{T} \times \mathcal{S}\) of \(\psi(\cdot)\), thus inducing locally sparse Hilbert–Schmidt operators. Note that the set \(\mathcal{D}_0\) is unknown and needs to be estimated from the data.

Locally sparse functional model

B-splines and sparsity

To represent functional objects through basis expansions, we select a basis \(\{\theta_l(s), l = 1, \dots , L\}\) of dimension \(L\) in the space of square integrable functions on \(\mathcal{S}\) and a basis \(\{\varphi_m(t), m = 1, \dots ,M\}\) of dimension \(M\) in the space of square integrable functions on \(\mathcal{T}\). Exploiting a tensor product expansion of the two, we represent the regression coefficient \(\psi\) as

\[\begin{eqnarray} \psi(t,s) &=& \sum_{m=1}^M \sum_{l=1}^L \psi_{ml} \varphi_m(t) \theta_l(s) \nonumber \\ &=& (\varphi_1(t), \dots, \varphi_M(t)) \left( \begin{matrix} \psi_{1,1} & \cdots & \psi_{1,L} \\ \vdots & \ddots& \vdots \\ \psi_{M,1} & \cdots & \psi_{M,L} \end{matrix}\right) \left( \begin{matrix} \theta_{1}(s)\\ \vdots \\ \theta_{L}(s) \end{matrix}\right) = \boldsymbol{\varphi}(t)^T \boldsymbol{\Psi} \boldsymbol{\theta}(s), \label{eq:tensorproduct} \end{eqnarray}\]

where \(\psi_{ml} \in \mathbb{R}\) for \(l=1, \dots, L\) and \(m=1, \dots, M\). A key point is to assume that the basis functions in this representation are B-splines (De Boor 1978) of order \(d\) with \(L − d\) and \(M − d\) interior knots, respectively. Suitable zero patterns in the B-spline coefficient matrix \(\boldsymbol{\Psi}\) induce sparsity in \(\psi(t, s)\). Let \(\tau_1 < \dots < \tau_m < \dots < \tau_{M−d+2}\) and \(\sigma_1 < \dots < \sigma_l < \dots < \sigma_{L−d+2}\) denote the knots defining the tensor product splines, with \(\tau_1, \tau_{M−d+2}\) and \(\sigma_1, \sigma_{L−d+2}\) being the boundaries of the two domains, and let \(\mathcal{D}_{ml} \subset \mathcal{D}\) be the rectangular subset of \(\mathcal{D}\) defined as \(\mathcal{D}_{ml} = (\tau_m, \tau_{m+1}) \times (\sigma_l, \sigma_{l+1})\) for \(m = 1, \dots ,M − d + 1\) and \(l = 1, \dots , L − d + 1\). To obtain \(\psi(t, s) = 0\) for each \((t, s) \in \mathcal{D}_{ml}\), it is sufficient that all the coefficients \(\psi_{m'l'}\) with \(m' = m, \dots ,m + d − 1\) and \(l' = l, \dots , l+d−1\) are jointly zero. The following figure further clarifies this concept.

Bivariate example with tensor product cubic B-splines (d = 4). The top row shows different coefficient matrix patterns, while the bottom row shows the corresponding spline, where dots represent knots and the zero set is highlighted in red. The first two columns show a coefficient matrix with isolated zeros and with a (d−1) × (d−1) block of zeros, respectively; neither produces a sparse function. In the last column, conversely, an entire d × d block of coefficients is null and the resulting function is indeed sparse.

The previous figure suggests that, in general, \(\psi(t, s)\) equals zero in the region identified by two pairs of consecutive knots whenever the related \(d \times d\) block of coefficients of \(\boldsymbol{\Psi}\) is entirely set to zero. Therefore, \(\boldsymbol{\Psi}\) should be suitably partitioned into blocks of dimension \(d \times d\) on which a joint sparsity penalty is imposed.
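To make this concrete, here is a minimal \({\tt R}\) sketch, based on the \({\tt splines}\) package rather than on \({\tt fdaSP}\), that builds a tensor product cubic B-spline basis, zeroes a \(d \times d\) block of coefficients, and verifies numerically that \(\psi(t,s)\) vanishes on the corresponding knot rectangle \(\mathcal{D}_{ml}\). Dimensions, knot placements, and variable names are illustrative.

```r
library(splines)

d <- 4            # spline order (cubic)
M <- 8; L <- 8    # number of basis functions in each direction

# Distinct knots: tau_1 < ... < tau_{M-d+2} on [0,1], and similarly for sigma
tau_knots   <- seq(0, 1, length.out = M - d + 2)
sigma_knots <- seq(0, 1, length.out = L - d + 2)

# Full knot vectors for splineDesign: boundary knots repeated d times in total
tau_full   <- c(rep(0, d - 1), tau_knots,   rep(1, d - 1))
sigma_full <- c(rep(0, d - 1), sigma_knots, rep(1, d - 1))

# Evaluation grids
t_grid <- seq(0, 1, length.out = 200)
s_grid <- seq(0, 1, length.out = 200)

Phi_t   <- splineDesign(tau_full,   t_grid, ord = d)   # 200 x M (basis in columns)
Theta_s <- splineDesign(sigma_full, s_grid, ord = d)   # 200 x L

# Coefficient matrix with the d x d block associated with D_{3,3} set to zero,
# i.e. psi_{m'l'} = 0 for m', l' = 3, ..., 3 + d - 1
Psi <- matrix(rnorm(M * L), M, L)
Psi[3:(3 + d - 1), 3:(3 + d - 1)] <- 0

# psi(t_i, s_j) on the grid
psi_surf <- Phi_t %*% Psi %*% t(Theta_s)

# Check: the surface vanishes on D_{3,3} = (tau_3, tau_4) x (sigma_3, sigma_4)
in_t <- t_grid > tau_knots[3]   & t_grid < tau_knots[4]
in_s <- s_grid > sigma_knots[3] & s_grid < sigma_knots[4]
max(abs(psi_surf[in_t, in_s]))   # numerically zero
```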

Note that in the simpler situation where a functional covariate \(x_i(t)\) and a scalar response \(y_i\) are observed, we have the following simplifications. The functional linear model is

\[ y_i = \int x_i(t) \psi(t) dt + e_i, \] the regression function is expanded as \(\psi(t) = \sum_{m=1}^M \psi_m \varphi_m(t)\), and the set \(\mathcal{D}_m = (\tau_m, \tau_{m+1})\) is an interval of the real line. To obtain \(\psi(t) = 0\) for each \(t \in \mathcal{D}_{m}\), it is sufficient that all the coefficients \(\psi_{m'}\) with \(m' = m, \dots ,m + d − 1\) are jointly zero; in other words, \(\psi(t)\) equals zero in the interval identified by a pair of consecutive knots if the related \(d\)-dimensional subvector of coefficients is entirely set to zero. This behaviour is illustrated in the next figure.

Univariate cubic B-spline basis and resulting spline functions. Dashed curves correspond to bases with a zero-valued coefficient. Only when d = 4 consecutive coefficients are zero is the resulting spline function null on a set of positive Lebesgue measure (in red).

Overlap group Lasso

With the above B-spline sparsity properties in mind, in order to estimate a locally sparse regression surface we minimize the following objective function

\[ \frac{1}{2} \sum_{i=1}^n \int \bigg( y_i(s) - \int x_i(t)\psi(t,s) dt \bigg)^2 ds + \lambda \Omega(\boldsymbol{\Psi}), \]

However, instead of specifying the penalty \(\Omega(\cdot)\) as the widely employed Lasso (Tibshirani, 1996) or group Lasso (Yuan and Lin, 2006), neither of which works properly here (see the paper for details), we propose a penalty truly tailored to the problem. First, instead of a disjoint partition, we define an overlapping sequence of blocks of size \(d \times d\). Specifically, we introduce the block index \(b = 1, \dots , B\), with the total number of blocks being \(B = (M − d + 1) \times (L − d + 1)\); notably, there is one block for each set \(\mathcal{D}_{ml}\). This overlapping group structure allows \(\mathcal{D}_0\) to be the union of any collection of sets \(\mathcal{D}_{ml}\), obtained by sliding a block of minimum size over the coefficient matrix. An example of the overlapping covering when \(d=4\) (cubic splines) is shown in the following figure.

Overlapping covering of the coefficient matrix (L = M = 12) with the first 4 blocks of size d × d, d = 4. Colors according to the balancing vector.

This construction suggests specifying a penalty \(\Omega\) for overlapping groups of coefficients, a topic that has attracted significant interest in the last decade; see Jacob, Obozinski, and Vert (2009), Jenatton, Audibert, and Bach (2011), and Lim and Hastie (2015). Being interested in the sparsity structure of the coefficient matrix \(\boldsymbol{\Psi}\) rather than in its support, we particularize the previous problem as

\[ \frac{1}{2} \sum_{i=1}^n \int \left( y_i(s) - \int x_i(t) \psi(t,s) dt \right)^2 ds + \lambda \sum_{b=1}^{B+1} || c_{b} \odot \boldsymbol{\psi} ||_2, \]

where \(\lambda > 0\) is a fixed penalization parameter and \(\Omega\) specializes to the sum of \(B + 1\) Euclidean norms \(\lVert c_b \odot \boldsymbol{\psi} \rVert_2\), where \(\boldsymbol{\psi} = \text{vec}(\boldsymbol{\Psi})\) and \(\odot\) denotes the Hadamard product. The index \(b\) identifies the block of coefficients in \(\boldsymbol{\Psi}\), with the first \(B\) blocks being consistent with the aforementioned construction and the last block containing all the coefficients in \(\boldsymbol{\Psi}\). The vectors \(c_b\), of size \(ML\), extract the correct subset of entries of \(\boldsymbol{\Psi}\) and also contain known constants that equally balance the penalization of the coefficients. This balancing is needed because the parameters close to the boundaries appear in fewer groups than the central ones. Note that this penalty constitutes a special case of the norm defined by Jenatton, Audibert, and Bach (2011).
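As an illustration, the following \({\tt R}\) sketch builds the overlapping block structure and one possible choice of balancing vectors \(c_b\): each coefficient is weighted by the inverse square root of the number of blocks it belongs to, so that boundary and central coefficients receive comparable total penalization. The exact balancing constants used in the paper and in \({\tt fdaSP}\) may differ; this sketch is only meant to convey the construction.

```r
d <- 4; M <- 12; L <- 12

# One block for each pair (m, l), m = 1,...,M-d+1, l = 1,...,L-d+1
blocks <- expand.grid(m = 1:(M - d + 1), l = 1:(L - d + 1))
B <- nrow(blocks)   # (M - d + 1) * (L - d + 1)

# Indices (in vec(Psi), column-major order) of the d x d block starting at (m, l)
block_idx <- lapply(seq_len(B), function(b) {
  m <- blocks$m[b]; l <- blocks$l[b]
  as.vector(outer(m:(m + d - 1), (l:(l + d - 1) - 1) * M, "+"))
})

# Number of blocks each coefficient belongs to: boundary coefficients appear
# in fewer blocks than central ones, hence the need for balancing
n_groups <- tabulate(unlist(block_idx), nbins = M * L)

# Balancing vectors c_b (one possible choice): 1/sqrt(#groups) on the block, 0 elsewhere
c_list <- lapply(block_idx, function(idx) {
  cb <- numeric(M * L); cb[idx] <- 1 / sqrt(n_groups[idx]); cb
})
c_list[[B + 1]] <- rep(1, M * L)   # last block: all coefficients of Psi

# Overlap group lasso penalty evaluated at psi = vec(Psi)
penalty <- function(psi, lambda)
  lambda * sum(vapply(c_list, function(cb) sqrt(sum((cb * psi)^2)), numeric(1)))
```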

In the case of a scalar response, the objective function becomes

\[ \frac{1}{2} \sum_{i=1}^n \left( y_i - \int x_i(t) \psi(t) dt \right)^2 + \lambda \sum_{b=1}^{B+1} || c_{b} \odot \boldsymbol{\psi} ||_2 \]

where \(\boldsymbol{\psi}\) and \(c_b\) are vectors of dimension \(M\), \(B = M − d + 1\), and each of the first \(B\) blocks contains \(d\) consecutive coefficients, as in the following example.

Overlapping covering of the coefficient vector (M = 12) when the response is scalar with the first 4 blocks of size d = 4. Colors according to the balancing vector.

Computational Considerations

To develop an efficient computational strategy, we introduce the empirical counterparts of the quantities described in the previous section, assuming that we observe a sample of response curves \(y_i\), \(i=1, \dots,n\), on a common grid of \(G\) points, i.e. \(y_i = (y_i(s_1), \dots, y_i(s_G))^T\). Let also \(x_i\) be the related functional covariate, observed on a possibly different grid of points (common across \(i\)) that, for simplicity and without loss of generality, we assume to be of length \(G\). Let \(\mathbf{X}\) be the \(n \times G\) matrix with the \(x_i\) in its rows. Let \(\boldsymbol{\Phi}\) and \(\boldsymbol{\Theta}\) be the \(M \times G\) and \(L \times G\) matrices defined as \[ \boldsymbol{\Phi} = \begin{pmatrix} \varphi_1(t_1) & \cdots & \varphi_1(t_G) \\ \vdots&&\vdots \\ \varphi_m(t_1) & \cdots & \varphi_m(t_G) \\ \vdots&&\vdots \\ \varphi_M(t_1) & \cdots & \varphi_M(t_G) \\ \end{pmatrix}, \quad \quad \boldsymbol{\Theta} = \begin{pmatrix} \theta_1(s_1) & \dots & \theta_1(s_G) \\ \vdots&&\vdots \\ \theta_l(s_1) & \cdots & \theta_l(s_G) \\ \vdots&&\vdots \\ \theta_L(s_1) & \cdots & \theta_L(s_G) \\ \end{pmatrix}, \] and let \(\mathbf{Y}\) and \({\bf E}\) be the \(n\times G\) matrices obtained as \(\mathbf{Y} = (y_1, \dots, y_n)^T\) and \(\mathbf{E} = (e_1, \dots, e_n)^T\), with \(e_i = (e_i(s_1), \dots, e_i(s_G))^T\). The function-on-function linear regression model can then be equivalently written in matrix form as \[ \mathbf{Y} = \mathbf{X} \boldsymbol{\Phi}^T \boldsymbol{\Psi} \boldsymbol{\Theta} + {\bf E}. \]

Applying the vectorization operator to each side of the equality above and using the identity \(\text{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^T \otimes \mathbf{A})\text{vec}(\mathbf{B})\), we obtain the new optimization problem

\[ \frac{1}{2}\Vert \mathbf{y} - \mathbf{Z} \boldsymbol{\psi}\Vert_2^2 + \lambda \sum_{b=1}^{B+1} \Vert \mathbf{D}_{b}\boldsymbol{\psi} \Vert_2, \]

where \(\mathbf{y}=\text{vec}(\mathbf{Y})\), \(\boldsymbol{\psi}=\text{vec}(\boldsymbol{\Psi})\) is the vector of coefficients of dimension \(LM\), \(\mathbf{Z} = \boldsymbol{\Theta}^T \otimes \mathbf{X} \boldsymbol{\Phi}^T\), and \(\mathbf{D}_b=\mathrm{diag}(c_b)\) is a diagonal matrix whose elements correspond to the elements of the vector \(c_b\). The minimization of the quantity above is challenging because of the non-separability of the overlap group Lasso penalty. We propose a Majorization-Minimization (MM) algorithm (Lange, 2016) to obtain the solution, although other choices are viable, e.g. the Alternating Direction Method of Multipliers (ADMM); see Boyd et al. (2011). The MM procedure works in two steps: in the first step a majorizing function \(\mathcal{Q}(\boldsymbol{\psi}\vert\widehat{\boldsymbol{\psi}}^{k})\) based on the current estimate is determined, and in the second step this function is minimized. Alternating between the two steps guarantees convergence to a solution of the original problem. The curious reader can find the expression of \(\mathcal{Q}(\boldsymbol{\psi}\vert\widehat{\boldsymbol{\psi}}^{k})\) in the original paper; here we just point out that the function is quadratic and its minimization admits an explicit solution in the form of a generalized ridge regression. A further speed-up, when the dimension of the problem is high, is obtained by exploiting the Sherman-Morrison-Woodbury matrix identity, also known as the matrix inversion lemma.
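The surrogate \(\mathcal{Q}\) used in the paper is not reproduced here, but the flavour of the algorithm can be conveyed with the standard quadratic majorization of a Euclidean norm, \(\lVert v \rVert_2 \le \lVert v \rVert_2^2 / (2\lVert v^{(k)} \rVert_2) + \lVert v^{(k)} \rVert_2/2\), which makes each MM step a generalized ridge regression. The \({\tt R}\) sketch below is a didactic implementation based on this majorization, not the \({\tt fdaSP}\) code: the small constant \({\tt eps}\) guards against division by zero and stands in for a more careful treatment of groups that are exactly zero.

```r
# Didactic MM solver for 0.5 * ||y - Z psi||^2 + lambda * sum_b ||c_b * psi||_2,
# with Z = kronecker(t(Theta), X %*% t(Phi)) and c_list built as in the previous sketch
mm_overlap_grplasso <- function(y, Z, c_list, lambda,
                                maxit = 500, tol = 1e-8, eps = 1e-10) {
  p   <- ncol(Z)
  ZtZ <- crossprod(Z)                         # Z^T Z
  Zty <- crossprod(Z, y)                      # Z^T y
  psi <- solve(ZtZ + diag(lambda, p), Zty)    # ridge start

  for (k in seq_len(maxit)) {
    # Majorize each ||c_b * psi||_2 at the current iterate: the surrogate is
    # quadratic with a diagonal penalty matrix lambda * diag(w)
    w <- numeric(p)
    for (cb in c_list) {
      w <- w + cb^2 / max(sqrt(sum((cb * psi)^2)), eps)
    }
    # Minimize the surrogate: a generalized ridge regression
    psi_new <- solve(ZtZ + lambda * diag(w), Zty)
    if (sqrt(sum((psi_new - psi)^2)) < tol * (1 + sqrt(sum(psi^2)))) {
      psi <- psi_new
      break
    }
    psi <- psi_new
  }
  drop(psi)
}
```

With this majorization the coefficients of zeroed groups shrink towards zero but are never exactly zero in finite arithmetic, so a final thresholding (or a different solver, such as ADMM) is typically used to read off the sparsity pattern.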

When the response is scalar, we have the following modifications: the design matrix is \(\mathbf{Z} = \mathbf{X} \boldsymbol{\Phi}^T\), \(\mathbf{y}\) and \(\boldsymbol{\psi}\) are vectors of dimension \(n\) and \(M\), respectively, and the diagonal matrix \(\mathbf{D}_{b}\) is of dimension \(M \times M\). The optimization problem is still valid and the same class of algorithms can be employed to obtain the solution.

Swedish mortality revisited

We apply the proposed locally sparse estimator to the Swedish mortality dataset, where the aim is to predict the log-hazard function of a given year from the same quantity in the previous year. We implement the model with \(d = 4\) and \(M = L = 20\) basis functions in each dimension, and select the optimal value of \(\lambda\) by means of cross-validation. A perspective plot of the estimated surface is depicted in the last figure of the post. The estimate shows a marked positive diagonal, confirming the positive influence of the previous year’s curve, evaluated in a neighborhood of age \(s\), on the log-hazard rate at age \(s\). At the same time, the flat zero regions away from the diagonal suggest that there is no influence of the curves evaluated at distant ages. Our estimate is more regular than those obtained with previous approaches, and its qualitative interpretation is sharper and easier; compare, for example, Figure 10.11 of Ramsay, Hooker, and Graves (2009).

Estimated regression surface for the Swedish mortality dataset.

This result demonstrates the practical relevance of adopting the proposed approach. Indeed, the resulting estimate, while reminiscent of a concurrent model (and inheriting its ease of interpretation), gives further insights and improves the fit, representing the desired intermediate solution between the concurrent and nonconcurrent models.

\({\tt R}\) package

The method is implemented in \({\tt R}\) through the package \({\tt fdaSP}\), available on CRAN. The function \({\tt f2fSP}\) fits the function-on-function regression model, while \({\tt f2sSP}\) handles the scalar-response case. The two counterparts \({\tt f2fSP\_cv}\) and \({\tt f2sSP\_cv}\) can be used to select the tuning parameter by means of cross-validation.
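A minimal usage sketch follows. The argument names and call pattern shown here are assumptions for illustration only and may not match the actual \({\tt fdaSP}\) interface; please consult \({\tt ?f2fSP}\) and \({\tt ?f2fSP\_cv}\) for the exact signatures.

```r
# Hypothetical call pattern; argument names below are illustrative assumptions,
# not the documented fdaSP signatures -- see ?f2fSP and ?f2fSP_cv on CRAN.
library(fdaSP)

# Y: n x G matrix of response curves on a common grid
# X: n x G matrix of covariate curves on a common grid
# fit_cv <- f2fSP_cv(Y, X, ...)   # cross-validation over a grid of lambda values
# fit    <- f2fSP(Y, X, ...)      # fit at the selected value of lambda
```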

This post is based on

Bernardi, M., Canale, A. and Stefanucci, M. (2022). Locally Sparse Function-on-Function Regression, Journal of Computational and Graphical Statistics, 32:3, 985-999, DOI: 10.1080/10618600.2022.2130926

References

Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288.
Yuan, M., and Lin, Y. (2006), “Model Selection and Estimation in Regression with Grouped Variables,” Journal of the Royal Statistical Society, Series B, 68, 49–67.
Jacob, L., Obozinski, G., and Vert, J.-P. (2009), “Group Lasso with Overlap and Graph Lasso,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 433–440.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2011), “Structured Variable Selection with Sparsity-Inducing Norms,” Journal of Machine Learning Research, 12, 2777–2824.
Lim, M., and Hastie, T. (2015), “Learning Interactions via Hierarchical Group-Lasso Regularization,” Journal of Computational and Graphical Statistics, 24, 627–654.
Lange, K. (2016), MM Optimization Algorithms, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J. (2011), “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, 3, 1–122.
Ramsay, J. O., Hooker, G., and Graves, S. (2009), Functional Data Analysis with R and MATLAB, New York: Springer.

About the authors

Mauro Bernardi is Associate Professor at the University of Padua.

Antonio Canale is Associate Professor at the University of Padua.

Marco Stefanucci is Assistant Professor at the University of Rome Tor Vergata.