Split and conquer

8/31/2023

If there are extraordinarily large data, too large to fit into a single computer or too expensive to run a computationally intensive analysis on, what should we do? To deal with this problem, we propose a "split-and-conquer" approach and illustrate it using several computationally intensive penalized regression methods, along with theoretical support.

Consider a regression setting of generalized linear models with n observations and p covariates, in which n is extraordinarily large and p is either bounded or goes to ∞ at a certain rate of n. We apply the split-and-conquer approach to the situation where n is too large to perform the aforementioned penalized regression on a single computer or with the computing resources available to us. In this case, without touching the existing penalized regression methods, we propose to randomly split the data of size n into K subsets of size O(n/K). For each subset of data, we perform a penalized regression analysis, and the results from the K subsets are then combined to obtain an overall result.

We show that under mild conditions the combined overall result still retains desired properties of many commonly used penalized estimators, such as model selection consistency and asymptotic normality. Similar to what is reported in the literature, we can establish an upper bound for the expected number of falsely selected variables and a lower bound for the expected number of truly selected variables. Furthermore, we demonstrate that the approach has an inherent advantage of being more resistant to false model selections caused by spurious correlations. In addition, when a computationally intensive algorithm is used, in the sense that its computing expense is of order O(n^a p^b) with a > 1 and b ≥ 0, we show that the split-and-conquer approach can substantially reduce computing time and computer memory requirements. When K is well controlled, we also show that the combined result is asymptotically equivalent to the result of analyzing the entire data all at once (assuming that there is a supercomputer that could carry out such an analysis). The proposed methodology is illustrated numerically using both simulation and real data examples.
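The split-fit-combine procedure described above can be sketched in a few lines of numpy. As a stand-in for the penalized regression, the sketch uses ridge regression (which has a closed-form solution), and it combines the K subset estimates by simple averaging; the actual combination rule in the methodology is more refined, so the function names, the choice of ridge, and the averaging step here are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def split_and_conquer(X, y, K, lam, seed=0):
    # Randomly split the n observations into K subsets of size O(n/K),
    # fit a penalized regression on each subset, then combine the K
    # estimates (here: a plain average, as a simplified combination step).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    betas = [ridge_fit(X[s], y[s], lam) for s in np.array_split(idx, K)]
    return np.mean(betas, axis=0)

# Toy check on simulated data: the combined estimate should be close
# to the estimate obtained by fitting all n observations at once.
rng = np.random.default_rng(1)
n, p = 10_000, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

beta_full = ridge_fit(X, y, lam=1.0)
beta_sc = split_and_conquer(X, y, K=10, lam=1.0)
print(np.max(np.abs(beta_sc - beta_full)))  # small discrepancy
```

The computational payoff is also easy to see from the cost bound in the text: if the fitting algorithm costs on the order of n^a p^b with a > 1, then running it K times on subsets of size n/K costs K·(n/K)^a p^b = n^a p^b / K^(a-1), i.e. a K^(a-1)-fold reduction, and each run also needs only a 1/K-sized slice of the data in memory.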