Methodology
Showing new listings for Monday, 30 September 2024
- [1] arXiv:2409.18358 [pdf, other]
Title: A Capture-Recapture Approach to Facilitate Causal Inference for a Trial-eligible Observational Cohort
Subjects: Methodology (stat.ME)
Background: We extend recently proposed design-based capture-recapture methods for prevalence estimation among registry participants, in order to support causal inference among a trial-eligible target population. The proposed design for CRC analysis integrates an observational study cohort with a randomized trial involving a small representative study sample, and enhances the generalizability and transportability of the findings. Methods: We develop a novel CRC-type estimator derived via multinomial distribution-based maximum-likelihood that exploits the design to deliver benefits in terms of validity and efficiency for comparing the effects of two treatments on a binary outcome. Additionally, the design enables a direct standardization-type estimator for efficient estimation of general means (e.g., of biomarker levels) under a specific treatment, and for their comparison across treatments. For inference, we propose a tailored Bayesian credible interval approach to improve coverage properties in conjunction with the proposed CRC estimator for binary outcomes, along with a bootstrap percentile interval approach for use in the case of continuous outcomes. Results: Simulations demonstrate the proposed estimators derived from the CRC design. The multinomial-based maximum-likelihood estimator shows benefits in terms of validity and efficiency in treatment effect comparisons, while the direct standardization-type estimator allows comprehensive comparison of treatment effects within the target population. Conclusion: The extended CRC methods provide a useful framework for causal inference in a trial-eligible target population by integrating observational and randomized trial data. The novel estimators enhance the generalizability and transportability of findings, offering efficient and valid tools for treatment effect comparisons on both binary and continuous outcomes.
- [2] arXiv:2409.18392 [pdf, other]
Title: PNOD: An Efficient Projected Newton Framework for Exact Optimal Experimental Designs
Comments: 24 pages, 9 figures
Subjects: Methodology (stat.ME); Optimization and Control (math.OC)
Computing the exact optimal experimental design has been a longstanding challenge in various scientific fields. This problem, when formulated using a specific information function, becomes a mixed-integer nonlinear programming (MINLP) problem, which is typically NP-hard, thus making the computation of a globally optimal solution extremely difficult. The branch and bound (BnB) method is a widely used approach for solving such MINLPs, but its practical efficiency heavily relies on the ability to solve continuous relaxations effectively within the BnB search tree. In this paper, we propose a novel projected Newton framework, combined with a vertex exchange method for efficiently solving the associated subproblems, designed to enhance the BnB method. This framework offers strong convergence guarantees by utilizing recent advances in solving self-concordant optimization and convex quadratic programming problems. Extensive numerical experiments on A-optimal and D-optimal design problems, two of the most commonly used models, demonstrate the framework's promising numerical performance. Specifically, our framework significantly improves the efficiency of node evaluation within the BnB search tree and enhances the accuracy of solutions compared to state-of-the-art methods. The proposed framework is implemented in an open-source Julia package called \texttt{this http URL}, which opens up possibilities for its application in a wide range of real-world scenarios.
- [3] arXiv:2409.18527 [pdf, other]
Title: Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations
Subjects: Methodology (stat.ME)
Simulation studies are commonly used in methodological research for the empirical evaluation of data analysis methods. They generate artificial data sets under specified mechanisms and compare the performance of methods across conditions. However, simulation repetitions do not always produce valid outputs, e.g., due to non-convergence or other algorithmic failures. This phenomenon complicates the interpretation of results, especially when its occurrence differs between methods and conditions. Despite the potentially serious consequences of such "missingness", quantitative data on its prevalence and specific guidance on how to deal with it are currently limited. To this end, we reviewed 482 simulation studies published in various methodological journals and systematically assessed the prevalence and handling of missingness. We found that only 23.0% (111/482) of the reviewed simulation studies mention missingness, with even fewer reporting frequency (92/482 = 19.1%) or how it was handled (67/482 = 13.9%). We propose a classification of missingness and possible solutions. We give various recommendations, most notably to always quantify and report missingness, even if none was observed, to align missingness handling with study goals, and to share code and data for reproduction and reanalysis. Using a case study on publication bias adjustment methods, we illustrate common pitfalls and solutions.
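The recommendation to always quantify and report missingness, even if none was observed, can be illustrated with a minimal sketch. The estimator `fit_method`, its failure probability, and the summary fields are hypothetical and are not taken from the paper; the point is only that failed repetitions are counted and reported next to the performance measure rather than silently dropped.

```python
import random
import statistics


def fit_method(data, fail_prob, rng):
    """Hypothetical estimator: returns the sample mean, or None to mimic non-convergence."""
    if rng.random() < fail_prob:
        return None
    return statistics.fmean(data)


def run_simulation(n_reps=1000, n_obs=30, fail_prob=0.1, seed=1):
    """Run n_reps repetitions and report missingness together with the bias."""
    rng = random.Random(seed)
    estimates = []
    n_missing = 0
    for _ in range(n_reps):
        # Artificial data with true mean 0, so the mean estimate equals the bias.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        est = fit_method(data, fail_prob, rng)
        if est is None:
            n_missing += 1  # count the failure instead of discarding it silently
        else:
            estimates.append(est)
    return {
        "n_reps": n_reps,
        "n_missing": n_missing,
        "missing_rate": n_missing / n_reps,
        "bias": statistics.fmean(estimates) if estimates else float("nan"),
    }


summary = run_simulation()
print(summary)
```

Reporting `n_missing` even when it is zero lets readers confirm that the performance measure is based on all repetitions.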
- [4] arXiv:2409.18550 [pdf, other]
Title: Iterative Trace Minimization for the Reconciliation of Very Short Hierarchical Time Series
Subjects: Methodology (stat.ME)
Time series often appear in an additive hierarchical structure. In such cases, time series on higher levels are the sums of their subordinate time series. This hierarchical structure places a natural constraint on forecasts. However, univariate forecasting techniques are incapable of ensuring this forecast coherence. An obvious solution is to forecast only bottom time series and obtain higher level forecasts through aggregation. This approach is also known as the bottom-up approach. In their seminal paper, Wickramasuriya et al. (2019) propose an optimal reconciliation approach named MinT. It tries to minimize the trace of the underlying covariance matrix of all forecast errors. The MinT algorithm has demonstrated superior performance to the bottom-up and other approaches and enjoys great popularity. This paper provides a simulation study examining the performance of MinT for very short time series and larger hierarchical structures. This scenario makes the covariance estimation required by MinT difficult. A novel iterative approach is introduced which significantly reduces the number of estimated parameters. This approach is capable of improving forecast accuracy further. The application of MinTit is also demonstrated with a case study using a semiconductor dataset based on data provided by the World Semiconductor Trade Statistics (WSTS), a premier provider of semiconductor market data.
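The bottom-up approach described in the abstract can be sketched in a few lines. The two-series hierarchy (total = A + B), the summing matrix `S`, and the forecast values are illustrative assumptions; this shows only the bottom-up baseline, not MinT or the paper's iterative method.

```python
def bottom_up(bottom_forecasts, summing_matrix):
    """Aggregate bottom-level forecasts into coherent forecasts for every level."""
    return [
        sum(s * b for s, b in zip(row, bottom_forecasts))
        for row in summing_matrix
    ]


# Rows correspond to [total, A, B]; columns to the bottom series [A, B].
S = [
    [1, 1],  # total = A + B
    [1, 0],  # A
    [0, 1],  # B
]

# Independent base forecasts are incoherent here: 60 + 40 != 105.
base = {"total": 105.0, "A": 60.0, "B": 40.0}

coherent = bottom_up([base["A"], base["B"]], S)
print(coherent)  # [100.0, 60.0, 40.0]
```

The bottom-up forecasts ignore the base forecast for the total entirely, which is the information that reconciliation methods such as MinT try to use.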
- [5] arXiv:2409.18603 [pdf, other]
Title: Which depth to use to construct functional boxplots?
Subjects: Methodology (stat.ME); Applications (stat.AP)
This paper answers the question of which functional depth to use to construct a boxplot for functional data. It shows that integrated depths, e.g., the popular modified band depth, do not result in well-defined boxplots. Instead, we argue that infimal depths are the only functional depths that provide a valid construction of a functional boxplot. We also show that the properties of the boxplot are completely determined by properties of the one-dimensional depth function used in defining the infimal depth for functional data. Our claims are supported by (i) a motivating example, (ii) theoretical results concerning the properties of the boxplot, and (iii) a simulation study.
- [6] arXiv:2409.18640 [pdf, other]
Title: Time-Varying Multi-Seasonal AR Models
Subjects: Methodology (stat.ME)
We propose a seasonal AR model with time-varying parameter processes in both the regular and seasonal parameters. The model is parameterized to guarantee stability at every time point and can accommodate multiple seasonal periods. The time evolution is modeled by dynamic shrinkage processes to allow for long periods of essentially constant parameters, periods of rapid change as well as abrupt jumps. A Gibbs sampler is developed with a particle Gibbs update step for the AR parameter trajectories. The near-degeneracy of the model, caused by the dynamic shrinkage processes, is shown to pose a challenge for particle methods. To address this, a more robust, faster and accurate approximate sampler based on the extended Kalman filter is proposed. The model and the numerical effectiveness of the Gibbs sampler are investigated on simulated and real data. An application to more than a century of monthly US industrial production data shows interesting clear changes in seasonality over time, particularly during the Great Depression and the recent Covid-19 pandemic. Keywords: Bayesian inference; Extended Kalman filter; Locally stationary processes; Particle MCMC; Seasonality.
- [7] arXiv:2409.18712 [pdf, other]
Title: Computational and Numerical Properties of a Broadband Subspace-Based Likelihood Ratio Test
Journal-ref: IEEE High Performance Extreme Computing Conference, Waltham, MA, September 2024
Subjects: Methodology (stat.ME); Signal Processing (eess.SP)
This paper investigates the performance of a likelihood ratio test in combination with a polynomial subspace projection approach to detect weak transient signals in broadband array data. Based on previous empirical evidence that a likelihood ratio test is advantageously applied in a lower-dimensional subspace, we present analysis that highlights how the polynomial subspace projection whitens a crucial part of the signals, enabling a detector to operate with a shortened temporal window. This reduction in temporal correlation, together with a spatial compaction of the data, also leads to both computational and numerical advantages over a likelihood ratio test that is directly applied to the array data. The results of our analysis are illustrated by examples and simulations.
- [8] arXiv:2409.18719 [pdf, other]
Title: New flexible versions of extended generalized Pareto model for count data
Comments: 17 pages, 8 figures, 3 tables
Subjects: Methodology (stat.ME)
Accurate modeling is essential in integer-valued real phenomena, including the distribution of entire data, zero-inflated (ZI) data, and discrete exceedances. The Poisson and Negative Binomial distributions, along with their ZI variants, are considered suitable for modeling the entire data distribution, but they fail to capture the heavy tail behavior effectively alongside the bulk of the distribution. In contrast, the discrete generalized Pareto distribution (DGPD) is preferred for high threshold exceedances, but it becomes less effective for low threshold exceedances. However, in some applications, the selection of a suitable high threshold is challenging, and the asymptotic conditions required for using DGPD are not always met. To address these limitations, extended versions of DGPD are proposed. These extensions are designed to model one of three scenarios: first, the entire distribution of the data, including both bulk and tail and bypassing the threshold selection step; second, the entire distribution along with ZI; and third, the tail of the distribution for low threshold exceedances. The proposed extensions offer improved estimates across all three scenarios compared to existing models, providing more accurate and reliable results in simulation studies and real data applications.
- [9] arXiv:2409.18782 [pdf, other]
Title: Non-parametric efficient estimation of marginal structural models with multi-valued time-varying treatments
Authors: Axel Martin (1), Michele Santacatterina (1), Iván Díaz (1) ((1) Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine)
Comments: 15 pages, 1 figure, 3 tables
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Marginal structural models are a popular method for estimating causal effects in the presence of time-varying exposures. In spite of their popularity, no scalable non-parametric estimator exists for marginal structural models with multi-valued and time-varying treatments. In this paper, we use machine learning together with recent developments in semiparametric efficiency theory for longitudinal studies to propose such an estimator. The proposed estimator is based on a study of the non-parametric identifying functional, including first order von-Mises expansions as well as the efficient influence function and the efficiency bound. We show conditions under which the proposed estimator is efficient, asymptotically normal, and sequentially doubly robust in the sense that it is consistent if, for each time point, either the outcome or the treatment mechanism is consistently estimated. We perform a simulation study to illustrate the properties of the estimators, and present the results of our motivating study on a COVID-19 dataset studying the impact of mobility on the cumulative number of observed cases.
- [10] arXiv:2409.18908 [pdf, other]
Title: Inference with Sequential Monte-Carlo Computation of $p$-values: Fast and Valid Approaches
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Hypothesis tests calibrated by (re)sampling methods (such as permutation, rank and bootstrap tests) are useful tools for statistical analysis, at the computational cost of requiring Monte-Carlo sampling for calibration. It is common and almost universal practice to execute such tests with a predetermined and large number of Monte-Carlo samples, and disregard any randomness from this sampling at the time of drawing and reporting inference. At best, this approach leads to computational inefficiency, and at worst to invalid inference. That being said, a number of approaches in the literature have been proposed to adaptively guide analysts in choosing the number of Monte-Carlo samples, by sequentially deciding when to stop collecting samples and draw inference. These works introduce varying competing notions of what constitutes "valid" inference, complicating the landscape for analysts seeking suitable methodology. Furthermore, the majority of these approaches solely guarantee a meaningful estimate of the testing outcome, not the $p$-value itself, which is insufficient for many practical applications. In this paper, we survey the relevant literature, and build bridges between the scattered validity notions, highlighting some of their complementary roles. We also introduce a new practical methodology that provides an estimate of the $p$-value of the Monte-Carlo test, endowed with practically relevant validity guarantees. Moreover, our methodology is sequential, updating the $p$-value estimate after each new Monte-Carlo sample has been drawn, while retaining important validity guarantees regardless of the selected stopping time. We conclude this paper with a set of recommendations for the practitioner, both in terms of selection of methodology and manner of reporting results.
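For context, the common fixed-sample practice that the abstract contrasts with sequential approaches can be sketched as follows. This is the standard Monte-Carlo permutation $p$-value with the add-one correction, $(1 + \#\{T_b \ge T_{\text{obs}}\}) / (1 + B)$, and not the sequential methodology proposed in the paper; the test statistic, the data, and the number of permutations are illustrative assumptions.

```python
import random
import statistics


def mean_diff(x, y):
    """Test statistic: absolute difference of the two sample means."""
    return abs(statistics.fmean(x) - statistics.fmean(y))


def mc_permutation_pvalue(x, y, n_perm=999, seed=0):
    """Monte-Carlo permutation p-value with a predetermined number of samples."""
    rng = random.Random(seed)
    observed = mean_diff(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        if mean_diff(pooled[: len(x)], pooled[len(x):]) >= observed:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)


x = [10.0, 11.0, 12.0, 13.0, 14.0]
y = [0.0, 1.0, 2.0, 3.0, 4.0]
p_value = mc_permutation_pvalue(x, y)
print(p_value)
```

The returned value depends on the random permutations drawn, which is the sampling randomness that the abstract says is usually disregarded when results are reported.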
New submissions (showing 10 of 10 entries)
- [11] arXiv:2409.18321 (cross-list from stat.ML) [pdf, other]
Title: Local Prediction-Powered Inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
To infer a function value at a specific point $x$, it is essential to assign higher weights to the points closer to $x$; this approach is called local polynomial / multivariable regression. In many practical cases, a limited sample size may undermine this method, but the situation can be improved by the Prediction-Powered Inference (PPI) technique. This paper introduces a specific algorithm for local multivariable regression using PPI, which can significantly reduce the variance of estimates without enlarging the error. The confidence intervals, bias correction, and coverage probabilities are analyzed, establishing the correctness and superiority of our algorithm. Numerical simulations and real-data experiments support these conclusions. Another contribution compared to PPI is the theoretical computation efficiency and explainability obtained by taking into account the dependency of the dependent variable.
- [12] arXiv:2409.18374 (cross-list from stat.ML) [pdf, other]
Title: Adaptive Learning of the Latent Space of Wasserstein Generative Adversarial Networks
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have attracted considerable interest due to their impressive performance in many fields. However, many data such as natural images usually do not populate the ambient Euclidean space but instead reside in a lower-dimensional manifold. Thus an inappropriate choice of the latent dimension fails to uncover the structure of the data, possibly resulting in mismatch of latent representations and poor generative qualities. Towards addressing these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN) that fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network in such a way that the intrinsic dimension of the learned encoding distribution is equal to the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, and simultaneously generate high-quality synthetic data by sampling from the learned latent distribution.
- [13] arXiv:2409.18643 (cross-list from q-fin.RM) [pdf, other]
Title: Tail Risk Analysis for Financial Time Series
Comments: Book chapter to appear in the Handbook on Statistics of Extremes (Chapman & Hall / CRC)
Subjects: Risk Management (q-fin.RM); Applications (stat.AP); Methodology (stat.ME)
This book chapter illustrates how to apply extreme value statistics to financial time series data. Such data often exhibit strong serial dependence, which complicates assessment of tail risks. We discuss the two main approaches to tail risk estimation: unconditional and conditional quantile forecasting. We use the S&P 500 index as a case study to assess serial (extremal) dependence, perform an unconditional and conditional risk analysis, and apply backtesting methods. Additionally, the chapter explores the impact of serial dependence on multivariate tail dependence.
Cross submissions (showing 3 of 3 entries)
- [14] arXiv:1709.01050 (replaced) [pdf, other]
Title: Modeling Interference Via Symmetric Treatment Decomposition
Subjects: Methodology (stat.ME)
Classical causal inference assumes treatments meant for a given unit do not have an effect on other units. This assumption is violated in interference problems, where new types of spillover causal effects arise, and causal inference becomes much more difficult. In addition, interference introduces a unique complication where variables may transmit treatment influences to each other, which is a relationship that has some features of a causal one, but is symmetric.
In this paper, we develop a new approach to decomposing the spillover effect into unit-specific components that extends the DAG based treatment decomposition approach to mediation of Robins and Richardson to causal models that admit stable symmetric relationships among variables in a network. We discuss two interpretations of such models: a network structural model interpretation, and an interpretation based on equilibrium of structural equation models discussed in (Lauritzen and Richardson, 2002). We show that both interpretations yield identical identification theory, and give conditions for components of the spillover effect to be identified.
We discuss statistical inference for identified components of the spillover effect, including a maximum likelihood estimator, and a doubly robust estimator for the special case of two interacting outcomes. We verify consistency and robustness of our estimators via a simulation study, and illustrate our method by assessing the causal effect of education attainment on depressive symptoms using the data on households from the Wisconsin Longitudinal Study.
- [15] arXiv:2307.07068 (replaced) [pdf, other]
Title: Scalable Resampling in Massive Generalized Linear Models via Subsampled Residual Bootstrap
Subjects: Methodology (stat.ME)
Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple and versatile scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. We prove consistency and distributional results that establish that the SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository.
- [16] arXiv:2401.07294 (replaced) [pdf, other]
Title: Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
Metamodels, or the regression analysis of Monte Carlo simulation results, provide a powerful tool to summarize simulation findings. However, an underutilized approach is the multilevel metamodel (MLMM) that accounts for the dependent data structure that arises from fitting multiple models to the same simulated data set. In this study, we articulate the theoretical rationale for the MLMM and illustrate how it can improve the interpretability of simulation results, better account for complex simulation designs, and provide new insights into the generalizability of simulation findings.
- [17] arXiv:2401.11263 (replaced) [pdf, other]
Title: Estimating Heterogeneous Treatment Effects on Survival Outcomes Using Counterfactual Censoring Unbiased Transformations
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
Methods for estimating heterogeneous treatment effects (HTE) from observational data have largely focused on continuous or binary outcomes, with less attention paid to survival outcomes and almost none to settings with competing risks. In this work, we develop censoring unbiased transformations (CUTs) for survival outcomes both with and without competing risks. After converting time-to-event outcomes using these CUTs, direct application of HTE learners for continuous outcomes yields consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. Our CUTs enable application of a much larger set of state of the art HTE learners for censored outcomes than had previously been available, especially in competing risks settings. We provide generic model-free learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies.
- [18] arXiv:2403.00304 (replaced) [pdf, other]
Title: Coherent forecasting of NoGeAR(1) model
Subjects: Methodology (stat.ME); Statistics Theory (math.ST)
This article focuses on the coherent forecasting of the recently introduced novel geometric AR(1) (NoGeAR(1)) model, an INAR model based on the inflated-parameter binomial thinning approach. Various techniques are available to achieve h-step-ahead coherent forecasts of count time series, like median and mode forecasting. However, the body of literature addressing coherent forecasting in the context of overdispersed count time series remains limited. Here, we study the forecasting distribution corresponding to the NoGeAR(1) process using the Monte Carlo (MC) approximation method. Accordingly, several forecasting measures are employed in the simulation study to facilitate a thorough comparison of the forecasting capability of NoGeAR(1) with other models. The methodology is also demonstrated using real-life data, specifically the data on CWß TeXpert downloads and Barbados COVID-19 data.
- [19] arXiv:2404.15017 (replaced) [pdf, other]
Title: The mosaic permutation test: an exact and nonparametric goodness-of-fit test for factor models
Comments: 42 pages, 13 figures
Subjects: Methodology (stat.ME)
Financial firms often rely on fundamental factor models to explain correlations among asset returns and manage risk. Yet after major events, e.g., COVID-19, analysts may reassess whether existing risk models continue to fit well: specifically, after accounting for a set of known factor exposures, are the residuals of the asset returns independent? With this motivation, we introduce the mosaic permutation test, a nonparametric goodness-of-fit test for preexisting factor models. Our method can leverage modern machine learning techniques to detect model violations while provably controlling the false positive rate, i.e., the probability of rejecting a well-fitting model, without making asymptotic approximations or parametric assumptions. This property helps prevent analysts from unnecessarily rebuilding accurate models, which can waste resources and increase risk. To illustrate our methodology, we apply the mosaic permutation test to the BlackRock Fundamental Equity Risk (BFRE) model. Although the BFRE model generally explains the most significant correlations among assets, we find evidence of unexplained correlations among certain real estate stocks, and we show that adding new factors improves model fit. We implement our methods in the python package mosaicperm.
- [20] arXiv:2404.17734 (replaced) [pdf, other]
Title: Manipulating a Continuous Instrumental Variable in an Observational Study of Premature Babies: Algorithm, Partial Identification Bounds, and Inference under Randomization and Biased Randomization Assumptions
Subjects: Methodology (stat.ME); Applications (stat.AP)
Regionalization of intensive care for premature babies refers to a triage system of mothers with high-risk pregnancies to hospitals of varied capabilities based on risks faced by infants. Due to the limited capacity of high-level hospitals, which are equipped with advanced expertise to provide critical care, understanding the effect of delivering premature babies at such hospitals on infant mortality for different subgroups of high-risk mothers could facilitate the design of an efficient perinatal regionalization system. Towards answering this question, Baiocchi et al. (2010) proposed to strengthen an excess-travel-time-based, continuous instrumental variable (IV) in an IV-based, matched-pair design by switching focus to a smaller cohort amenable to being paired with a larger separation in the IV dose. Three elements changed with the strengthened IV: the study cohort, compliance rate and latent complier subgroup. Here, we introduce a non-bipartite, template matching algorithm that embeds data into a target, pair-randomized encouragement trial which maintains fidelity to the original study cohort while strengthening the IV. We then study randomization-based and IV-dependent, biased-randomization-based inference of partial identification bounds for the sample average treatment effect (SATE) in an IV-based matched pair design, which deviates from the usual effect ratio estimand in that the SATE is agnostic to the IV and who is matched to whom, although a strengthened IV design could narrow the partial identification bounds. Based on our proposed strengthened-IV design, we found that delivering at a high-level NICU reduced preterm babies' mortality rate compared to a low-level NICU for $81,766 \times 2 = 163,532$ mothers and their preterm babies and the effect appeared to be minimal among non-black, low-risk mothers.
- [21] arXiv:2407.09371 (replaced) [pdf, other]
Title: Computationally Efficient Estimation of Large Probit Models
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Computation (stat.CO)
Probit models are useful for modeling correlated discrete responses in many disciplines, including consumer choice data in economics and marketing. However, the Gaussian latent variable feature of probit models, coupled with identification constraints, poses significant computational challenges for estimation and inference, especially when the dimension of the discrete response variable is large. In this paper, we propose a computationally efficient Expectation-Maximization (EM) algorithm for estimating large probit models. Our work is distinct from existing methods in two important aspects. First, instead of simulation or sampling methods, we apply and customize expectation propagation (EP), a deterministic method originally proposed for approximate Bayesian inference, to estimate moments of the truncated multivariate normal (TMVN) in the E (expectation) step. Second, we take advantage of a symmetric identification condition to transform the constrained optimization problem in the M (maximization) step into a one-dimensional problem, which is solved efficiently using Newton's method instead of off-the-shelf solvers. Our method enables the analysis of correlated choice data in the presence of more than 100 alternatives, which is a reasonable size in modern applications, such as online shopping and booking platforms, but has been difficult in practice with probit models. We apply our probit estimation method to study ordering effects in hotel search results on Expedia's online booking platform.
- [22] arXiv:2212.09900 (replaced) [pdf, other]
Title: Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions.
In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.
- [23] arXiv:2401.03820 (replaced) [pdf, other]
Title: Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT); Methodology (stat.ME); Machine Learning (stat.ML)
Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We propose computationally efficient differentially private estimators and prove their minimax optimality for sub-Gaussian distributions, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, compared to the existing literature, our results accommodate a diverging rank, a broader range of signal strengths, and remain valid even when the sample size is much smaller than the dimension, provided the signal strength is sufficiently strong. Both simulation studies and real data experiments demonstrate the merits of our method.