# Statistics Theory

## New submissions

[ total of 23 entries: 1-23 ]

### New submissions for Tue, 20 Mar 18

[1]
Title: On Polyhedral Estimation of Signals via Indirect Observations
Subjects: Statistics Theory (math.ST)

We consider the problem of recovering a linear image of an unknown signal, belonging to a given convex compact signal set, from a noisy observation of another linear image of the signal. We develop a simple, generic, efficiently computable "polyhedral" estimate that is nonlinear in the observations, along with computation-friendly techniques for its design and risk analysis. We demonstrate that under favorable circumstances the resulting estimate is provably near-optimal in the minimax sense, these "favorable circumstances" being less restrictive than the weakest assumptions known so far to ensure near-optimality of estimates that are linear in the observations.

[2]
Title: Optimal Designs for the Generalized Partial Credit Model
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Analyzing ordinal data is becoming increasingly important in psychology, especially in the context of item response theory. The generalized partial credit model (GPCM) is probably the most widely used ordinal model and finds application in many large-scale educational assessment studies, such as PISA. In the present paper, optimal test designs are investigated for estimating persons' abilities with the GPCM for calibrated tests when item parameters are known from previous studies. We derive that local optimality may be achieved by assigning non-zero probability only to the first and last categories, independently of a person's ability. That is, when using such a design, the GPCM reduces to the dichotomous 2PL model. Since locally optimal designs require the true ability to be known, we consider alternative Bayesian design criteria using weight distributions over the ability parameter space. For symmetric weight distributions, we derive necessary conditions for the optimal one-point design of two response categories to be Bayes optimal. Furthermore, we discuss examples of common symmetric weight distributions and investigate in which cases the necessary conditions are also sufficient. Since the 2PL model is a special case of the GPCM, all of these results hold for the 2PL model as well.
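As an illustration (not taken from the paper), here is a minimal sketch of GPCM category probabilities under one common parameterization; with a single threshold (two response categories) the model collapses to the dichotomous 2PL, as the abstract notes. The function and parameter names are my own:

```python
import math

def gpcm_probs(theta, a, thresholds):
    # GPCM under a common parameterization:
    #   P(X = k | theta) ∝ exp( sum_{j<=k} a * (theta - b_j) ),
    # with the empty sum (k = 0) equal to 0.
    logits = [0.0]
    s = 0.0
    for b_j in thresholds:
        s += a * (theta - b_j)
        logits.append(s)
    m = max(logits)                      # stabilized softmax
    ex = [math.exp(l - m) for l in logits]
    z = sum(ex)
    return [e / z for e in ex]
```

With `thresholds=[b]` the model has two categories and `gpcm_probs(theta, a, [b])[1]` equals the 2PL item response function `1 / (1 + exp(-a * (theta - b)))`.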

[3]
Title: Signal detection via Phi-divergences for general mixtures
Authors: Marc Ditzhaus
Subjects: Statistics Theory (math.ST)

In this paper we are interested in testing whether there are any signals hidden in high-dimensional noisy data. To this end we study the family of goodness-of-fit tests based on $\Phi$-divergences, which includes the test of Berk and Jones as well as Tukey's higher criticism test. The optimality of this family is already known for the heterogeneous normal mixture model. We present a technique to transfer this optimality to more general models. For illustration, we apply our results to dense-signal and sparse-signal models, including the exponential-$\chi^2$ mixture model and general exponential families such as the normal, exponential, and Gumbel distributions. Besides the optimality of the whole family, we discuss its power behavior on the detection boundary and show that the whole family has no power there, whereas the likelihood ratio test does.
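For readers unfamiliar with the statistic, here is a minimal sketch of Tukey's higher criticism in one common formulation (the exact variant analyzed in the paper may differ): scan the standardized deviations between the empirical distribution of the p-values and the uniform:

```python
import math

def higher_criticism(pvals):
    # HC = max over i of sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    # where p_(1) <= ... <= p_(n) are the sorted p-values.
    p = sorted(pvals)
    n = len(p)
    hc = float("-inf")
    for i, p_i in enumerate(p, start=1):
        if 0.0 < p_i < 1.0:
            stat = math.sqrt(n) * (i / n - p_i) / math.sqrt(p_i * (1.0 - p_i))
            hc = max(hc, stat)
    return hc
```

A cluster of unusually small p-values drives the statistic up, which is what makes it sensitive to sparse signals.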

[4]
Title: Characterizations of the Logistic and Related Distributions
Comments: 17 pages, Journal of Mathematical Analysis and Applications (2018)
Subjects: Statistics Theory (math.ST); Probability (math.PR)

Few characterization results for the logistic distribution were available until recently, although it is similar in shape to the normal distribution, whose characteristic properties have been well investigated. Fortunately, in the last decade several authors have made great progress on this topic, and some interesting characterization results for the logistic distribution have been developed. In this paper, we provide further new results via distributional equalities in terms of order statistics of the underlying distribution and random exponential shifts. A characterization of the closely related Pareto type II distribution is also investigated.

[5]
Title: High Dimensional Linear Regression using Lattice Basis Reduction
Subjects: Statistics Theory (math.ST); Probability (math.PR); Machine Learning (stat.ML)

We consider a high-dimensional linear regression problem where the goal is to efficiently recover an unknown vector $\beta^*$ from $n$ noisy linear observations $Y=X\beta^*+W \in \mathbb{R}^n$, for known $X \in \mathbb{R}^{n \times p}$ and unknown $W \in \mathbb{R}^n$. Unlike most of the literature on this model, we make no sparsity assumption on $\beta^*$. Instead we adopt a regularization based on assuming that the underlying vectors $\beta^*$ have rational entries with the same denominator $Q \in \mathbb{Z}_{>0}$. We call this the $Q$-rationality assumption.
We propose a new polynomial-time algorithm for this task based on the seminal Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. We establish that, under the $Q$-rationality assumption, our algorithm exactly recovers the vector $\beta^*$ for a large class of distributions of the i.i.d. entries of $X$ and non-zero noise $W$. We prove that it is successful under small noise, even when the learner has access to only one observation ($n=1$). Furthermore, we prove that in the case of Gaussian white noise for $W$, $n=o\left(p/\log p\right)$, and $Q$ sufficiently large, our algorithm tolerates a nearly optimal information-theoretic level of noise.
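The LLL-based algorithm itself is beyond a short snippet, but the $Q$-rationality assumption can be illustrated with a brute-force stand-in (exponential time, for intuition only; all names are hypothetical): enumerate every candidate vector with entries $k/Q$ on a bounded grid and keep the best least-squares fit. Under small noise the true rational vector wins:

```python
import itertools
from fractions import Fraction

def recover_rational_beta(X, Y, Q, lo=-2, hi=2):
    # Exhaustive stand-in for the LLL-based search: try every beta whose
    # entries are k/Q with lo <= k/Q <= hi, keep the least-squares minimizer.
    # Exponential in p -- illustration only, not the paper's algorithm.
    p = len(X[0])
    grid = [Fraction(k, Q) for k in range(lo * Q, hi * Q + 1)]
    best, best_err = None, float("inf")
    for beta in itertools.product(grid, repeat=p):
        err = sum((y - sum(x_j * b_j for x_j, b_j in zip(x, beta))) ** 2
                  for x, y in zip(X, Y))
        if err < best_err:
            best, best_err = beta, err
    return best
```

The point of the paper is that lattice basis reduction finds this minimizer in polynomial time rather than by enumeration.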

[6]
Title: Towards "simultaneous selective inference": post-hoc bounds on the false discovery proportion
Subjects: Statistics Theory (math.ST)

Some pitfalls of the false discovery rate (FDR) as an error criterion for multiple testing of $n$ hypotheses include (a) committing to an error level $q$ in advance limits its use in exploratory data analysis, and (b) controlling the false discovery proportion (FDP) on average provides no guarantee on its variability. We take a step towards overcoming these barriers using a new perspective we call "simultaneous selective inference." Many FDR procedures (such as Benjamini-Hochberg) can be viewed as carving out a $\textit{path}$ of potential rejection sets $\varnothing = \mathcal R_0 \subseteq \mathcal R_1 \subseteq \cdots \subseteq \mathcal R_n \subseteq [n]$, assigning some algorithm-dependent estimate $\widehat{\text{FDP}}(\mathcal R_k)$ to each one. Then, they choose $k^* = \max\{k: \widehat{\text{FDP}}(\mathcal R_k) \leq q\}$. We prove that for all these algorithms, given independent null p-values and a confidence level $\alpha$, either the same $\widehat{\text{FDP}}$ or a minor variant thereof bounds the unknown FDP to within a small explicit (algorithm-dependent) constant factor $c_{\text{alg}}(\alpha)$, uniformly across the entire path, with probability $1-\alpha$. Our bounds open up a middle ground between fully simultaneous inference (guarantees for all $2^n$ possible rejection sets) and fully selective inference (guarantees only for $\mathcal R_{k^*}$). They allow the scientist to $\textit{spot}$ one or more suitable rejection sets (Select Post-hoc On the algorithm's Trajectory) by picking data-dependent sizes or error levels, after examining the entire path of $\widehat{\text{FDP}}(\mathcal R_k)$ and the uniform upper band on $\text{FDP}$. The price for this additional flexibility is small: for example, the multiplier for BH corresponding to 95% confidence is approximately 2.
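A small sketch (not from the paper) of the path view of Benjamini-Hochberg: $\mathcal R_k$ rejects the $k$ smallest p-values, $\widehat{\text{FDP}}(\mathcal R_k) = n\,p_{(k)}/k$, and BH stops at the largest $k$ with $\widehat{\text{FDP}}(\mathcal R_k) \leq q$:

```python
def bh_path(pvals, q):
    # Benjamini-Hochberg as a path of nested rejection sets:
    # R_k = indices of the k smallest p-values, FDPhat(R_k) = n * p_(k) / k,
    # k* = max{k : FDPhat(R_k) <= q}.
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    fdp_hat = [n * pvals[order[k - 1]] / k for k in range(1, n + 1)]
    k_star = max((k for k in range(1, n + 1) if fdp_hat[k - 1] <= q), default=0)
    rejected = set(order[:k_star])
    return fdp_hat, k_star, rejected
```

The paper's point is that the whole `fdp_hat` path (times a small constant) is a simultaneous upper band on the true FDP, so one may "spot" any $\mathcal R_k$ along it, not just $\mathcal R_{k^*}$.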

[7]
Title: Auxiliary information: the raking-ratio empirical process
Subjects: Statistics Theory (math.ST)

We study the empirical measure associated to a sample of size $n$ and modified by $N$ iterations of the raking-ratio method. The empirical measure is adjusted to match the true probabilities of the sets in a finite partition, which changes at each step. We establish asymptotic properties of the raking-ratio empirical process indexed by functions as $n\rightarrow +\infty$, for fixed $N$. A closed-form expression for the limiting covariance matrices is derived as $N\rightarrow +\infty$. The nonasymptotic Gaussian approximation we use also yields uniform Berry-Esseen-type bounds in $n$ and $N$, and sharp estimates of the uniform quadratic risk reduction. In the two-way contingency table case, the formulas characterizing the limiting process are very simple.
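For intuition, a hedged sketch of the raking-ratio iteration (iterative proportional fitting of sample weights to known partition probabilities); the data layout and names are my own:

```python
def rake(weights, partitions, targets, iterations):
    # One raking-ratio step per (partition, target) pair, cycled `iterations`
    # times: rescale the weights in each cell of the partition so that the
    # cell totals match the known true probabilities.
    w = list(weights)
    for _ in range(iterations):
        for part, tgt in zip(partitions, targets):
            for cell, prob in zip(part, tgt):
                total = sum(w[i] for i in cell)
                if total > 0:
                    for i in cell:
                        w[i] *= prob / total
    return w
```

Each pass enforces exact agreement with the current partition's probabilities, possibly perturbing the previous one; cycling the partitions is what the abstract's $N$ iterations refer to.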

[8]
Title: A modern maximum-likelihood theory for high-dimensional logistic regression
Comments: 25 pages, 12 figures, 4 tables
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Every student in statistics or data science learns early on that when the sample size greatly exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates, which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large-sample asymptotics, we are often told that we are on reasonably safe ground when $n$ is large in such a way that $n \ge 5p$ or $n \ge 10p$. This paper shows that this is absolutely not the case. Consequently, inferences routinely produced by common software packages are unreliable.
Consider a logistic model with independent features in which $n$ and $p$ become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic, as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory which asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We also demonstrate empirically that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure, which learns a single parameter from the data and produces accurate inference.

[9]
Title: Numerical Integration on Graphs: where to sample and how to weigh
Subjects: Statistics Theory (math.ST); Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Let $G=(V,E,w)$ be a finite, connected graph with weighted edges. We are interested in the problem of finding a subset $W \subset V$ of vertices and weights $a_w$ such that $$\frac{1}{|V|}\sum_{v \in V}{f(v)} \sim \sum_{w \in W}{a_w f(w)}$$ for functions $f:V \rightarrow \mathbb{R}$ that are "smooth" with respect to the geometry of the graph. The main applications are problems where $f$ is known to depend somehow on the underlying graph but is expensive to evaluate on even a single vertex. We prove an inequality showing that the integration problem can be rewritten as a geometric problem (the optimal packing of "heat balls"). We discuss how one would construct approximate solutions of the heat-ball packing problem; numerical examples demonstrate the efficiency of the method.
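A hedged sketch of the general idea, using greedy farthest-point packing as a crude stand-in for optimal heat-ball packing, with Voronoi-cell sizes as quadrature weights (this is not the paper's construction, only an illustration on an unweighted graph):

```python
from collections import deque

def bfs_dist(adj, src):
    # Hop distances from src on an unweighted adjacency-list graph.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def packing_quadrature(adj, k):
    # Greedy farthest-point selection of k sample vertices, then weight
    # each by the fraction of vertices in its nearest-center Voronoi cell.
    verts = list(adj)
    centers = [verts[0]]
    while len(centers) < k:
        gaps = [min(bfs_dist(adj, c).get(v, 0) for c in centers) for v in verts]
        centers.append(verts[gaps.index(max(gaps))])
    counts = {c: 0 for c in centers}
    for v in verts:
        nearest = min(centers, key=lambda c: bfs_dist(adj, c).get(v, len(verts)))
        counts[nearest] += 1
    n = len(verts)
    return [(c, counts[c] / n) for c in centers]
```

On a path graph the procedure spreads the sample vertices out and assigns them balanced weights, mimicking a packing of well-separated balls.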

[10]
Title: Optimal link prediction with matrix logistic regression
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

We consider the problem of link prediction, based on partial observation of a large network, and on side information associated to its vertices. The generative model is formulated as a matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius-norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any (randomised) algorithm computable in polynomial time under a computational complexity assumption.

### Cross-lists for Tue, 20 Mar 18

[11]  arXiv:1803.06505 (cross-list from stat.CO) [pdf, ps, other]
Title: A simulated annealing procedure based on the ABC Shadow algorithm for statistical inference of point processes
Subjects: Computation (stat.CO); Statistics Theory (math.ST)

Recently, a new algorithm for sampling posteriors of unnormalised probability densities, called ABC Shadow, was proposed in [8]. This talk introduces a global optimisation procedure based on the ABC Shadow simulation dynamics. First the general method is explained, and then results on simulated and real data are presented. The method is rather general, in the sense that it applies to probability densities that are continuously differentiable with respect to their parameters.
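The ABC Shadow dynamics are specific to the paper, but the simulated-annealing loop wrapped around them can be sketched generically (Metropolis acceptance with a cooling schedule; all names here are mine, and the inner sampler is just a plain proposal function):

```python
import math
import random

def simulated_annealing(energy, proposal, x0, temps, rng=random.Random(0)):
    # Generic simulated annealing: propose a move, accept it with the
    # Metropolis probability exp(-(E_new - E_old) / T), and cool T along
    # the schedule `temps`. In the paper's procedure the proposal would be
    # driven by the ABC Shadow dynamics instead of a simple random walk.
    x, e = x0, energy(x0)
    for t in temps:
        y = proposal(x, rng)
        ey = energy(y)
        if ey <= e or rng.random() < math.exp(-(ey - e) / t):
            x, e = y, ey
    return x, e
```

With a geometric cooling schedule the chain first explores, then settles near a minimiser of the energy (here, a negative log-posterior).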

[12]  arXiv:1803.06510 (cross-list from stat.ML) [pdf, other]
Title: Hidden Integrality of SDP Relaxation for Sub-Gaussian Mixture Models
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)

We consider the problem of estimating the discrete clustering structures under Sub-Gaussian Mixture Models. Our main results establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded in terms of the error of an idealized integer program. The error of the integer program, and hence that of the SDP, is further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models, and our results reveal the "global-to-local" mechanism that drives the performance of the SDP relaxation.
A corollary of our results shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our results establish sufficient conditions for the SDP to correctly recover the cluster memberships of $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. As a special case, we show that under the $d$-dimensional Stochastic Ball Model, SDP achieves non-trivial (sometimes exact) recovery when the center separation is as small as $\sqrt{1/d}$, which complements previous exact recovery results that require constant separation.

[13]  arXiv:1803.06675 (cross-list from stat.ME) [pdf, other]
Title: Rare Feature Selection in High Dimensions
Comments: 32 pages, 9 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)

It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity.
We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.
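A toy sketch of the tree-guided aggregation idea (not the rare package's actual estimator, which fits a penalised regression): merge sibling features into their parent node whenever all of them are rare, producing a denser column:

```python
def aggregate_rare_features(columns, tree, min_count):
    # columns: dict feature -> list of counts per sample.
    # tree: dict parent -> list of leaf features under that parent.
    # If every leaf under a parent is rare (total count < min_count),
    # replace the leaves by a single summed column named after the parent.
    merged = dict(columns)
    for parent, leaves in tree.items():
        if all(sum(merged[leaf]) < min_count for leaf in leaves):
            summed = [sum(vals) for vals in zip(*(merged[leaf] for leaf in leaves))]
            for leaf in leaves:
                merged.pop(leaf)
            merged[parent] = summed
    return merged
```

The tree encodes which features are similar enough to pool, e.g. rare words sharing a stem, or rare species sharing a genus.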

[14]  arXiv:1803.06727 (cross-list from cs.LG) [pdf, other]
Title: Aggregating Strategies for Long-term Forecasting
Comments: 20 pages, 4 figures
Subjects: Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)

The article investigates the application of aggregating algorithms to the problem of long-term forecasting. We examine the classic aggregating algorithms based on exponential reweighting. For Vovk's general aggregating algorithm, we provide a generalization for long-term forecasting. For the special basic case of Vovk's algorithm, we provide two modifications for long-term forecasting. The first is theoretically close to an optimal algorithm and is based on the replication of independent copies. It provides a time-independent regret bound with respect to the best expert in the pool. The second is not optimal but is more practical and has an $O(\sqrt{T})$ regret bound, where $T$ is the length of the game.
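A minimal sketch of the classic exponential-reweighting aggregation the paper builds on (squared loss and a fixed learning rate; this is the textbook scheme, not the paper's long-term modifications):

```python
import math

def exp_weights(expert_preds, outcomes, eta):
    # Exponentially reweighted aggregation: each round, predict the
    # weighted mean of the expert forecasts, then multiply each expert's
    # weight by exp(-eta * loss) for its squared loss on the outcome.
    k = len(expert_preds[0])
    w = [1.0] * k
    preds = []
    for t, y in enumerate(outcomes):
        total = sum(w)
        preds.append(sum(wi * p for wi, p in zip(w, expert_preds[t])) / total)
        w = [wi * math.exp(-eta * (p - y) ** 2)
             for wi, p in zip(w, expert_preds[t])]
    return preds, w
```

Experts with smaller cumulative loss dominate the weights, so the aggregated forecast tracks the best expert in the pool.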

[15]  arXiv:1803.06971 (cross-list from stat.ML) [pdf, other]
Title: What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
Authors: Lilian Besson (IETR), Emilie Kaufmann (SEQUEL, CNRS)
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Statistics Theory (math.ST)

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (minimax) bounds in $R_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R_T = O(\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R_T = O(\log T)$, and are close to conserving bounds in $R_T = O(\sqrt{T})$.
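A hedged sketch of the doubling trick itself: restart a horizon-tuned algorithm on a growing sequence of horizons, discarding its state at each restart (the interfaces `make_algo` and `play_round` are hypothetical placeholders for a bandit algorithm and one round of interaction):

```python
def doubling_trick(make_algo, horizons, total_T, play_round):
    # Run a fresh instance of a horizon-dependent algorithm for each
    # horizon T_0 < T_1 < ... (e.g. geometric: T_i = T_0 * 2**i),
    # forgetting everything at each restart, until total_T rounds played.
    t = 0
    rewards = []
    for T_i in horizons:
        algo = make_algo(T_i)               # instance tuned for horizon T_i
        for _ in range(min(T_i, total_T - t)):
            rewards.append(play_round(algo))
            t += 1
        if t >= total_T:
            break
    return rewards
```

The paper's question is which regret guarantees of the inner algorithm survive this restarting scheme, depending on how fast the `horizons` sequence grows.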

### Replacements for Tue, 20 Mar 18

[16]  arXiv:1704.06762 (replaced) [pdf, ps, other]
Title: Estimation for multiplicative models under multinomial sampling
Authors: Antonio Forcina
Subjects: Statistics Theory (math.ST)
[17]  arXiv:1801.09269 (replaced) [pdf, ps, other]
Title: Wasserstein Riemannian Geometry of Positive Definite Matrices
Comments: Second version. Submitted
Subjects: Statistics Theory (math.ST)
[18]  arXiv:1802.00982 (replaced) [pdf, ps, other]
Title: Stochastic integral with respect to the mixed fractional Brownian motion and drift estimation of the mixed fractional Ornstein-Uhlenbeck process
Subjects: Statistics Theory (math.ST)
[19]  arXiv:1606.01200 (replaced) [pdf, ps, other]
Title: Simple and Honest Confidence Intervals in Nonparametric Regression
Subjects: Applications (stat.AP); Statistics Theory (math.ST)
[20]  arXiv:1608.02199 (replaced) [pdf, other]
Title: An EM algorithm for absolutely continuous Marshall-Olkin bivariate Pareto distribution with location and scale
Comments: 17 pages, 4 figures
Subjects: Computation (stat.CO); Statistics Theory (math.ST)
[21]  arXiv:1704.07531 (replaced) [pdf, other]
Title: Sufficient Markov Decision Processes with Alternating Deep Neural Networks
Comments: 31 pages, 3 figures, extended abstract in the proceedings of RLDM2017. (v2 revisions: Fixed a minor bug in the code w.r.t. setting seed, as a result numbers in the simulation experiments had some slight changes, but conclusions stayed the same. Corrected typos. Improved notations.)
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
[22]  arXiv:1710.03830 (replaced) [pdf, other]
Title: Inference on Auctions with Weak Assumptions on Information
Subjects: Econometrics (econ.EM); Computer Science and Game Theory (cs.GT); Learning (cs.LG); Statistics Theory (math.ST)
[23]  arXiv:1802.08667 (replaced) [pdf, ps, other]
Title: Double/De-Biased Machine Learning Using Regularized Riesz Representers
Comments: 15 pages; fixed several typos + updated references
Subjects: Machine Learning (stat.ML); Econometrics (econ.EM); Statistics Theory (math.ST)
