
Statistics

New submissions

[ total of 116 entries: 1-116 ]

New submissions for Tue, 20 Mar 18

[1]  arXiv:1803.06344 [pdf, other]
Title: A Multi-Scheme Ensemble Using Coopetitive Soft-Gating With Application to Power Forecasting for Renewable Energy Generation
Subjects: Applications (stat.AP); Machine Learning (stat.ML)

In this article, we propose a novel ensemble technique with a multi-scheme weighting based on a technique called coopetitive soft gating. This technique combines both ensemble member competition and cooperation in order to maximize the overall forecasting accuracy of the ensemble. The proposed algorithm combines the ideas of multiple ensemble paradigms (power forecasting model ensemble, weather forecasting model ensemble, and lagged ensemble) in a hierarchical structure. The technique is designed to be used in a flexible manner on single and multiple weather forecasting models, and for a variety of lead times. We compare the technique to other power forecasting models and ensemble techniques with a flexible number of weather forecasting models, which can have the same or varying forecasting horizons. It is shown that the model is able to outperform those models on a number of publicly available data sets. The article closes with a discussion of properties of the proposed model which are relevant to its application.
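
To fix intuition, here is a minimal numpy sketch of a soft-gating weighting of ensemble members, assuming the weights are a softmax of negative historical errors; the paper's coopetitive scheme is hierarchical and multi-scheme, so this is only the basic building block, and the gating-strength parameter `eta` is a hypothetical name.

    import numpy as np

    def soft_gating_weights(errors, eta=1.0):
        """Map historical forecast errors of ensemble members to weights.

        Members with lower error receive exponentially larger weight
        (competition); every member keeps non-zero weight (cooperation).
        """
        scores = -eta * np.asarray(errors, dtype=float)
        w = np.exp(scores - scores.max())      # numerically stable softmax
        return w / w.sum()

    # Toy usage: three member forecasts and their recent mean absolute errors.
    forecasts = np.array([102.0, 98.5, 110.0])   # hypothetical MW values
    errors = np.array([4.2, 3.1, 9.8])
    w = soft_gating_weights(errors, eta=0.5)
    print(w, float(w @ forecasts))               # weighted ensemble forecast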

[2]  arXiv:1803.06365 [pdf, ps, other]
Title: Inference for case-control studies with incident and prevalent cases
Comments: 19 pages, 4 figures
Subjects: Methodology (stat.ME)

We propose and study a fully efficient method to estimate associations of an exposure with disease incidence when both incident cases and prevalent cases, i.e. individuals who were diagnosed with the disease at some prior time point and are alive at the time of sampling, are included in a case-control study. We extend the exponential tilting model for the relationship between exposure and case status to accommodate two case groups, and correct for the survival bias in the prevalent cases through a tilting term that depends on the parametric distribution of the backward time, i.e. the time from disease diagnosis to study enrollment. We construct an empirical likelihood that also incorporates the observed backward times for prevalent cases, obtain efficient estimates of odds ratio parameters that relate exposure to disease incidence, and propose a likelihood ratio test for model parameters that has a standard chi-squared distribution. We quantify the changes in efficiency of association parameters when incident cases are supplemented with, or replaced by, prevalent cases in simulations. We illustrate our methods by estimating associations of single nucleotide polymorphisms (SNPs) with breast cancer incidence in a sample of controls and incident and prevalent cases from the U.S. Radiologic Technologists Health Study.

[3]  arXiv:1803.06383 [pdf, ps, other]
Title: A prediction criterion for working correlation structure selection in GEE
Comments: 28 pages, 1 figure, 20 tables
Subjects: Methodology (stat.ME)

Generalized estimating equations (GEE) is one of the most commonly used methods for marginal regression analysis of longitudinal data, especially with discrete outcomes. The GEE method models the association among the responses of a subject through a working correlation matrix, and correct specification of the working correlation structure ensures efficient estimation of the regression parameters. This study proposes a predicted residual sum of squares (PRESS) statistic as a working correlation selection criterion in GEE. An extensive simulation study is designed to assess the performance of the proposed GEE PRESS criterion and to compare it with well-known existing criteria in the literature. The results show that the GEE PRESS criterion performs better than the weighted error sum of squares SC criterion in all cases and is comparable with other existing criteria when the true working correlation structure is AR(1) or exchangeable. Lastly, the working correlation selection criteria are illustrated with the Coronary Artery Risk Development in Young Adults study.
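
For readers unfamiliar with PRESS, here is the classical leave-one-out version for an ordinary linear model, computed via the hat-matrix identity; the GEE criterion in the paper adapts this idea to correlated responses, so treat this sketch as background intuition rather than the proposed statistic.

    import numpy as np

    def press_statistic(X, y):
        """Leave-one-out predicted residual sum of squares for OLS.

        Uses the identity e_(i) = e_i / (1 - h_ii), so no refitting is needed.
        """
        X = np.column_stack([np.ones(len(y)), X])     # add intercept
        H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix
        resid = y - H @ y
        loo_resid = resid / (1.0 - np.diag(H))
        return float(np.sum(loo_resid ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
    print(press_statistic(X, y))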

[4]  arXiv:1803.06387 [pdf, other]
Title: Improving the efficiency and robustness of nested sampling using posterior repartitioning
Subjects: Computation (stat.CO)

In real-world Bayesian inference applications, prior assumptions regarding the parameters of interest may be unrepresentative of their actual values for a given dataset. In particular, if the likelihood is concentrated far out in the wings of the assumed prior distribution, this can lead to extremely inefficient exploration of the resulting posterior by nested sampling algorithms, with unnecessarily high associated computational costs. Simple solutions such as broadening the prior range in such cases might not be appropriate or possible in real-world applications, for example when one wishes to assume a single standardised prior across the analysis of a large number of datasets for which the true values of the parameters of interest may vary. This work therefore introduces a posterior repartitioning (PR) method for nested sampling algorithms, which addresses the problem by redefining the likelihood and prior while keeping their product fixed, so that the posterior inferences remain unchanged but the efficiency of the nested sampling process is significantly increased. Numerical results show that the PR method provides a simple yet powerful refinement for nested sampling algorithms to address the issue of unrepresentative priors.
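
The core identity behind posterior repartitioning is easy to state: the likelihood and prior may be redefined however one likes as long as their product is unchanged. A minimal sketch under an assumed one-dimensional Gaussian setup (not the paper's implementation), in which a power `beta` of the prior is retained and the remainder is absorbed into the likelihood:

    import numpy as np
    from scipy import stats

    prior = stats.norm(0.0, 10.0)            # broad, possibly unrepresentative
    def log_like(theta):                     # likelihood far out in the wings
        return stats.norm(25.0, 0.5).logpdf(theta)

    beta = 0.3                               # repartitioning strength in [0, 1]
    def log_prior_new(theta):                # softened prior (unnormalised)
        return beta * prior.logpdf(theta)

    def log_like_new(theta):                 # absorbs the leftover prior factor
        return log_like(theta) + (1.0 - beta) * prior.logpdf(theta)

    theta = 24.0
    lhs = log_like(theta) + prior.logpdf(theta)
    rhs = log_like_new(theta) + log_prior_new(theta)
    assert np.isclose(lhs, rhs)              # posterior mass is unchanged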

[5]  arXiv:1803.06393 [pdf, other]
Title: Phylogeny-based tumor subclone identification using a Bayesian feature allocation model
Comments: 35 pages, 11 figures
Subjects: Applications (stat.AP); Tissues and Organs (q-bio.TO)

Tumor cells acquire different genetic alterations during the course of evolution in cancer patients. As a result of competition and selection, only a few subgroups of cells with distinct genotypes survive. These subgroups of cells are often referred to as subclones. In recent years, many statistical and computational methods have been developed to identify tumor subclones, leading to biologically significant discoveries and shedding light on tumor progression, metastasis, drug resistance and other processes. However, most existing methods are either not able to infer the phylogenetic structure among subclones, or not able to incorporate copy number variations (CNV). In this article, we propose SIFA (tumor Subclone Identification by Feature Allocation), a Bayesian model which takes into account both CNV and tumor phylogeny structure to infer tumor subclones. We compare the performance of SIFA with two other commonly used methods using simulation studies with varying sequencing depth, evolutionary tree size, and tree complexity. SIFA consistently yields better results in terms of Rand Index and cellularity estimation accuracy. The usefulness of SIFA is also demonstrated through its application to whole genome sequencing (WGS) samples from four patients in a breast cancer study.

[6]  arXiv:1803.06446 [pdf, ps, other]
Title: On Polyhedral Estimation of Signals via Indirect Observations
Subjects: Statistics Theory (math.ST)

We consider the problem of recovering a linear image of an unknown signal belonging to a given convex compact signal set from a noisy observation of another linear image of the signal. We develop a simple, generic, efficiently computable "polyhedral" estimate, nonlinear in the observations, along with computation-friendly techniques for its design and risk analysis. We demonstrate that under favorable circumstances the resulting estimate is provably near-optimal in the minimax sense, the "favorable circumstances" being less restrictive than the weakest assumptions known so far to ensure near-optimality of estimates which are linear in observations.

[7]  arXiv:1803.06505 [pdf, ps, other]
Title: A simulated annealing procedure based on the ABC Shadow algorithm for statistical inference of point processes
Subjects: Computation (stat.CO); Statistics Theory (math.ST)

Recently, a new algorithm for sampling posteriors of unnormalised probability densities, called ABC Shadow, was proposed in [8]. This talk introduces a global optimisation procedure based on the ABC Shadow simulation dynamics. First the general method is explained, and then results on simulated and real data are presented. The method is rather general, in the sense that it applies to probability densities that are continuously differentiable with respect to their parameters.

[8]  arXiv:1803.06510 [pdf, other]
Title: Hidden Integrality of SDP Relaxation for Sub-Gaussian Mixture Models
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)

We consider the problem of estimating the discrete clustering structures under Sub-Gaussian Mixture Models. Our main results establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solutions to the SDP are not integer-valued in general, their estimation errors can be upper bounded in terms of the error of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signal-to-noise ratio. To the best of our knowledge, this is the first exponentially decaying error bound for convex relaxations of mixture models, and our results reveal the "global-to-local" mechanism that drives the performance of the SDP relaxation.
A corollary of our results shows that in certain regimes the SDP solutions are in fact integral and exact, improving on existing exact recovery results for convex relaxations. More generally, our results establish sufficient conditions for the SDP to correctly recover the cluster memberships of $(1-\delta)$ fraction of the points for any $\delta\in(0,1)$. As a special case, we show that under the $d$-dimensional Stochastic Ball Model, SDP achieves non-trivial (sometimes exact) recovery when the center separation is as small as $\sqrt{1/d}$, which complements previous exact recovery results that require constant separation.

[9]  arXiv:1803.06517 [pdf, other]
Title: Optimal Designs for the Generalized Partial Credit Model
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Analyzing ordinal data becomes increasingly important in psychology, especially in the context of item response theory. The generalized partial credit model (GPCM) is probably the most widely used ordinal model and finds application in many large-scale educational assessment studies such as PISA. In the present paper, optimal test designs are investigated for estimating persons' abilities with the GPCM for calibrated tests when item parameters are known from previous studies. We show that local optimality may be achieved by assigning non-zero probability only to the first and last categories, independently of a person's ability. That is, when using such a design, the GPCM reduces to the dichotomous 2PL model. Since locally optimal designs require the true ability to be known, we consider alternative Bayesian design criteria using weight distributions over the ability parameter space. For symmetric weight distributions, we derive necessary conditions for the optimal one-point design of two response categories to be Bayes optimal. Furthermore, we discuss examples of common symmetric weight distributions and investigate in which cases the necessary conditions are also sufficient. Since the 2PL model is a special case of the GPCM, all of these results hold for the 2PL model as well.

[10]  arXiv:1803.06518 [pdf, other]
Title: Provable Convex Co-clustering of Tensors
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and is both computationally and storage efficient. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.

[11]  arXiv:1803.06519 [pdf, other]
Title: Signal detection via Phi-divergences for general mixtures
Authors: Marc Ditzhaus
Subjects: Statistics Theory (math.ST)

In this paper we are interested in testing whether there are any signals hidden in high-dimensional noise data. To this end we study the family of goodness-of-fit tests based on $\Phi$-divergences, including the test of Berk and Jones as well as Tukey's higher criticism test. The optimality of this family is already known for the heterogeneous normal mixture model. We present a technique to transfer this optimality to more general models. For illustration we apply our results to dense signal and sparse signal models, including the exponential-$\chi^2$ mixture model and general exponential families such as the normal, exponential and Gumbel distributions. Besides the optimality of the whole family, we discuss the power behavior on the detection boundary and show that the whole family has no power there, whereas the likelihood ratio test does.

[12]  arXiv:1803.06536 [pdf, other]
Title: Optimal Design of Nonlinear Multifactor Experiments
Subjects: Computation (stat.CO); Applications (stat.AP)

Most chemical and biological experiments involve multiple treatment factors. Often, it is convenient to fit a nonlinear model in these factors. This nonlinear model can be mechanistic, empirical or a hybrid of the two, as long as it describes the unknown mechanism behind the response surface. Motivated by experiments in chemical engineering, we focus on D-optimal design for multifactor nonlinear response surfaces in general. In order to find and study optimal designs, we first implement conventional point (coordinate) exchange algorithms. Moreover, we develop a novel multiphase optimisation method to construct D-optimal designs with improved properties that are illustrated in our results. The benefits of this method are demonstrated by application to two experiments. For each case, highly efficient and feasible designs are found and tailored to the nonlinear regression models. Such designs can be easily used for comprehensive mechanism studies and are shown to be considerably more informative than standard designs.

[13]  arXiv:1803.06550 [pdf, other]
Title: On Mahalanobis distance in functional settings
Comments: 27 pages, 4 figures, 3 tables
Subjects: Methodology (stat.ME)

Mahalanobis distance is a classical tool in multivariate analysis. We suggest here an extension of this concept to the case of functional data. More precisely, the proposed definition concerns those statistical problems where the sample data are real functions defined on a compact interval of the real line. The obvious difficulty for such a functional extension is the non-invertibility of the covariance operator in infinite-dimensional cases. Unlike other recent proposals, our definition is suggested and motivated in terms of the Reproducing Kernel Hilbert Space (RKHS) associated with the stochastic process that generates the data. The proposed distance is a true metric; it depends on a unique real smoothing parameter which is fully motivated in RKHS terms. Moreover, it shares some properties of its finite dimensional counterpart: it is invariant under isometries, it can be consistently estimated from the data and its sampling distribution is known under Gaussian models. An empirical study for two statistical applications, outliers detection and binary classification, is included. The obtained results are quite competitive when compared to other recent proposals of the literature.

[14]  arXiv:1803.06557 [pdf, other]
Title: Consistent estimation of treatment effects under heterogeneous heteroskedasticity
Comments: 37 pages, 27 figures
Subjects: Methodology (stat.ME)

The empirical literature on program evaluation limits its scope almost exclusively to models where treatment effects are homogeneous for observationally identical individuals. This paper considers a treatment effect model in which treatment effects may be heterogeneous, even among observationally identical individuals. Specifically, extending the classical instrumental variables (IV) model with an endogenous binary treatment and a binary instrument, we allow the heteroskedasticity of the error disturbance to also depend upon the treatment variable so that treatment has both mean and variance effects on the outcome. In this endogenous heteroskedasticity IV (EHIV) model with heterogeneous individual treatment effects, the standard IV estimator can be inconsistent and lead to incorrect inference. After showing identification of the mean and variance treatment effects in a nonparametric version of the EHIV model, we provide closed-form estimators for the linear EHIV for the mean and variance treatment effects and the individual treatment effects (ITE). Asymptotic properties of the estimators are provided. A Monte Carlo simulation investigates the performance of the proposed approach, and an empirical application regarding the effects of fertility on female labor supply is considered.

[15]  arXiv:1803.06571 [pdf, ps, other]
Title: Orthogonal Representations for Output System Pairs
Comments: Work done in 2000. Minor revision 2001
Subjects: Methodology (stat.ME)

A new class of canonical forms is proposed in which $(A, C)$ is in Hessenberg observer or Schur form and output normal: $\bf{I} - A^*A =C^*C$. Here, $C$ is the $d \times n$ measurement matrix and $A$ is the advance matrix. The $(C, A)$ stack is expressed as the product of $n$ orthogonal matrices, each of which depends on $d$ parameters. State updates require only ${\cal O}(nd)$ operations and derivatives of the system with respect to the parameters are fast and convenient to compute. Restrictions are given such that these models are generically identifiable. Since the observability Gramian is the identity matrix, system identification is better conditioned than for other classes of models with fast updates.

[16]  arXiv:1803.06578 [pdf, other]
Title: A two-stage estimation procedure for non-linear structural equation models
Subjects: Methodology (stat.ME)

Applications of structural equation models (SEMs) are often restricted to linear associations between variables. Maximum likelihood (ML) estimation in non-linear models may be complex and require numerical integration. Furthermore, ML inference is sensitive to distributional assumptions. In this paper we introduce a simple two-stage estimation technique for estimating non-linear associations between latent variables. Both steps are based on fitting linear SEMs: first a linear model is fitted to data on the latent predictor and terms describing the non-linear effect are predicted by their conditional means. In the second step, the predictions are included in a linear model for the latent outcome variable. We show that this procedure is consistent and derive its asymptotic distribution. We also illustrate how this framework easily allows the association between latent variables to be modeled using restricted cubic splines, and we develop a modified estimator which is robust to non-normality of the latent predictor. In a simulation study we compare the proposed method to ML analysis and a simple two-stage least squares technique.

[17]  arXiv:1803.06620 [pdf, ps, other]
Title: Characterizations of the Logistic and Related Distributions
Comments: 17 pages, Journal of Mathematical Analysis and Applications (2018)
Subjects: Statistics Theory (math.ST); Probability (math.PR)

Few characterization results for the logistic distribution were available until recently, although it is similar in shape to the normal distribution, whose characteristic properties have been well investigated. Fortunately, in the last decade, several authors have made great progress on this topic, and some interesting characterization results of the logistic distribution have been developed. In this paper, we provide further new results via distributional equalities in terms of order statistics of the underlying distribution and random exponential shifts. The characterization of the closely related Pareto type II distribution is also investigated.

[18]  arXiv:1803.06645 [pdf, other]
Title: Approximating the Likelihood in Approximate Bayesian Computation
Subjects: Computation (stat.CO); Methodology (stat.ME)

This chapter will appear in the forthcoming Handbook of Approximate Bayesian Computation (2018).
The conceptual and methodological framework that underpins approximate Bayesian computation (ABC) is targeted primarily towards problems in which the likelihood is either challenging or missing. ABC uses a simulation-based non-parametric estimate of the likelihood of a summary statistic and assumes that the generation of data from the model is computationally cheap. This chapter reviews two alternative approaches for estimating the intractable likelihood, with the goal of reducing the number of model simulations needed to produce an approximate posterior. The first of these is a Bayesian version of the synthetic likelihood (SL), initially developed by Wood (2010), which uses a multivariate normal approximation to the summary statistic likelihood. Using this parametric approximation as opposed to the non-parametric approximation of ABC, it is possible to reduce the number of model simulations required. The second likelihood approximation method we consider in this chapter is based on the empirical likelihood (EL), which is a non-parametric technique and involves maximising a likelihood constructed empirically under a set of moment constraints. Mengersen et al. (2013) adapt the EL framework so that it can be used to form an approximate posterior for problems where ABC can be applied, that is, for models with intractable likelihoods. However, unlike ABC and the Bayesian SL (BSL), the Bayesian EL (BCel) approach can be used to completely avoid model simulations in some cases. The BSL and BCel methods are illustrated on models of varying complexity.
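
As a point of reference, here is a minimal sketch of the Gaussian synthetic-likelihood evaluation that BSL builds on; `simulate` and `summarise` are hypothetical user-supplied callables, and a real implementation would embed this evaluation inside an MCMC loop over theta.

    import numpy as np
    from scipy.stats import multivariate_normal

    def synthetic_loglik(theta, simulate, summarise, s_obs, m=200, seed=0):
        """Gaussian synthetic log-likelihood of an observed summary statistic.

        m model simulations at theta estimate the mean and covariance of the
        summary distribution; the observed summary is scored under that normal.
        """
        rng = np.random.default_rng(seed)
        S = np.array([summarise(simulate(theta, rng)) for _ in range(m)])
        mu, Sigma = S.mean(axis=0), np.cov(S, rowvar=False)
        return multivariate_normal(mu, Sigma, allow_singular=True).logpdf(s_obs)

    # Toy example: data ~ N(theta, 1), summaries = (mean, variance).
    simulate = lambda theta, rng: rng.normal(theta, 1.0, size=100)
    summarise = lambda x: np.array([x.mean(), x.var()])
    s_obs = summarise(np.random.default_rng(1).normal(2.0, 1.0, size=100))
    print(synthetic_loglik(2.0, simulate, summarise, s_obs))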

[19]  arXiv:1803.06669 [pdf, other]
Title: Testing for equal correlation matrices with application to paired gene expression data
Comments: 32 pages, 3 figures
Subjects: Methodology (stat.ME)

We present a novel method for testing the hypothesis of equality of two correlation matrices using paired high-dimensional datasets. We consider test statistics based on the average of squares, maximum and sum of exceedances of Fisher-transformed sample correlations, and we derive approximate null distributions using asymptotic and non-parametric approaches. Theoretical results on the power of the tests are presented and backed up by a range of simulation experiments. We apply the methodology to a case study of colorectal tumour gene expression data with the aim of discovering biological pathway lists of genes that present significantly different correlation matrices in healthy and tumour samples. We find strong evidence that, for a large proportion of the pathway lists, the correlation matrices change between the two medical conditions.

[20]  arXiv:1803.06673 [pdf, ps, other]
Title: Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms
Subjects: Computation (stat.CO); Methodology (stat.ME)

The expectation-maximization (EM) algorithm is a well-known iterative method for computing maximum likelihood estimates from incomplete data. Despite its numerous advantages, a main drawback of the EM algorithm is its frequently observed slow convergence, which often hinders the application of EM algorithms in high-dimensional problems or in other complex settings. To address the need for more rapidly convergent EM algorithms, we describe a new class of acceleration schemes that build on the Anderson acceleration technique for speeding up fixed-point iterations. Our approach is effective at greatly accelerating the convergence of EM algorithms and is automatically scalable to high-dimensional settings. Through the introduction of periodic algorithm restarts and a damping factor, our acceleration scheme provides faster and more robust convergence when compared to unmodified Anderson acceleration, while also improving global convergence. Crucially, our method works as an "off-the-shelf" method in that it may be directly used to accelerate any EM algorithm without relying on any model-specific features or insights. Through a series of simulation studies involving five representative problems, we show that our algorithm is substantially faster than existing state-of-the-art acceleration schemes.
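
To make the generic scheme concrete, here is a minimal sketch of damped Anderson acceleration with finite memory and a crude restart rule for a fixed-point map `G` (one EM update); the paper's monotonicity control compares log-likelihoods, which this sketch replaces with a simple finiteness check, so it illustrates the mechanics rather than the authors' exact algorithm.

    import numpy as np

    def anderson_em(G, x0, max_iter=500, m=5, beta=0.7, tol=1e-8):
        """Damped Anderson acceleration with restarts for a fixed-point map G."""
        x = np.asarray(x0, dtype=float)
        X_hist, F_hist = [], []               # recent iterates and residuals
        for _ in range(max_iter):
            gx = G(x)
            f = gx - x                        # fixed-point residual
            if np.linalg.norm(f) < tol:
                break
            X_hist.append(x); F_hist.append(f)
            if len(F_hist) > m + 1:           # finite memory of size m
                X_hist.pop(0); F_hist.pop(0)
            if len(F_hist) < 2:
                x = gx                        # plain EM step to start
                continue
            dF = np.diff(np.array(F_hist), axis=0).T
            dX = np.diff(np.array(X_hist), axis=0).T
            gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
            x_new = x + beta * f - (dX + beta * dF) @ gamma
            if not np.all(np.isfinite(x_new)):
                X_hist, F_hist = [], []       # restart: drop history,
                x_new = gx                    # fall back to the plain EM step
            x = x_new
        return x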

[21]  arXiv:1803.06675 [pdf, other]
Title: Rare Feature Selection in High Dimensions
Comments: 32 pages, 9 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)

It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity.
We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.

[22]  arXiv:1803.06711 [pdf, other]
Title: A Dynamic Additive and Multiplicative Effects Model with Application to the United Nations Voting Behaviors
Subjects: Applications (stat.AP)

In this paper, we introduce a statistical regression model for discrete-time networks that are correlated over time. Our model is a dynamic version of a Gaussian additive and multiplicative effects (DAME) model which extends the latent factor network model of Hoff (2009) and the additive and multiplicative effects model of Minhas et al. (2016a), by incorporating the temporal correlation structure into the prior specifications of the parameters. The temporal evolution of the network is modeled through a Gaussian process (GP) as in Durante and Dunson (2013), where we estimate the unknown covariance structure from the dataset. We analyze the United Nations General Assembly voting data from 1983 to 2014 (Voeten et al., 2016) and show the effectiveness of our model at inferring the dyadic dependence structure among the international voting behaviors as well as allowing for a varying number of nodes over time. Overall, the DAME model shows significantly better fit to the dataset compared to alternative approaches. Moreover, after controlling for other dyadic covariates such as geographic distances and bilateral trade between countries, the model-estimated additive effects, multiplicative effects, and their movements reveal interesting and meaningful foreign policy positions and alliances of various countries.

[23]  arXiv:1803.06716 [pdf, ps, other]
Title: High Dimensional Linear Regression using Lattice Basis Reduction
Subjects: Statistics Theory (math.ST); Probability (math.PR); Machine Learning (stat.ML)

We consider a high dimensional linear regression problem where the goal is to efficiently recover an unknown vector $\beta^*$ from $n$ noisy linear observations $Y=X\beta^*+W \in \mathbb{R}^n$, for known $X \in \mathbb{R}^{n \times p}$ and unknown $W \in \mathbb{R}^n$. Unlike most of the literature on this model, we make no sparsity assumption on $\beta^*$. Instead we adopt a regularization based on assuming that the underlying vectors $\beta^*$ have rational entries with the same denominator $Q \in \mathbb{Z}_{>0}$. We call this the $Q$-rationality assumption.
We propose a new polynomial-time algorithm for this task which is based on the seminal Lenstra-Lenstra-Lovasz (LLL) lattice basis reduction algorithm. We establish that under the $Q$-rationality assumption, our algorithm recovers exactly the vector $\beta^*$ for a large class of distributions for the iid entries of $X$ and non-zero noise $W$. We prove that it is successful under small noise, even when the learner has access to only one observation ($n=1$). Furthermore, we prove that in the case of the Gaussian white noise for $W$, $n=o\left(p/\log p\right)$ and $Q$ sufficiently large, our algorithm tolerates a nearly optimal information-theoretic level of the noise.

[24]  arXiv:1803.06730 [pdf, other]
Title: Combining Probabilistic Load Forecasts
Comments: Submitted to IEEE Transactions on Smart Grid
Subjects: Applications (stat.AP)

Probabilistic load forecasts provide comprehensive information about future load uncertainties. In recent years, many methodologies and techniques have been proposed for probabilistic load forecasting. Forecast combination, a widely recognized best practice in the point forecasting literature, has never been formally adopted to combine probabilistic load forecasts. This paper proposes a constrained quantile regression averaging (CQRA) method to create an improved ensemble from several individual probabilistic forecasts. We formulate the CQRA parameter estimation problem as a linear program with the objective of minimizing the pinball loss, under the constraints that the parameters are nonnegative and sum to one. We demonstrate the effectiveness of the proposed method using two publicly available datasets, the ISO New England data and Irish smart meter data. Compared with the best individual probabilistic forecast, the ensemble can reduce the pinball score by 4.39% on average. The proposed ensemble also demonstrates superior performance over nine other benchmark ensembles.
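
The LP formulation is standard once slack variables are introduced for the positive and negative parts of each residual. A minimal scipy sketch for a single quantile level (the paper estimates weights across many levels; the data below are synthetic):

    import numpy as np
    from scipy.optimize import linprog

    def cqra_weights(Q, y, tau):
        """Constrained quantile regression averaging at quantile level tau.

        Q: (n, K) member forecasts of the tau-quantile; y: (n,) realisations.
        Minimises pinball loss over weights w >= 0 with sum(w) = 1, using
        slacks u, v >= 0 such that y - Q w = u - v.
        """
        n, K = Q.shape
        c = np.concatenate([np.zeros(K), tau * np.ones(n), (1 - tau) * np.ones(n)])
        A_eq = np.block([[Q, np.eye(n), -np.eye(n)],
                         [np.ones((1, K)), np.zeros((1, 2 * n))]])
        b_eq = np.concatenate([y, [1.0]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        return res.x[:K]

    rng = np.random.default_rng(0)
    y = rng.normal(size=200)
    Q = np.column_stack([np.quantile(y, 0.9) + rng.normal(0, s, 200)
                         for s in (0.1, 0.5)])   # two noisy quantile forecasters
    print(cqra_weights(Q, y, tau=0.9))           # more weight on the sharper one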

[25]  arXiv:1803.06732 [pdf, ps, other]
Title: A class of asymmetric regression models for left-censored data
Comments: 18 pages, 2 figures
Subjects: Methodology (stat.ME)

A common assumption regarding the standard tobit model is the normality of the error distribution. However, asymmetry and bimodality may be present and alternative tobit models must be used. In this paper, we propose a tobit model based on the class of log-symmetric distributions, which includes as special cases heavy and light tailed distributions and bimodal distributions. We implement a likelihood-based approach for parameter estimation and derive a type of residual. We then discuss the problem of performing testing inference in the proposed class by using the likelihood ratio and gradient statistics, which are particularly convenient for tobit models, as they do not require the information matrix. A thorough Monte Carlo study is presented to evaluate the performance of the maximum likelihood estimators and the likelihood ratio and gradient tests. Finally, we illustrate the proposed methodology by using a real-world data set.

[26]  arXiv:1803.06735 [pdf, other]
Title: Bayesian ROC surface estimation under verification bias
Subjects: Applications (stat.AP); Methodology (stat.ME)

The Receiver Operating Characteristic (ROC) surface is a generalization of the ROC curve and is widely used for assessing the accuracy of diagnostic tests with three categories. A complication called verification bias, meaning that not all subjects have their true disease status verified, often occurs in real applications of ROC analysis. This is a common problem since the gold standard test, which is used to determine true disease status, can be invasive and expensive. In this paper, we propose a Bayesian approach for estimating the ROC surface based on continuous data under a semi-parametric trinormality assumption, and we extend it to situations where verification bias is present. We compute the posterior distribution of the parameters under the trinormality assumption by using a rank-based likelihood. Consistency of the posterior under mild conditions is also established. We compare our method with existing methods for estimating the ROC surface and conclude that it performs well in terms of accuracy.

[27]  arXiv:1803.06738 [pdf, other]
Title: Large-Scale Dynamic Predictive Regressions
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistical Finance (q-fin.ST)

We develop a novel "decouple-recouple" dynamic predictive strategy and contribute to the literature on forecasting and economic decision making in a data-rich environment. Under this framework, clusters of predictors generate different latent states in the form of predictive densities that are later synthesized within an implied time-varying latent factor model. As a result, the latent inter-dependencies across predictive densities and biases are sequentially learned and corrected. Unlike sparse modeling and variable selection procedures, we do not assume a priori that there is a given subset of active predictors which characterize the predictive density of a quantity of interest. We test our procedure by investigating the predictive content of a large set of financial ratios and macroeconomic variables on both the equity premium across different industries and the inflation rate in the U.S., two contexts of topical interest in finance and macroeconomics. We find that our predictive synthesis framework generates both statistically and economically significant out-of-sample benefits while maintaining interpretability of the forecasting variables. In addition, the main empirical results highlight that our proposed framework outperforms LASSO-type shrinkage regressions, factor-based dimension reduction, sequential variable selection, and equal-weighted linear pooling methodologies.

[28]  arXiv:1803.06763 [pdf, other]
Title: STatistical Election to Partition Sequentially (STEPS) and Its Application in Differentially Private Release and Analysis of Youth Voter Registration Data
Comments: 22 pages, 2 figures
Subjects: Applications (stat.AP)

Voter data is important in political science research and applications such as improving youth voter turnout. Privacy protection is imperative in voter data since it often contains sensitive individual information. Differential privacy (DP) formalizes privacy in probabilistic terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce synthetic data in the DP setting. However, statistical efficiency of the synthetic data via DIPS can be low due to the potentially large amount of noise injected to satisfy DP, especially in high-dimensional data. We propose a new DIPS approach, STatistical Election to Partition Sequentially (STEPS), that sequentially partitions data by attributes according to how well they differentiate the data variability. Additionally, we propose a metric, SPECKS, that effectively assesses the similarity of synthetic data to the actual data. The application of the STEPS procedure on the 2000-2012 Current Population Survey youth voter data suggests STEPS is easy to implement and better preserves the original information than some DIPS approaches, including the Laplace mechanism on the full cross-tabulation of the data and the hierarchical histograms generated via random partitioning.

[29]  arXiv:1803.06790 [pdf, other]
Title: Towards "simultaneous selective inference": post-hoc bounds on the false discovery proportion
Subjects: Statistics Theory (math.ST)

Some pitfalls of the false discovery rate (FDR) as an error criterion for multiple testing of $n$ hypotheses include (a) committing to an error level $q$ in advance limits its use in exploratory data analysis, and (b) controlling the false discovery proportion (FDP) on average provides no guarantee on its variability. We take a step towards overcoming these barriers using a new perspective we call "simultaneous selective inference." Many FDR procedures (such as Benjamini-Hochberg) can be viewed as carving out a $\textit{path}$ of potential rejection sets $\varnothing = \mathcal R_0 \subseteq \mathcal R_1 \subseteq \cdots \subseteq \mathcal R_n \subseteq [n]$, assigning some algorithm-dependent estimate $\widehat{\text{FDP}}(\mathcal R_k)$ to each one. Then, they choose $k^* = \max\{k: \widehat{\text{FDP}}(\mathcal R_k) \leq q\}$. We prove that for all these algorithms, given independent null p-values and a confidence level $\alpha$, either the same $\widehat{\text{FDP}}$ or a minor variant thereof bounds the unknown FDP to within a small explicit (algorithm-dependent) constant factor $c_{\text{alg}}(\alpha)$, uniformly across the entire path, with probability $1-\alpha$. Our bounds open up a middle ground between fully simultaneous inference (guarantees for all $2^n$ possible rejection sets), and fully selective inference (guarantees only for $\mathcal R_{k^*}$). They allow the scientist to $\textit{spot}$ one or more suitable rejection sets (Select Post-hoc On the algorithm's Trajectory), by picking data-dependent sizes or error-levels, after examining the entire path of $\widehat{\text{FDP}}(\mathcal R_k)$ and the uniform upper band on $\text{FDP}$. The price for the additional flexibility of spotting is small, for example the multiplier for BH corresponding to 95% confidence is approximately 2.
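
As a concrete instance of the path view, here is the Benjamini-Hochberg path in a few lines of numpy: $\widehat{\text{FDP}}(\mathcal R_k) = n\,p_{(k)}/k$ along the sorted p-values, with $k^*$ the largest $k$ meeting the target level. The paper's contribution is the uniform confidence band over this whole path, which this sketch does not compute.

    import numpy as np

    def bh_path(pvals):
        """FDP-hat along the Benjamini-Hochberg path of nested rejection sets.

        R_k = indices of the k smallest p-values; FDP-hat(R_k) = n * p_(k) / k.
        """
        p = np.sort(np.asarray(pvals))
        return len(p) * p / np.arange(1, len(p) + 1)

    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(size=90), np.full(10, 1e-4)])  # 10 signals
    fdp_hat, q = bh_path(pvals), 0.1
    below = np.nonzero(fdp_hat <= q)[0]
    k_star = int(below.max()) + 1 if below.size else 0   # BH rejects the k* smallest
    print(k_star)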

[30]  arXiv:1803.06823 [pdf, ps, other]
Title: Nonparametric forecasting of multivariate probability density functions
Subjects: Methodology (stat.ME)

The study of dependence between random variables is the core of theoretical and applied statistics. Static and dynamic copula models are useful for describing the dependence structure, which is fully encoded in the copula probability density function. However, these models are not always able to describe the temporal change of the dependence patterns, which is a key characteristic of financial data. We propose a novel nonparametric framework for modelling a time series of copula probability density functions, which allows the entire function to be forecast without the need for post-processing procedures to guarantee positiveness and unit integral. We exploit a suitable isometry that allows us to transfer the analysis to a subset of the space of square-integrable functions, where we build on nonparametric functional data analysis techniques to perform the analysis. The framework does not assume the densities to belong to any parametric family and can be successfully applied also to general multivariate probability density functions with bounded or unbounded support. Finally, a noteworthy field of application pertains to the study of time-varying networks represented through vine copula models. We apply the proposed methodology to estimate and forecast the time-varying dependence structure between the S\&P500 and NASDAQ indices.

[31]  arXiv:1803.06852 [pdf, ps, other]
Title: Confounder Detection in High Dimensional Linear Models using First Moments of Spectral Measures
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

In this paper, we study the confounder detection problem in the linear model, where the target variable $Y$ is predicted using its $n$ potential causes $X_n=(x_1,...,x_n)^T$. Based on the assumption of a rotation-invariant generating process for the model, a recent study shows that the spectral measure induced by the regression coefficient vector with respect to the covariance matrix of $X_n$ is close to a uniform measure in purely causal cases, but differs from a uniform measure characteristically in the presence of a scalar confounder. Analyzing the spectral measure pattern could therefore help to detect confounding. In this paper, we propose to use the first moment of the spectral measure for confounder detection. We calculate the first moment of the spectral measure induced by the regression vector and compare it with the first moment of a uniform spectral measure, both defined with respect to the covariance matrix of $X_n$. The two moments coincide in non-confounding cases and differ from each other in the presence of confounding. This statistical causal-confounding asymmetry can be used for confounder detection. Since it does not require analyzing the full spectral measure pattern, our method avoids the difficulty of metric choice and multiple-parameter optimization. Experiments on synthetic and real data demonstrate the performance of this method.
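
Under my reading of the abstract (an assumption on my part, not the authors' code), the two moments reduce to simple quadratic forms: the induced first moment is $\beta^\top \Sigma \beta / \beta^\top \beta$ and the uniform one is $\mathrm{tr}(\Sigma)/n$. A numpy sketch of that comparison:

    import numpy as np

    def spectral_moment_gap(X, y):
        """Gap between beta-induced and uniform spectral first moments.

        Sigma = cov(X); beta = least-squares coefficients. The induced measure
        puts mass <beta, u_j>^2 / ||beta||^2 on eigenvalue lambda_j, so its
        first moment is beta' Sigma beta / beta' beta; the uniform measure has
        first moment trace(Sigma) / n. A large gap suggests confounding.
        """
        Xc, yc = X - X.mean(axis=0), y - y.mean()
        Sigma = np.cov(X, rowvar=False)
        beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        induced = beta @ Sigma @ beta / (beta @ beta)
        return induced - np.trace(Sigma) / Sigma.shape[0]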

[32]  arXiv:1803.06907 [pdf, ps, other]
Title: Auxiliary information : the raking-ratio empirical process
Comments: 45 pages
Subjects: Statistics Theory (math.ST)

We study the empirical measure associated with a sample of size $n$ and modified by $N$ iterations of the raking-ratio method. The empirical measure is adjusted to match the true probability of sets in a finite partition which changes at each step. We establish asymptotic properties of the raking-ratio empirical process indexed by functions as $n\rightarrow +\infty$, for $N$ fixed. A closed-form expression for the limiting covariance matrices is derived as $N\rightarrow +\infty$. The nonasymptotic Gaussian approximation we use also yields uniform Berry-Esseen type bounds in $n, N$ and sharp estimates of the uniform quadratic risk reduction. In the case of the two-way contingency table, the formulas characterizing the limiting process are very simple.

[33]  arXiv:1803.06959 [pdf, other]
Title: On the importance of single directions for generalization
Comments: ICLR 2018 conference paper
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.

[34]  arXiv:1803.06964 [pdf, other]
Title: A modern maximum-likelihood theory for high-dimensional logistic regression
Comments: 25 pages, 12 figures, 4 tables
Subjects: Statistics Theory (math.ST); Methodology (stat.ME)

Every student in statistics or data science learns early on that when the sample size largely exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large sample asymptotics, we are often told that we are on reasonably safe grounds when $n$ is large in such a way that $n \ge 5p$ or $n \ge 10p$. This paper shows that this is absolutely not the case. Consequently, inferences routinely produced by common software packages are unreliable.
Consider a logistic model with independent features in which $n$ and $p$ become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory, which asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We also demonstrate empirically that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure learning a single parameter from data and producing accurate inference.
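
A rough simulation sketch of the bias phenomenon, with dimensions and signal strength chosen for illustration rather than taken from the paper (for too-strong signal or too-large $p/n$ the MLE may fail to exist because of data separation):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, p = 2000, 400                                   # p/n fixed at 0.2
    beta = np.concatenate([np.full(p // 2, 4.0), np.zeros(p - p // 2)])
    X = rng.normal(size=(n, p)) / np.sqrt(n)           # N(0, 1/n) entries
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta))).astype(float)
    fit = sm.Logit(y, X).fit(disp=0, maxiter=500)
    # The fitted non-null coefficients systematically overshoot the truth (4).
    print(fit.params[: p // 2].mean())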

[35]  arXiv:1803.06969 [pdf, ps, other]
Title: Comparing Dynamics: Deep Neural Networks versus Glassy Systems
Comments: 5 figures
Subjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Learning (cs.LG)

We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are the complexity of the loss-landscape and of the dynamics within it, and to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.

[36]  arXiv:1803.06971 [pdf, other]
Title: What Doubling Tricks Can and Can't Do for Multi-Armed Bandits
Authors: Lilian Besson (IETR), Emilie Kaufmann (SEQUEL, CNRS)
Subjects: Machine Learning (stat.ML); Learning (cs.LG); Statistics Theory (math.ST)

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the "Doubling Trick". In the context of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (minimax) bounds in $R_T = O(\sqrt{T})$ but cannot conserve (distribution-dependent) bounds in $R_T = O(\log T)$. We give insights as to why exponential doubling tricks may be better, as they conserve bounds in $R_T = O(\log T)$, and are close to conserving bounds in $R_T = O(\sqrt{T})$.
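
The doubling trick itself is a few lines: restart a horizon-tuned algorithm on a growing schedule of horizons. A minimal sketch with hypothetical `make_algo(T)` and `play(algo, t)` interfaces, covering both the geometric schedule and a faster-growing "exponential" one (squaring, i.e. $T_{i+1} = T_i^2$, one instance of the schedules studied in the paper):

    def doubling_trick(make_algo, play, total_rounds, T0=100, geometric=True):
        """Run a horizon-dependent bandit algorithm anytime via restarts.

        make_algo(T) returns a fresh algorithm tuned for horizon T; play(algo, t)
        plays one round and returns its reward (both interfaces hypothetical).
        """
        t, T_i, rewards = 0, T0, []
        while t < total_rounds:
            algo = make_algo(T_i)                 # restart, forgetting the past
            for _ in range(min(T_i, total_rounds - t)):
                rewards.append(play(algo, t))
                t += 1
            T_i = T_i * 2 if geometric else T_i ** 2
        return rewards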

[37]  arXiv:1803.06989 [pdf, other]
Title: Numerical Integration on Graphs: where to sample and how to weigh
Subjects: Statistics Theory (math.ST); Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Let $G=(V,E,w)$ be a finite, connected graph with weighted edges. We are interested in the problem of finding a subset $W \subset V$ of vertices and weights $a_w$ such that $$\frac{1}{|V|}\sum_{v \in V} f(v) \sim \sum_{w \in W} a_w f(w)$$ for functions $f:V \rightarrow \mathbb{R}$ that are `smooth' with respect to the geometry of the graph. The main applications are problems where $f$ is known to somehow depend on the underlying graph but is expensive to evaluate on even a single vertex. We prove an inequality showing that the integration problem can be rewritten as a geometric problem (`the optimal packing of heat balls'). We discuss how one would construct approximate solutions of the heat ball packing problem; numerical examples demonstrate the efficiency of the method.

[38]  arXiv:1803.06992 [pdf]
Title: Estimating the intrinsic dimension of datasets by a minimal neighborhood information
Comments: Scientific Reports 2017
Subjects: Machine Learning (stat.ML); Learning (cs.LG)

Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the crude large number of coordinates. Such a manifold is generally twisted and curved; in addition, points on it are typically non-uniformly distributed: two factors that make the identification of the ID and its exploitation really hard. Here we propose a new ID estimator using only the distances to the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature and density variation, along with the resulting computational cost. The ID estimator is theoretically exact for uniformly distributed datasets and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world data sets. We demonstrate the usefulness of the approach on molecular simulations and image analysis.
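
The estimator the abstract describes admits a very short implementation: for each point take the ratio mu = r2/r1 of its second- to first-neighbor distance, which under the estimator's model is Pareto distributed with shape equal to the ID, giving the maximum-likelihood estimate below. A sketch without the block analysis or any outlier trimming:

    import numpy as np
    from scipy.spatial import cKDTree

    def two_nn_id(X):
        """Intrinsic-dimension estimate from first/second nearest neighbours.

        mu_i = r2_i / r1_i is Pareto(d) under the model, so the MLE is
        d = N / sum(log mu_i).
        """
        dist, _ = cKDTree(X).query(X, k=3)       # column 0 is the point itself
        mu = dist[:, 2] / dist[:, 1]
        return len(X) / np.log(mu).sum()

    # Toy check: a 2-d plane embedded in 10 ambient dimensions.
    rng = np.random.default_rng(0)
    X = np.zeros((2000, 10))
    X[:, :2] = rng.uniform(size=(2000, 2))
    print(two_nn_id(X))                          # close to 2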

[39]  arXiv:1803.07018 [pdf, other]
Title: Bayesian design of experiments for intractable likelihood models using coupled auxiliary models and multivariate emulation
Subjects: Methodology (stat.ME)

A Bayesian design is given by maximising the expected utility over the design space. The utility is chosen to represent the aim of the experiment and its expectation is taken with respect to all unknowns: responses, parameters and/or models. Although straightforward in principle, there are several challenges to finding Bayesian designs in practice. Firstly, the expected utility is rarely available in closed form and requires approximation. Secondly, the expected utility needs to be maximised over a, potentially, high-dimensional design space. In the case of intractable likelihood models, these problems are compounded by the fact that the likelihood function, whose evaluation is required to approximate the expected utility, is not available in closed form. A strategy is proposed to find Bayesian designs for intractable likelihood models. It relies on the development of new methodology involving auxiliary modelling to approximate the expected utility, under an intractable likelihood model, applied with the latest approaches to maximising approximated expected utilities.

[40]  arXiv:1803.07054 [pdf, ps, other]
Title: Optimal link prediction with matrix logistic regression
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

We consider the problem of link prediction, based on partial observation of a large network, and on side information associated to its vertices. The generative model is formulated as a matrix logistic regression. The performance of the model is analysed in a high-dimensional regime under a structural assumption. The minimax rate for the Frobenius-norm risk is established and a combinatorial estimator based on the penalised maximum likelihood approach is shown to achieve it. Furthermore, it is shown that this rate cannot be attained by any (randomised) algorithm computable in polynomial time under a computational complexity assumption.

[41]  arXiv:1803.07069 [pdf, ps, other]
Title: Testing normality via a distributional fixed point property in the Stein characterization
Subjects: Methodology (stat.ME)

We propose two families of tests for the classical goodness-of-fit problem to univariate normality. The new procedures are based on $L^2$-distances of the empirical zero-bias transformation to the normal distribution or the empirical distribution of the data, respectively. Weak convergence results are derived under the null hypothesis, under fixed alternatives as well as under contiguous alternatives. Empirical critical values are provided and a comparative finite-sample power study shows the competitiveness to classical procedures.

Cross-lists for Tue, 20 Mar 18

[42]  arXiv:1803.05999 (cross-list from cs.LG) [pdf, other]
Title: Escaping Saddles with Stochastic Gradients
Subjects: Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We analyze the variance of stochastic gradients along negative curvature directions in certain non-convex machine learning models and show that stochastic gradients exhibit a strong component along these directions. Furthermore, we show that - contrary to the case of isotropic noise - this variance is proportional to the magnitude of the corresponding eigenvalues and not decreasing in the dimensionality. Based upon this observation we propose a new assumption under which we show that the injection of explicit, isotropic noise usually applied to make gradient descent escape saddle points can successfully be replaced by a simple SGD step. Additionally - and under the same condition - we derive the first convergence rate for plain SGD to a second-order stationary point in a number of iterations that is independent of the problem dimension.

[43]  arXiv:1803.06373 (cross-list from cs.LG) [pdf, ps, other]
Title: Adversarial Logit Pairing
Comments: 10 pages
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we develop improved techniques for defending against adversarial examples at scale. First, we implement the state of the art version of adversarial training at unprecedented scale on ImageNet and investigate whether it remains effective in this setting - an important open scientific question (Athalye et al., 2018). Next, we introduce enhanced defenses using a technique we call logit pairing, a method that encourages logits for pairs of examples to be similar. When applied to clean examples and their adversarial counterparts, logit pairing improves accuracy on adversarial examples over vanilla adversarial training; we also find that logit pairing on clean examples only is competitive with adversarial training in terms of accuracy on two datasets. Finally, we show that adversarial logit pairing achieves the state of the art defense on ImageNet against PGD white box attacks, with an accuracy improvement from 1.5% to 27.9%. Adversarial logit pairing also successfully damages the current state of the art defense against black box attacks on ImageNet (Tramer et al., 2018), dropping its accuracy from 66.6% to 47.1%. With this new accuracy drop, adversarial logit pairing ties with Tramer et al. (2018) for the state of the art on black box attacks on ImageNet.
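
The pairing term is simply a penalty tying clean and adversarial logits together. A minimal PyTorch sketch with hypothetical `eps` and `lam` values, and with a single FGSM step standing in for the PGD attack used in the paper:

    import torch
    import torch.nn.functional as F

    def alp_loss(model, x_clean, y, eps=0.03, lam=0.5):
        """Adversarial logit pairing loss, minimal sketch."""
        x_clean = x_clean.detach().requires_grad_(True)
        logits_clean = model(x_clean)
        # One FGSM step generates the adversarial counterparts (paper uses PGD).
        grad, = torch.autograd.grad(F.cross_entropy(logits_clean, y), x_clean)
        x_adv = (x_clean + eps * grad.sign()).clamp(0, 1).detach()

        logits_clean = model(x_clean.detach())
        logits_adv = model(x_adv)
        # Adversarial training term plus the term pulling the logits together.
        return F.cross_entropy(logits_adv, y) + lam * F.mse_loss(logits_adv, logits_clean)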

[44]  arXiv:1803.06386 (cross-list from cs.LG) [pdf]
Title: Forecasting Economics and Financial Time Series: ARIMA vs. LSTM
Comments: 19 pages, 2 figures, 1 diagram, 2 listings
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Forecasting time series data is an important subject in economics, business, and finance. Traditionally, several techniques are used to forecast the next lag of a time series, such as the univariate autoregressive (AR) model, the univariate moving average (MA) model, simple exponential smoothing (SES), and, most notably, the autoregressive integrated moving average (ARIMA) model with its many variations. In particular, ARIMA has demonstrated strong precision and accuracy in predicting the next lags of a time series. With recent advances in computational power and, more importantly, in machine learning methods such as deep learning, new algorithms have been developed to forecast time series data. The research question investigated in this article is whether and how newly developed deep learning-based algorithms for forecasting time series data, such as Long Short-Term Memory (LSTM), are superior to traditional algorithms. The empirical studies conducted and reported in this article show that deep learning-based algorithms such as LSTM outperform traditional algorithms such as ARIMA. More specifically, the average reduction in error rates obtained by LSTM is between 84 and 87 percent compared to ARIMA, indicating the superiority of LSTM. Furthermore, the number of training epochs was found to have no effect on the performance of the trained forecasting model, exhibiting truly random behavior.
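
The comparison protocol is easy to sketch: rolling one-step-ahead forecasts scored by RMSE. The sketch below shows the ARIMA side only (a recent statsmodels API and an illustrative order are assumed); a trained LSTM's one-step prediction would slot into the same loop.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def rolling_arima_rmse(series, order=(5, 1, 0), n_test=50):
        history = list(series[:-n_test])
        sq_errors = []
        for t in range(n_test):
            fit = ARIMA(history, order=order).fit()
            pred = fit.forecast(steps=1)[0]        # one-step-ahead forecast
            actual = series[len(series) - n_test + t]
            sq_errors.append((pred - actual) ** 2)
            history.append(actual)                 # roll the window forward
        return float(np.sqrt(np.mean(sq_errors)))

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=300))            # toy random-walk series
    print("ARIMA RMSE:", rolling_arima_rmse(y))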

[45]  arXiv:1803.06396 (cross-list from cs.LG) [pdf, other]
Title: Reviving and Improving Recurrent Back-Propagation
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we revisit the recurrent back-propagation (RBP) algorithm, discuss the conditions under which it applies, and show how to satisfy them in deep neural networks. We show that RBP can be unstable and propose two variants based on conjugate gradient on the normal equations (CG-RBP) and Neumann series (Neumann-RBP). We further investigate the relationship between Neumann-RBP and backpropagation through time (BPTT) and its truncated version (TBPTT). Our Neumann-RBP has the same time complexity as TBPTT but only requires constant memory, whereas TBPTT's memory cost scales linearly with the number of truncation steps. We examine all RBP variants along with BPTT and TBPTT in three different application domains: associative memory with continuous Hopfield networks, document classification in citation networks using graph neural networks, and hyperparameter optimization for fully connected networks. All experiments demonstrate that RBPs, especially the Neumann-RBP variant, are efficient and effective for optimizing convergent recurrent neural networks.
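
The Neumann-series idea at the heart of Neumann-RBP is compact: the implicit gradient through a fixed point requires $(I - J)^{-T} g$, where $J$ is the Jacobian of the recurrent map at its fixed point, and truncating $(I - J^T)^{-1} g = \sum_k (J^T)^k g$ gives a constant-memory iteration. A minimal numpy sketch on a toy linear map (illustrative, not the paper's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    J = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)   # spectral radius < 1 assumed
    g = rng.normal(size=n)                           # dL/dh at the fixed point

    def neumann_solve(J, g, K=50):
        v = g.copy()
        for _ in range(K):            # v <- g + J^T v accumulates sum_k (J^T)^k g
            v = g + J.T @ v
        return v

    exact = np.linalg.solve(np.eye(n) - J.T, g)
    print(np.linalg.norm(neumann_solve(J, g) - exact))   # small residual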

[46]  arXiv:1803.06401 (cross-list from econ.EM) [pdf, ps, other]
Title: Evaluating Conditional Cash Transfer Policies with Machine Learning Methods
Authors: Tzai-Shuen Chen
Subjects: Econometrics (econ.EM); Machine Learning (stat.ML)

This paper presents an out-of-sample prediction comparison between major machine learning models and a structural econometric model. Over the past decade, machine learning has established itself as a powerful tool in many prediction applications, but this approach is still not widely adopted in empirical economic studies. To evaluate the benefits of this approach, I use the most common machine learning algorithms, CART, C4.5, LASSO, random forest, and AdaBoost, to construct prediction models for a cash transfer experiment conducted by the Progresa program in Mexico, and I compare the prediction results with those of a previous structural econometric study. Two prediction tasks are performed in this paper: an out-of-sample forecast and a long-term within-sample simulation. For the out-of-sample forecast, both the mean absolute error and the root mean square error of the school attendance rates found by all machine learning models are smaller than those found by the structural model. Random forest and AdaBoost have the highest accuracy for the individual outcomes of all subgroups. For the long-term within-sample simulation, the structural model performs better than all of the machine learning models. The poor within-sample fit of the machine learning models results from the inaccuracy of the income and pregnancy prediction models. The results show that the machine learning models perform better than the structural model when there is ample data to learn from; when the data are limited, however, the structural model offers a more sensible prediction. The findings of this paper show promise for adopting machine learning in economic policy analyses in the era of big data.

[47]  arXiv:1803.06407 (cross-list from cs.LG) [pdf, other]
Title: Deep Component Analysis via Alternating Direction Neural Networks
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Despite a lack of theoretical understanding, deep neural networks have achieved unparalleled performance in a wide range of applications. On the other hand, shallow representation learning with component analysis is associated with rich intuition and theory, but smaller capacity often limits its usefulness. To bridge this gap, we introduce Deep Component Analysis (DeepCA), an expressive multilayer model formulation that enforces hierarchical structure through constraints on latent variables in each layer. For inference, we propose a differentiable optimization algorithm implemented using recurrent Alternating Direction Neural Networks (ADNNs) that enable parameter learning using standard backpropagation. By interpreting feed-forward networks as single-iteration approximations of inference in our model, we provide both a novel theoretical perspective for understanding them and a practical technique for constraining predictions with prior knowledge. Experimentally, we demonstrate performance improvements on a variety of tasks, including single-image depth prediction with sparse output constraints.

[48]  arXiv:1803.06441 (cross-list from eess.SP) [pdf, ps, other]
Title: A Novel Blaschke Unwinding Adaptive Fourier Decomposition based Signal Compression Algorithm with Application on ECG Signals
Subjects: Signal Processing (eess.SP); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)

This paper presents a novel signal compression algorithm based on the Blaschke unwinding adaptive Fourier decomposition (AFD). The Blaschke unwinding AFD is a newly developed signal decomposition theory that utilizes the Nevanlinna factorization and the maximal selection principle in each decomposition step, achieving a faster convergence rate with higher fidelity. The proposed compression algorithm is applied to the electrocardiogram signal. To assess its performance, in addition to the generic assessment criteria, we consider a less discussed criterion related to clinical needs: for the purpose of heart rate variability analysis, we evaluate how accurately the R-peak information is preserved. The experiments are conducted on the MIT-BIH arrhythmia benchmark database. The results show that the proposed algorithm performs better than other state-of-the-art approaches while also preserving the R-peak information well.
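
The clinically oriented criterion can be sketched independently of the codec: detect R peaks before and after compression and measure their displacement. Here reconstruct stands in for any compress/decompress pipeline (the paper's is the Blaschke unwinding AFD); the peak-detector settings are illustrative and a recent scipy is assumed.

    import numpy as np
    from scipy.signal import find_peaks

    def r_peak_shift_ms(ecg, reconstruct, fs=360.0):
        rec = reconstruct(ecg)
        peaks_ref, _ = find_peaks(ecg, distance=int(0.3 * fs), prominence=0.5)
        peaks_rec, _ = find_peaks(rec, distance=int(0.3 * fs), prominence=0.5)
        n = min(len(peaks_ref), len(peaks_rec))
        # mean absolute peak displacement in milliseconds (peaks paired naively)
        return 1000.0 * np.mean(np.abs(peaks_ref[:n] - peaks_rec[:n])) / fs

    # usage: r_peak_shift_ms(signal, my_codec_roundtrip)   # codec assumed given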

[49]  arXiv:1803.06443 (cross-list from cs.LG) [pdf, other]
Title: Decentralization Meets Quantization
Subjects: Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (cs.SY); Machine Learning (stat.ML)

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: quantization for low-bandwidth networks, and decentralization for high-latency networks. In this paper, we explore a natural question: can the combination of decentralization and quantization lead to a system that is robust to both bandwidth and latency?
Although the system implication of such a combination is straightforward, the underlying theoretical principle and algorithm design are challenging: simply quantizing the data sent in a decentralized training algorithm accumulates error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove that both converge at the rate of $O(1/\sqrt{nT})$, where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate of full-precision, centralized training. We evaluate our algorithms on training deep learning models, and find that our proposed algorithm significantly outperforms the best of the merely decentralized and merely quantized algorithms for networks with both high latency and low bandwidth.
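
The difference-compression strategy can be sketched in a few lines: a worker transmits a quantized update of the difference between its current model and the replica its neighbours already hold, so both sides stay synchronized up to one quantization step and errors do not compound. The quantizer, step sizes, and single-worker setting below are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def quantize(v, levels=16):
        """Illustrative uniform quantizer (scale transmitted alongside)."""
        scale = np.max(np.abs(v)) + 1e-12
        return np.round((v / scale) * levels) / levels * scale

    x = np.zeros(4)          # local model
    replica = np.zeros(4)    # what the neighbours believe this worker holds

    rng = np.random.default_rng(0)
    for t in range(100):
        x = x - 0.1 * rng.normal(size=4)     # stand-in for a local SGD step
        msg = quantize(x - replica)          # send the compressed *difference*
        replica = replica + msg              # both sides apply the same update

    print(np.linalg.norm(x - replica))       # stays small: error does not compound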

[50]  arXiv:1803.06449 (cross-list from physics.chem-ph) [pdf, other]
Title: Note: Variational Encoding of Protein Dynamics Benefits from Maximizing Latent Autocorrelation
Subjects: Chemical Physics (physics.chem-ph); Learning (cs.LG); Biological Physics (physics.bio-ph); Machine Learning (stat.ML)

As deep Variational Auto-Encoder (VAE) frameworks become more widely used for modeling biomolecular simulation data, we emphasize the capability of the VAE architecture to concurrently maximize the timescale of the latent space while inferring a reduced coordinate, which assists in finding slow processes according to the variational approach to conformational dynamics. We additionally provide evidence that the VDE framework (Hern\'{a}ndez et al., 2017), which uses this autocorrelation loss along with a time-lagged reconstruction loss, obtains a variationally optimized latent coordinate in comparison with related loss functions. We thus recommend leveraging the autocorrelation of the latent space while training neural network models of biomolecular simulation data to better represent slow processes.

[51]  arXiv:1803.06453 (cross-list from cs.LG) [pdf, other]
Title: Constrained Deep Learning using Conditional Gradient and Applications in Computer Vision
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

A number of results have recently demonstrated the benefits of incorporating various constraints when training deep architectures in vision and machine learning. The advantages range from guarantees for statistical generalization to better accuracy to compression. But support for general constraints within widely used libraries remains scarce, and their broader deployment within many applications that could benefit from them remains under-explored. Part of the reason is that stochastic gradient descent (SGD), the workhorse for training deep neural networks, does not natively deal well with constraints that have global scope. In this paper, we revisit a classical first-order scheme from numerical optimization, Conditional Gradients (CG), which has thus far had limited applicability in training deep models. We show via rigorous analysis how various constraints can be naturally handled by modifications of this algorithm. We provide convergence guarantees and show a suite of immediate benefits: from training ResNets with fewer layers but better accuracy simply by substituting in our version of CG, to faster training of GANs with 50% fewer epochs in image inpainting applications, to provably better generalization guarantees using efficiently implementable forms of recently proposed regularizers.
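
For a concrete picture of the building block, here is a minimal sketch of one conditional-gradient (Frank-Wolfe) step under an L1-ball constraint, the kind of globally scoped constraint that projection-based SGD handles awkwardly. The toy objective, radius, and step-size schedule are illustrative.

    import numpy as np

    def fw_step(w, grad, radius, t):
        # linear minimization oracle for the L1 ball: a signed, scaled basis vector
        i = np.argmax(np.abs(grad))
        s = np.zeros_like(w)
        s[i] = -radius * np.sign(grad[i])
        gamma = 2.0 / (t + 2.0)              # classical step-size schedule
        return (1 - gamma) * w + gamma * s   # convex combination stays feasible

    # usage on a toy quadratic: minimize ||w - b||^2 subject to ||w||_1 <= 1
    b = np.array([2.0, -0.3, 0.1])
    w = np.zeros(3)
    for t in range(200):
        w = fw_step(w, 2 * (w - b), radius=1.0, t=t)
    print(w, np.abs(w).sum())                # iterate satisfies |w|_1 <= 1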

[52]  arXiv:1803.06460 (cross-list from q-fin.PM) [pdf, other]
Title: Mean Reverting Portfolios via Penalized OU-Likelihood Estimation
Comments: 7 pages, 6 figures
Subjects: Portfolio Management (q-fin.PM); Optimization and Control (math.OC); Machine Learning (stat.ML)

We study an optimization-based approach to construct a mean-reverting portfolio of assets. Our objectives are threefold: (1) design a portfolio that is well-represented by an Ornstein-Uhlenbeck process with parameters estimated by maximum likelihood, (2) select portfolios with desirable characteristics of high mean reversion and low variance, and (3) select a parsimonious portfolio, i.e. find a small subset of a larger universe of assets that can be used for long and short positions. We present the full problem formulation, a specialized algorithm that exploits partial minimization, and numerical examples using both simulated and empirical price data.
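
The inner estimation step is easy to sketch: for fixed weights w, the portfolio series is fit as an Ornstein-Uhlenbeck process via the equivalent AR(1) regression. Only this step is shown (the paper's method additionally penalises the likelihood and optimises over sparse w); the names and toy data are illustrative.

    import numpy as np

    def ou_fit(prices, w, dt=1.0):
        s = prices @ w                          # candidate portfolio series
        x, y = s[:-1], s[1:]
        b, a = np.polyfit(x, y, 1)              # AR(1) regression: y ~ a + b x
        kappa = -np.log(max(b, 1e-12)) / dt     # mean-reversion speed
        mu = a / (1 - b)                        # long-run mean
        sigma2 = np.var(y - (a + b * x))        # innovation variance
        return kappa, mu, sigma2

    rng = np.random.default_rng(0)
    prices = np.cumsum(rng.normal(size=(500, 3)), axis=0)   # toy price paths
    print(ou_fit(prices, w=np.array([1.0, -1.0, 0.0])))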

[53]  arXiv:1803.06521 (cross-list from cs.LG) [pdf, ps, other]
Title: Learning Mixtures of Product Distributions via Higher Multilinear Moments
Comments: 57 pages, comments welcome
Subjects: Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Learning mixtures of $k$ binary product distributions is a central problem in computational learning theory, but one where there are wide gaps between the best known algorithms and lower bounds (even for restricted families of algorithms). We narrow many of these gaps by developing novel insights about how to reason about higher order multilinear moments. Our results include:
1) An $n^{O(k^2)}$ time algorithm for learning mixtures of binary product distributions, giving the first improvement on the $n^{O(k^3)}$ time algorithm of Feldman, O'Donnell and Servedio
2) An $n^{\Omega(\sqrt{k})}$ statistical query lower bound, improving on the $n^{\Omega(\log k)}$ lower bound that is based on connections to sparse parity with noise
3) An $n^{O(\log k)}$ time algorithm for learning mixtures of $k$ subcubes. This special case can still simulate many other hard learning problems, but is much richer than any of them alone. As a corollary, we obtain more flexible algorithms for learning decision trees under the uniform distribution, that work with stochastic transitions, when we are only given positive examples and with a polylogarithmic number of samples for any fixed $k$.
Our algorithms are based on a win-win analysis where we either build a basis for the moments or locate a degeneracy that can be used to simplify the problem, which we believe will have applications to other learning problems over discrete domains.

[54]  arXiv:1803.06531 (cross-list from cs.SY) [pdf, ps, other]
Title: Topology Estimation using Graphical Models in Multi-Phase Power Distribution Grids
Comments: 12 pages 9 figures
Subjects: Systems and Control (cs.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)

The distribution grid is the medium- and low-voltage part of a large power system. Structurally, the majority of distribution networks operate radially, such that energized lines form a collection of trees, i.e. a forest, with a substation at the root of each tree. The operational topology/forest may change from time to time; however, tracking these changes, though important for distribution grid operation and control, is hindered by limited real-time monitoring. This paper develops a learning framework to reconstruct the radial operational structure of the distribution grid from synchronized voltage measurements in the grid, subject to exogenous fluctuations in nodal power consumption. To detect operational lines, our learning algorithm uses conditional independence tests for continuous random variables that are applicable to a wide class of probability distributions of the nodal consumption, and to Gaussian injections in particular. Moreover, our algorithm applies to the practical case of unbalanced three-phase power flow. Algorithm performance is validated on AC power flow simulations over IEEE distribution grid test cases.

[55]  arXiv:1803.06561 (cross-list from cs.LG) [pdf, other]
Title: Multi-device, Multi-tenant Model Selection with GP-EI
Subjects: Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

Bayesian optimization is the core technique behind the emergence of AutoML, which holds the promise of automatically searching for models and hyperparameters to make machine learning techniques more accessible. As such services are moving towards the cloud, we ask: when multiple AutoML users share the same computational infrastructure, how should we allocate resources to maximize the "global happiness" of all users?
We focus on GP-EI, one of the most popular algorithms for automatic model selection and hyperparameter tuning, and develop a novel multi-device, multi-tenant extension that is aware of multiple computation devices and multiple users sharing the same set of computation devices. Theoretically, given $N$ users and $M$ devices, we obtain a regret bound of $O((\mathrm{MIU}(T,K) + M)\frac{N^2}{M})$, where $\mathrm{MIU}(T,K)$ refers to the maximal incremental uncertainty up to time $T$ for the covariance matrix $K$. Empirically, we evaluate our algorithm on two applications of automatic model selection, and show that it significantly outperforms the strategy of serving users independently. Moreover, when multiple computation devices are available, we achieve near-linear speedup when the number of users is much larger than the number of devices.
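
The acquisition function at the core of GP-EI is standard expected improvement. A minimal sketch for a maximisation problem follows; the paper's contribution, the multi-device multi-tenant scheduling around it, is not shown.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, best_so_far):
        """EI at candidates with GP posterior mean mu and std sigma."""
        sigma = np.maximum(sigma, 1e-12)
        z = (mu - best_so_far) / sigma
        return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

    # pick the candidate model/hyperparameter with the largest EI
    mus = np.array([0.80, 0.82, 0.75])      # posterior means of accuracy
    sigmas = np.array([0.02, 0.10, 0.01])   # posterior uncertainties
    print(np.argmax(expected_improvement(mus, sigmas, best_so_far=0.81)))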

[56]  arXiv:1803.06567 (cross-list from cs.LG) [pdf, other]
Title: A Dual Approach to Scalable Verification of Deep Networks
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that the outputs of the neural network will always behave in a certain way for a given class of inputs. Most previous work on this topic was limited in its applicability by the size of the network, the network architecture, and the complexity of the properties to be verified. In contrast, our framework applies to a much more general class of activation functions and of specifications on neural network inputs and outputs. We formulate verification as an optimization problem and solve a Lagrangian relaxation of it to obtain an upper bound on the verification objective. Our approach is anytime, i.e. it can be stopped at any time and a valid bound on the objective is obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.

[57]  arXiv:1803.06585 (cross-list from cs.LG) [pdf, other]
Title: Learning Long Term Dependencies via Fourier Recurrent Units
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Training recurrent neural networks for tasks with long-term dependencies is known to be challenging. One of the main reasons is the vanishing or exploding gradient problem, which prevents gradient information from propagating to early layers. In this paper we propose a simple recurrent architecture, the Fourier Recurrent Unit (FRU), that stabilizes the gradients arising in its training while providing stronger expressive power. Specifically, FRU summarizes the hidden states $h^{(t)}$ along the temporal dimension with Fourier basis functions. This allows gradients to easily reach any layer due to FRU's residual learning structure and the global support of trigonometric functions. We show that FRU has gradient lower and upper bounds independent of the temporal dimension. We also show the strong expressivity of the sparse Fourier basis, from which FRU derives its expressive power. Our experimental study also demonstrates that, with fewer parameters, the proposed architecture outperforms other recurrent architectures on many tasks.
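
The summarisation step can be sketched directly: project the hidden-state sequence onto a few cosine basis functions along time. This conveys only the flavour of the idea; the frequencies, normalisation, and the residual recurrent cell that FRU actually learns are omitted.

    import numpy as np

    def fourier_summary(h, freqs):
        """h: (T, d) hidden states; returns one (d,) summary per frequency."""
        T = h.shape[0]
        t = np.arange(T) / T
        return np.stack([(np.cos(2 * np.pi * f * t)[:, None] * h).mean(axis=0)
                         for f in freqs])

    rng = np.random.default_rng(0)
    h = rng.normal(size=(100, 4))
    print(fourier_summary(h, freqs=[0, 1, 2, 5]).shape)   # (4, 4)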

[58]  arXiv:1803.06586 (cross-list from cs.LG) [pdf, other]
Title: Structural query-by-committee
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

In this work, we describe a framework that unifies many different interactive learning tasks. We present a generalization of the query-by-committee active learning algorithm for this setting, and we study its consistency and rate of convergence, both theoretically and empirically, with and without noise.
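
For reference, the classic query-by-committee loop that the paper generalises can be sketched as follows: train a committee on bootstrap replicates of the labelled data and query the pool point with maximal committee disagreement. The committee construction and vote-entropy scoring here are illustrative choices, not the paper's generalised algorithm.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def qbc_query(X_lab, y_lab, X_pool, n_members=5, seed=0):
        rng = np.random.default_rng(seed)
        votes = []
        for m in range(n_members):
            idx = rng.integers(0, len(X_lab), len(X_lab))   # bootstrap replicate
            clf = DecisionTreeClassifier(random_state=m).fit(X_lab[idx], y_lab[idx])
            votes.append(clf.predict(X_pool))
        votes = np.stack(votes)                              # (members, pool)
        entropies = []
        for j in range(votes.shape[1]):                      # vote entropy per point
            _, counts = np.unique(votes[:, j], return_counts=True)
            p = counts / n_members
            entropies.append(-(p * np.log(p)).sum())
        return int(np.argmax(entropies))                     # index to label next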

[59]  arXiv:1803.06589 (cross-list from cs.LG) [pdf]
Title: Early Hospital Mortality Prediction using Vital Signals
Comments: 8 page, 5 figures, Submitted to IEEE&ACM CHASE 2018
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Early hospital mortality prediction is critical as intensivists strive to make efficient medical decisions about severely ill patients staying in intensive care units. As a result, various methods have been developed to address this problem based on clinical records. However, some laboratory test results take time to obtain and process. In this paper, we propose a novel method to predict mortality using features extracted from the heart signals of patients within the first hour of ICU admission. To predict the risk, quantitative features are computed from the heart rate signals of ICU patients, with each signal described in terms of 12 statistical and signal-based features. The extracted features are fed into eight classifiers: decision tree, linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM, and K-nearest neighbors (K-NN). To derive insight into the performance of the proposed method, several experiments have been conducted using the well-known clinical dataset Medical Information Mart for Intensive Care III (MIMIC-III). The experimental results demonstrate the capability of the proposed method in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The decision tree classifier satisfies both accuracy and interpretability better than the other classifiers, producing an F1-score and AUC equal to 0.91 and 0.93, respectively. This indicates that heart rate signals can be used to predict mortality in ICU patients, achieving performance comparable to existing predictors that rely on high-dimensional features from clinical records, which need to be processed and may contain missing information.
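
A hedged sketch of the feature pipeline: simple statistical and signal-based descriptors of the first-hour heart-rate series, fed to a decision tree. The twelve descriptors below are illustrative and need not match the paper's exact feature set.

    import numpy as np
    from scipy.stats import skew, kurtosis
    from sklearn.tree import DecisionTreeClassifier

    def hr_features(sig):
        diffs = np.diff(sig)
        return np.array([
            sig.mean(), sig.std(), sig.min(), sig.max(),
            np.median(sig), skew(sig), kurtosis(sig),
            diffs.mean(), diffs.std(),
            np.sqrt(np.mean(diffs**2)),            # RMSSD-like variability measure
            np.percentile(sig, 25), np.percentile(sig, 75),
        ])

    # toy usage with hypothetical arrays `signals` and `outcomes`:
    # X = np.stack([hr_features(s) for s in signals])
    # DecisionTreeClassifier(max_depth=5).fit(X, outcomes)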

[60]  arXiv:1803.06604 (cross-list from cs.LG) [pdf, ps, other]
Title: A Robust AUC Maximization Framework with Simultaneous Outlier Detection and Feature Selection for Positive-Unlabeled Classification
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

Positive-unlabeled (PU) classification is a common scenario in real-world applications such as healthcare, text classification, and bioinformatics, in which we only observe a few samples labeled as "positive" together with a large volume of "unlabeled" samples that may contain both positive and negative samples. Building a robust classifier for the PU problem is challenging, especially for complex data where negative samples dominate and mislabeled samples or corrupted features exist. To address these three issues, we propose a robust learning framework that unifies AUC maximization (a robust metric under biased labels), outlier detection (to exclude wrong labels), and feature selection (to exclude corrupted features). Generalization error bounds are provided for the proposed model; they give valuable insight into its theoretical performance and lead to useful practical guidance, e.g., to train a model, we find that the unlabeled samples included are sufficient as long as their number is comparable to the number of positive samples in the training process. Empirical comparisons and two real-world applications, on surgical site infection (SSI) and EEG seizure detection, are also conducted to show the effectiveness of the proposed model.

[61]  arXiv:1803.06727 (cross-list from cs.LG) [pdf, other]
Title: Aggregating Strategies for Long-term Forecasting
Comments: 20 pages, 4 figures
Subjects: Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)

This article investigates the application of aggregating algorithms to the problem of long-term forecasting. We examine the classic aggregating algorithms based on exponential reweighing. For Vovk's general aggregating algorithm we provide a generalization to long-term forecasting. For the basic special case of Vovk's algorithm we provide two modifications for long-term forecasting. The first is theoretically close to an optimal algorithm and is based on replication of independent copies; it provides a time-independent regret bound with respect to the best expert in the pool. The second is not optimal but is more practical, with an $O(\sqrt{T})$ regret bound, where $T$ is the length of the game.
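
The exponential-reweighing core shared by these algorithms is a few lines; the paper's long-term modifications (such as replication of independent copies) sit on top of this basic loop. Square loss and the learning rate eta are illustrative choices.

    import numpy as np

    def aggregate_forecasts(expert_preds, y_true, eta=0.5):
        """expert_preds: (T, K) expert forecasts; returns online aggregated forecasts."""
        T, K = expert_preds.shape
        log_w = np.zeros(K)
        out = np.empty(T)
        for t in range(T):
            w = np.exp(log_w - log_w.max())
            w /= w.sum()
            out[t] = w @ expert_preds[t]                # aggregated forecast
            losses = (expert_preds[t] - y_true[t])**2   # square loss per expert
            log_w -= eta * losses                       # exponential reweighing
        return out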

[62]  arXiv:1803.06773 (cross-list from cs.LG) [pdf, other]
Title: Composable Deep Reinforcement Learning for Robotic Manipulation
Comments: Videos: this https URL
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)

Model-free deep reinforcement learning has been shown to exhibit good performance in domains ranging from video games to simulated robotic manipulation and locomotion. However, model-free methods are known to perform poorly when the interaction time with the environment is limited, as is the case for most real-world robotic tasks. In this paper, we study how maximum entropy policies trained using soft Q-learning can be applied to real-world robotic manipulation. The application of this method to real-world manipulation is facilitated by two important features of soft Q-learning. First, soft Q-learning can learn multimodal exploration strategies by learning policies represented by expressive energy-based models. Second, we show that policies learned with soft Q-learning can be composed to create new policies, and that the optimality of the resulting policy can be bounded in terms of the divergence between the composed policies. This compositionality provides an especially valuable tool for real-world manipulation, where constructing new policies by composing existing skills can provide a large gain in efficiency over training from scratch. Our experimental evaluation demonstrates that soft Q-learning is substantially more sample efficient than prior model-free deep reinforcement learning methods, and that compositionality can be performed for both simulated and real-world tasks.

[63]  arXiv:1803.06795 (cross-list from cs.CV) [pdf, other]
Title: Nonlocal Low-Rank Tensor Factor Analysis for Image Restoration
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Low-rank signal modeling has been widely leveraged to capture non-local correlation in image processing applications. We propose a new method that employs low-rank tensor factor analysis for tensors generated by grouped image patches. The low-rank tensors are fed into the alternating direction method of multipliers (ADMM) to further improve image reconstruction. The motivating application is compressive sensing (CS), and a deep convolutional architecture is adopted to approximate the expensive matrix inversion in CS applications. An iterative algorithm based on this low-rank tensor factorization strategy, called NLR-TFA, is presented in detail. Experimental results on noiseless and noisy CS measurements demonstrate the superiority of the proposed approach, especially at low CS sampling rates.

[64]  arXiv:1803.06898 (cross-list from cs.CV) [pdf, other]
Title: A Mixture of Views Network with Applications to the Classification of Breast Microcalcifications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

In this paper we examine data fusion methods for multi-view data classification. We present a decision concept that explicitly takes into account the input multi-view structure, where for each case there is a different subset of relevant views. This data fusion concept, which we dub Mixture of Views, is implemented by a special-purpose neural network architecture. It is demonstrated on the task of classifying breast microcalcifications as benign or malignant based on CC and MLO mammography views. The single-view decisions are combined into a global decision in a data-driven way, according to the relevance of each view in a given case. The method is evaluated on a large multi-view dataset extracted from the standardized digital database for screening mammography (DDSM). The experimental results show that our method outperforms previously suggested fusion methods.

[65]  arXiv:1803.06905 (cross-list from cs.LG) [pdf, other]
Title: TBD: Benchmarking and Analyzing Deep Neural Network Training
Subjects: Learning (cs.LG); Machine Learning (stat.ML)

The recent popularity of deep neural networks (DNNs) has generated a lot of research interest in performing DNN-related computation efficiently. However, the primary focus is usually very narrow and limited to (i) inference, i.e. how to efficiently execute already-trained models, and (ii) image classification networks as the primary benchmark for evaluation.
Our primary goal in this work is to break this myopic view by (i) proposing a new benchmark for DNN training, called TBD (TBD is short for Training Benchmark for DNNs), that uses a representative set of DNN models that cover a wide range of machine learning applications: image classification, machine translation, speech recognition, object detection, adversarial networks, reinforcement learning, and (ii) by performing an extensive performance analysis of training these different applications on three major deep learning frameworks (TensorFlow, MXNet, CNTK) across different hardware configurations (single-GPU, multi-GPU, and multi-machine). TBD currently covers six major application domains and eight different state-of-the-art models.
We present a new toolchain for performance analysis of these models that combines the targeted usage of existing performance analysis tools, careful selection of new and existing metrics and methodologies to analyze the results, and utilization of domain-specific characteristics of DNN training. We also build a new set of tools for memory profiling in all three major frameworks; these much-needed tools can finally shed some light on precisely how much memory is consumed by different data structures (weights, activations, gradients, workspace) in DNN training. Using our tools and methodologies, we make several important observations and recommendations on where future research and optimization of DNN training should be focused.

[66]  arXiv:1803.06917 (cross-list from q-fin.ST) [pdf, other]
Title: Universal features of price formation in financial markets: perspectives from Deep Learning
Subjects: Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR); Machine Learning (stat.ML)

Using a large-scale Deep Learning approach applied to a high-frequency database containing billions of electronic market quotes and transactions for US equities, we uncover nonparametric evidence for the existence of a universal and stationary price formation mechanism relating the dynamics of supply and demand for a stock, as revealed through the order book, to subsequent variations in its market price. We assess the model by testing its out-of-sample predictions for the direction of price moves given the history of price and order flow, across a wide range of stocks and time periods. The universal price formation model is shown to exhibit a remarkably stable out-of-sample prediction accuracy across time, for a wide range of stocks from different sectors. Interestingly, these results also hold for stocks which are not part of the training sample, showing that the relations captured by the model are universal and not asset-specific.
The universal model, trained on data from all stocks, outperforms, in terms of out-of-sample prediction accuracy, asset-specific linear and nonlinear models trained on time series of any given stock, showing that the universal nature of price formation weighs in favour of pooling together financial data from various stocks, rather than designing asset- or sector-specific models as commonly done. Standard data normalizations based on volatility, price level or average spread, or partitioning the training data into sectors or categories such as large/small tick stocks, do not improve training results. On the other hand, inclusion of price and order flow history over many past observations is shown to improve forecasting performance, showing evidence of path-dependence in price dynamics.

[67]  arXiv:1803.06952 (cross-list from cs.LG) [pdf, other]
Title: Asymmetric kernel in Gaussian Processes for learning target variance
Comments: Accepted in Pattern Recognition Letters, 2018
Subjects: Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

This work incorporates the multi-modality of the data distribution into a Gaussian Process regression model. We approach the problem from a discriminative perspective by learning, jointly over the training data, the target space variance in the neighborhood of a certain sample through metric learning. We start by using data centers rather than all training samples. Subsequently, each center selects an individualized kernel metric. This enables each center to adjust the kernel space in its vicinity in correspondence with the topology of the targets, a multi-modal approach. We additionally add descriptiveness by allowing each center to learn a precision matrix. We demonstrate empirically the reliability of the model.

[68]  arXiv:1803.06978 (cross-list from cs.CV) [pdf, other]
Title: Improving Transferability of Adversarial Examples with Input Diversity
Comments: Submitted to ECCV 2018, code available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG); Machine Learning (stat.ML)

Though convolutional neural networks have achieved state-of-the-art performance on various vision tasks, they are extremely vulnerable to adversarial examples, which are obtained by adding human-imperceptible perturbations to the original images. Adversarial examples can thus be used as a useful tool to evaluate and select the most robust models in safety-critical applications. However, most of the existing adversarial attacks only achieve relatively low success rates under the challenging black-box setting, where the attackers have no knowledge of the model structure and parameters. To this end, we propose to improve the transferability of adversarial examples by creating diverse input patterns. Instead of only using the original images to generate adversarial examples, our method applies random transformations to the input images at each iteration. Extensive experiments on ImageNet show that the proposed attack method can generate adversarial examples that transfer much better to different networks than existing baselines. To further improve the transferability, we (1) integrate the recently proposed momentum method into the attack process; and (2) attack an ensemble of networks simultaneously. By evaluating our method against top defense submissions and official baselines from the NIPS 2017 adversarial competition, this enhanced attack reaches an average success rate of 73.0%, which outperforms the top attack submission in the NIPS competition by a large margin of 6.6%. We hope that our proposed attack strategy can serve as a benchmark for evaluating the robustness of networks to adversaries and the effectiveness of different defense methods in the future. The code is publicly available at https://github.com/cihangxie/DI-2-FGSM.
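
The diversity transform itself is small enough to sketch: with probability p, randomly resize the image and randomly pad it back before computing the gradient. The sizes and the commented FGSM step below are illustrative; the authors' repository has the exact DI-2-FGSM settings.

    import torch
    import torch.nn.functional as F

    def input_diversity(x, low=224, high=256, p=0.5):
        if torch.rand(1).item() > p:
            return x                                   # keep the original image
        size = int(torch.randint(low, high, (1,)).item())
        resized = F.interpolate(x, size=(size, size), mode='nearest')
        pad = high - size
        left = int(torch.randint(0, pad + 1, (1,)).item())
        top = int(torch.randint(0, pad + 1, (1,)).item())
        # pad back to (high, high) with a random placement
        return F.pad(resized, (left, pad - left, top, pad - top))

    # one diverse-inputs FGSM step (model, y given; x requires_grad):
    # loss = F.cross_entropy(model(input_diversity(x)), y); loss.backward()
    # x_adv = x + eps * x.grad.sign()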

[69]  arXiv:1803.07055 (cross-list from cs.LG) [pdf, other]
Title: Simple random search provides a competitive approach to reinforcement learning
Comments: 22 pages, 5 figures, 9 tables
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

A common belief in model-free reinforcement learning is that methods based on random search in the parameter space of policies exhibit significantly worse sample complexity than those that explore the space of actions. We dispel such beliefs by introducing a random search method for training static, linear policies for continuous control problems, matching state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. Our method also finds a nearly optimal controller for a challenging instance of the Linear Quadratic Regulator, a classical problem in control theory, when the dynamics are not known. Computationally, our random search algorithm is at least 15 times more efficient than the fastest competing model-free methods on these benchmarks. We take advantage of this computational efficiency to evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. Our simulations highlight a high variability in performance in these benchmark tasks, suggesting that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms.
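
The method's core update is a few lines of numpy. The sketch below shows a basic random-search step for a linear policy; the state normalisation and top-performing-direction filtering used by the full algorithm are omitted, and rollout_reward is an assumed environment-rollout callable.

    import numpy as np

    def random_search_step(M, rollout_reward, n_dirs=8, nu=0.05, lr=0.02, rng=None):
        """One update of a linear policy M (actions = M @ state)."""
        rng = rng or np.random.default_rng()
        update = np.zeros_like(M)
        for _ in range(n_dirs):
            delta = rng.normal(size=M.shape)
            r_plus = rollout_reward(M + nu * delta)    # reward of +perturbation
            r_minus = rollout_reward(M - nu * delta)   # reward of -perturbation
            update += (r_plus - r_minus) * delta       # step along reward differences
        return M + (lr / n_dirs) * update

    # usage: M = random_search_step(M, lambda M: run_episode(env, M))  # env assumed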

[70]  arXiv:1803.07067 (cross-list from cs.LG) [pdf, other]
Title: Setting up a Reinforcement Learning Task with a Real-World Robot
Comments: Submitted to 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)

Reinforcement learning is a promising approach to developing hard-to-engineer adaptive solutions for complex and diverse robotic tasks. However, learning with real-world robots is often unreliable and difficult, which resulted in their low adoption in reinforcement learning research. This difficulty is worsened by the lack of guidelines for setting up learning tasks with robots. In this work, we develop a learning task with a UR5 robotic arm to bring to light some key elements of a task setup and study their contributions to the challenges with robots. We find that learning performance can be highly sensitive to the setup, and thus oversights and omissions in setup details can make effective learning, reproducibility, and fair comparison hard. Our study suggests some mitigating steps to help future experimenters avoid difficulties and pitfalls. We show that highly reliable and repeatable experiments can be performed in our setup, indicating the possibility of reinforcement learning research extensively based on real-world robots.

[71]  arXiv:1803.07068 (cross-list from cs.DC) [pdf, other]
Title: D$^2$: Decentralized Training over Decentralized Data
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG); Machine Learning (stat.ML)

While training a machine learning model using multiple workers, each of which collects data from its own data sources, it would be most useful when the data collected from different workers are unique and different. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are not too different. In this paper, we ask the question: can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? We present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D$^2$ is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.

Replacements for Tue, 20 Mar 18

[72]  arXiv:1505.03898 (replaced) [pdf, ps, other]
Title: Pinball Loss Minimization for One-bit Compressive Sensing: Convex Models and Algorithms
Comments: 11 pages
Subjects: Information Theory (cs.IT); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
[73]  arXiv:1606.01200 (replaced) [pdf, ps, other]
Title: Simple and Honest Confidence Intervals in Nonparametric Regression
Subjects: Applications (stat.AP); Statistics Theory (math.ST)
[74]  arXiv:1608.02199 (replaced) [pdf, other]
Title: An EM algorithm for absolutely continuous Marshall-Olkin bivariate Pareto distribution with location and scale
Comments: 17 pages, 4 figures
Subjects: Computation (stat.CO); Statistics Theory (math.ST)
[75]  arXiv:1609.04608 (replaced) [pdf, other]
Title: Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals
Authors: Andrés Hoyos-Idrobo (PARIETAL, NEUROSPIN), Gaël Varoquaux (PARIETAL, NEUROSPIN), Jonas Kahn, Bertrand Thirion (PARIETAL)
Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, In press
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[76]  arXiv:1609.06789 (replaced) [pdf, ps, other]
Title: Krigings Over Space and Time Based on Latent Low-Dimensional Structures
Comments: 35 pages, 2 figures
Subjects: Methodology (stat.ME)
[77]  arXiv:1704.02060 (replaced) [pdf, ps, other]
Title: Angle-Based Joint and Individual Variation Explained
Comments: arXiv admin note: text overlap with arXiv:1512.04060
Subjects: Machine Learning (stat.ML)
[78]  arXiv:1704.06762 (replaced) [pdf, ps, other]
Title: Estimation for multiplicative models under multinomial sampling
Authors: Antonio Forcina
Subjects: Statistics Theory (math.ST)
[79]  arXiv:1704.07531 (replaced) [pdf, other]
Title: Sufficient Markov Decision Processes with Alternating Deep Neural Networks
Comments: 31 pages, 3 figures, extended abstract in the proceedings of RLDM2017. (v2 revisions: Fixed a minor bug in the code w.r.t. setting seed, as a result numbers in the simulation experiments had some slight changes, but conclusions stayed the same. Corrected typos. Improved notations.)
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
[80]  arXiv:1704.07645 (replaced) [pdf, other]
Title: Bayesian nonparametric estimation of survival functions with multiple-samples information
Subjects: Methodology (stat.ME); Applications (stat.AP); Computation (stat.CO)
[81]  arXiv:1705.02511 (replaced) [pdf, other]
Title: A generalized Gaussian process model for computer experiments with binary time series
Comments: 47 pages, 4 figures
Subjects: Methodology (stat.ME)
[82]  arXiv:1705.07926 (replaced) [pdf, other]
Title: Upstream Causes of Downstream Effects
Subjects: Applications (stat.AP)
[83]  arXiv:1705.08580 (replaced) [pdf, other]
Title: Provable Estimation of the Number of Blocks in Block Models
Comments: 12 pages, 4 figure; AISTATS 2018
Subjects: Machine Learning (stat.ML); Methodology (stat.ME)
[84]  arXiv:1706.09152 (replaced) [pdf, other]
Title: Generative Bridging Network in Neural Sequence Prediction
Comments: Accepted to NAACL 2018
Subjects: Artificial Intelligence (cs.AI); Learning (cs.LG); Machine Learning (stat.ML)
[85]  arXiv:1707.00222 (replaced) [pdf, ps, other]
Title: Sample Size for Pilot Studies and Precision Driven Experiments
Subjects: Applications (stat.AP)
[86]  arXiv:1708.02582 (replaced) [pdf, other]
Title: Cascade Adversarial Machine Learning Regularized with a Unified Embedding
Comments: 16 pages, 9 figures, International Conference on Learning Representations (ICLR) 2018
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[87]  arXiv:1708.06992 (replaced) [pdf, other]
Title: Econométrie et Machine Learning
Comments: in French
Subjects: Other Statistics (stat.OT)
[88]  arXiv:1709.00483 (replaced) [pdf, other]
Title: Iteratively Linearized Reweighted Alternating Direction Method of Multipliers for a Class of Nonconvex Problems
Subjects: Numerical Analysis (cs.NA); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
[89]  arXiv:1709.01781 (replaced) [pdf, other]
Title: Parameterizations for Ensemble Kalman Inversion
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC); Methodology (stat.ME)
[90]  arXiv:1710.02238 (replaced) [pdf, other]
Title: How Much Chemistry Does a Deep Neural Network Need to Know to Make Accurate Predictions?
Comments: In Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
[91]  arXiv:1710.03830 (replaced) [pdf, other]
Title: Inference on Auctions with Weak Assumptions on Information
Subjects: Econometrics (econ.EM); Computer Science and Game Theory (cs.GT); Learning (cs.LG); Statistics Theory (math.ST)
[92]  arXiv:1710.06462 (replaced) [pdf, other]
Title: S-Isomap++: Multi Manifold Learning from Streaming Data
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[93]  arXiv:1710.06660 (replaced) [pdf, other]
Title: Variable selection for the prediction of C[0,1]-valued AR processes using RKHS
Comments: 40 pages, 6 figures
Subjects: Methodology (stat.ME)
[94]  arXiv:1710.09793 (replaced) [pdf, other]
Title: Statistical Inference on Tree Swallow Migrations with Random Forests
Comments: 32 pages, 9 figures. Work between Cornell Lab of Ornithology and University of Pittsburgh Department of Statistics. Substantially overhauled all sections for submission to JASA A&CS
Subjects: Populations and Evolution (q-bio.PE); Applications (stat.AP)
[95]  arXiv:1710.10570 (replaced) [pdf, other]
Title: Weight Initialization of Deep Neural Networks(DNNs) using Data Statistics
Comments: The work is currently under progress
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[96]  arXiv:1710.11070 (replaced) [pdf, ps, other]
Title: Convergence Rates of Latent Topic Models Under Relaxed Identifiability Conditions
Authors: Yining Wang
Comments: 26 pages, 1 table. Added significantly more expositions, and a numerical procedure to check the order of degeneracy. Proofs slightly altered with explicit constants given at various places
Subjects: Machine Learning (stat.ML); Learning (cs.LG)
[97]  arXiv:1710.11439 (replaced) [pdf, other]
Title: Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization
Comments: 5 pages, 3 figures, version that Eqs. (9), (19), and (20) in v2 (submitted to ICASSP 2018) are corrected. Samples available here: this http URL
Subjects: Sound (cs.SD); Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
[98]  arXiv:1711.06758 (replaced) [pdf, other]
Title: Improving particle filter performance by smoothing observations
Comments: 15 pages, 6 figures
Subjects: Applications (stat.AP)
[99]  arXiv:1712.01995 (replaced) [pdf]
Title: Short-Term Prediction of Signal Cycle in Actuated-Controlled Corridor Using Sparse Time Series Models
Subjects: Applications (stat.AP)
[100]  arXiv:1712.02034 (replaced) [pdf, other]
Title: SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties
Comments: Submitted to SIGKDD 2018
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Learning (cs.LG)
[101]  arXiv:1712.02734 (replaced) [pdf, other]
Title: Using Rule-Based Labels for Weak Supervised Learning: A ChemNet for Transferable Chemical Property Prediction
Comments: Submitted to SIGKDD 2018
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
[102]  arXiv:1712.09203 (replaced) [pdf, ps, other]
Title: Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
Comments: clarified that the sensing matrices can be symmetric wlog and revised the quadratic neural nets section
Subjects: Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
[103]  arXiv:1801.02950 (replaced) [pdf, other]
Title: Adversarial Deep Learning for Robust Detection of Binary Encoded Malware
Comments: 1ST Deep Learning and Security Workshop (co-located with the 39th IEEE Symposium on Security and Privacy)
Subjects: Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)
[104]  arXiv:1801.04590 (replaced) [pdf, other]
Title: Frame-Recurrent Video Super-Resolution
Comments: Accepted to CVPR 2018
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[105]  arXiv:1801.09269 (replaced) [pdf, ps, other]
Title: Wasserstein Riemannian Geometry of Positive Definite Matrices
Comments: Second version. Submitted
Subjects: Statistics Theory (math.ST)
[106]  arXiv:1802.00982 (replaced) [pdf, ps, other]
Title: Stochastic integral with respect to the mixed fractional Brownian motion and drift estimation of the mixed fractional Ornstein-Uhlenbeck process
Subjects: Statistics Theory (math.ST)
[107]  arXiv:1802.07444 (replaced) [pdf, other]
Title: Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Methodology (stat.ME); Machine Learning (stat.ML)
[108]  arXiv:1802.08667 (replaced) [pdf, ps, other]
Title: Double/De-Biased Machine Learning Using Regularized Riesz Representers
Comments: 15 pages; fixed several typos + updated references
Subjects: Machine Learning (stat.ML); Econometrics (econ.EM); Statistics Theory (math.ST)
[109]  arXiv:1802.08735 (replaced) [pdf, other]
Title: A DIRT-T Approach to Unsupervised Domain Adaptation
Comments: ICLR 2018
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)
[110]  arXiv:1803.01639 (replaced) [pdf, ps, other]
Title: When do we have the power to detect biological interactions in spatial point patterns?
Comments: Main text 18 pages on 12pt font, 4 figures. Appendix 7 pages
Subjects: Populations and Evolution (q-bio.PE); Methodology (stat.ME)
[111]  arXiv:1803.02626 (replaced) [pdf, other]
Title: Inferring health conditions from fMRI-graph data
Comments: V1: 35 pages, 5 figures, 2 tables. V2: 36 pages, 5 figures, 2 tables; partially rewritten all sections and added references
Subjects: Quantitative Methods (q-bio.QM); Neurons and Cognition (q-bio.NC); Applications (stat.AP)
[112]  arXiv:1803.04051 (replaced) [pdf, other]
Title: Representation Learning over Dynamic Graphs
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[113]  arXiv:1803.04926 (replaced) [src]
Title: Active Reinforcement Learning with Monte-Carlo Tree Search
Comments: 11 pages, 10 figures S. Schulze is in a sponsored PhD position which stipulates that the sponsor needs to be notified 30 days in advance of any publication and give his consent. As this has not happened to this point, we would like to withdraw the publication
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
[114]  arXiv:1803.05664 (replaced) [pdf, other]
Title: Conditional Model Selection in Mixed-Effects Models with cAIC4
Subjects: Computation (stat.CO); Applications (stat.AP)
[115]  arXiv:1803.06030 (replaced) [pdf, other]
Title: Estimation of lactate threshold with machine learning techniques in recreational runners
Comments: 33 pages, 16 figures
Journal-ref: Applied Soft Computing, 63, 181-196 (2018)
Subjects: Machine Learning (stat.ML)
[116]  arXiv:1803.06058 (replaced) [pdf, other]
Title: Constant-Time Predictive Distributions for Gaussian Processes
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
