\documentclass[nojss]{jss} % \VignetteIndexEntry{GET: Hotspot detection on a linear network} % \VignetteKeywords{false discovery rate, linear network, Monte Carlo test, R, road accidents, spatial point pattern} % \VignettePackage{GET} % \VignetteEngine{R.rsp::rsp} %% -- LaTeX packages and custom commands --------------------------------------- %% recommended packages \usepackage{thumbpdf,lmodern} %% another package (only for this demo article) \usepackage{framed} %% new custom commands \newcommand{\class}[1]{`\code{#1}'} \newcommand{\fct}[1]{\code{#1()}} \newcommand{\TT}{\mathbf{T}} \newcommand{\lo}{\mathrm{low}} \newcommand{\up}{\mathrm{upp}} \providecommand{\1}{\mathbf{1}} \newtheorem{definition}{Definition}[section] \newtheorem{remark}{Remark}[section] \usepackage{amsmath,amsfonts,amssymb} \makeatletter \setkeys{Gin}{width=\Gin@nat@width} \makeatother %% For Sweave-based articles about R packages: %% need no \usepackage{Sweave} %% -- Article metainformation (author, title, ...) ----------------------------- %% - \author{} with primary affiliation %% - \Plainauthor{} without affiliations %% - Separate authors by \And or \AND (in \author) or by comma (in \Plainauthor). %% - \AND starts a new line, \And does not. \author{Mari Myllym{\"a}ki\\Natural Resources Institute Finland (Luke) \AND Tom\'a\v s Mrkvi\v cka\\University of South Bohemia %, \v{C}esk\'e Bud\v{e}jovice, Czech Republic \AND Stanislav Kraft \\University of South Bohemia \AND Vojt\v ech Bla\v zek \\University of South Bohemia \AND Michal Konopa \\University of South Bohemia} \Plainauthor{Mari Myllym{\"a}ki et al.} %% - \title{} in title case %% - \Plaintitle{} without LaTeX markup (if any) %% - \Shorttitle{} with LaTeX markup (if any), used as running title \title{\pkg{GET}: Hotspot detection on a linear network} \Plaintitle{GET: Hotspot detection on a linear network} \Shorttitle{\pkg{GET}: Hotspot detection} %% - \Abstract{} almost as usual \Abstract{ This vignette describes and shows how the methodology proposed by \citet{MrkvickaEtal2023} for detecting hotspots on a linear network can be performed using the \proglang{R} package \pkg{GET} \citep{MyllymakiMrkvicka2024}. } %% - \Keywords{} with LaTeX markup, at least one required %% - \Plainkeywords{} without LaTeX markup (if necessary) %% - Should be comma-separated and in sentence case. \Keywords{false discovery rate, hotspot, linear network, Monte Carlo test, road accidents, \proglang{R}, spatial point pattern} \Plainkeywords{false discovery rate, hotspot, linear network, Monte Carlo test, road accidents, R, spatial point pattern} %% - \Address{} of at least one author %% - May contain multiple affiliations for each author %% (in extra lines, separated by \emph{and}\\). %% - May contain multiple authors for the same affiliation %% (in the same first line, separated by comma). \Address{ Mari Myllym{\"a}ki\\ Natural Resources Institute Finland (Luke)\\ Latokartanonkaari 9\\ FI-00790 Helsinki, Finland\\ E-mail: \email{mari.myllymaki@luke.fi}\\ URL: \url{https://www.luke.fi/en/experts/mari-myllymaki/} \\ \emph{and}\\ Tom\'a\v s Mrkvi\v cka \\ Faculty of Agriculture and Technology \\ University of South Bohemia, \\ Studentsk{\'a} 1668 \\ 37005 \v{C}esk\'e Bud\v{e}jovice, Czech Republic \\ E-mail: \email{mrkvicka.toma@gmail.com} \\ URL: \url{http://home.ef.jcu.cz/~mrkvicka/} } \begin{document} \section{Practical description of hotspots computation} The first step involves import data to \proglang{R}. The crashes are recorded as a point pattern, thus x and y coordinates together with window range must be provided. The same holds for crossroads that forms the vertices of the linear network. The edges of the linear network are provided in the form of matrix, where first column corresponds to the order of the crossroad where the edge starts and the second column corresponds to the order of the ending crossroad. Further, the covariate can be imported to R from tiff or raster format. When all files are prepared, the analysis can move to \proglang{R} with use of \pkg{spatstat}, \pkg{GET} and \pkg{parallel} packages. This vignette provides an example of hotspots detection that can be easily customized. \subsection{Estimating Poisson point pattern} The function \fct{pois.lppm}, can be used to estimate the inhomogeneous Poisson point process model on linear network. This function provides the \code{firstordermodel}, i.e. the regression model of dependence of crashes on the spatial covariates, \code{EIP}, i.e. estimated inhomogeneous intensity from the data and \code{secondorder}, i.e. estimation of the inhomogeneous $K$-function. The \code{plot} of the \code{secondorder} provides diagnostics, if the model is adequate for the data. If the estimated $K$-function lies close to the theoretical line, the data does not report any clustering, and the function \fct{hotspots.poislpp} can be used for final hotspots detection. If the estimated $K$-function does not lie close to the theoretical line, and it is above, the data report clustering, and the a clustered point pattern model must be fitted to the data and hotspots detected using this clustered model instead. The important input parameters to be specified for the function \fct{hotspots.poislpp} are \code{PP}, i.e., the point pattern used for estimation, \code{formula}, i.e., the linear regression formula specified as usually in \proglang{R} having \code{PP} on the right hand side of the formula (i.e., as the response variable), data, i.e., the object from which the formula takes the data. \subsection{Estimating Mat{\'e}rn cluster point pattern} The function \fct{MatClust.lppm}, can be used to estimate the Mat{\'e}rn cluster point pattern with inhomogeneous cluster centers on linear network. This function provides the same outputs as the \fct{pois.lppm} and further estimated parameters $\alpha$ and $R$. The \code{secondorder} provides again the diagnostics for checking if the clustered model is appropriate. The sample $K$-function must be close to the $K$-function of the estimated model (green line). If it is not the case the searching grid for parameters $\alpha$ and $R$ that is input in the function must be manipulated to get the a closer result. If the estimated model is adequate one can proceed to the hotspot detection with the use of the function \fct{hotspots.MatClustlpp}. Remark here, that for the estimation of the second order structure a smaller data can be used than for the estimation of the first order structure in order to save the computation time, since the second order is a local characteristics. Thus the input to this function can contain, in addition to the full data in PP that is used for first order estimation, a subwindow \code{subwin} to specify a smaller part of the full data for second order estimation. Furthermore, \code{valpha}, i.e., vector of proposed alphas which should be considered in the optimization, \code{vR}, i.e., vector of proposed Rs which should be considered in the optimization must be provided. The user can also specify how many cores should be used in the computation by parameter \code{ncores}. \subsection{Hotspot detection under the Poisson assumption} If the Poisson assumption is checked, the hotspots can be detected using the function \fct{hotspots.poislpp}. The plot of results contains the locations of determined hotspots together with their sizes. A parameter \code{sigma} must be provided in this function. It determines the bandwidth of the kernel used in the inhomogeneous intensity estimation. This parameter should be carefully selected with respect to the size of the window. It represents how much smoothing is applied on the intensity, it is too big, the inhomogeneity will be blurred away. If it is too small, the intensity will react on every event and the inhomogeneity will be too crazy. The \code{nsim} parameter specifies the number of simulations to perform the envelope. It should be as large as possible. Usually, the default of 10000 is fine. The argument \code{ncores} can be used to specify how many cores should be used for the computation. \subsection{Hotspot detection under the Mat{\'e}rn cluster assumption} If the Matérn cluster process was estimated and the Mat{\'e}rn cluster assumption was checked, the hotspots can be detected using the function \fct{hotspots.MatClustlpp}. This function has the same parameters as the previous one. Moreover, the estimated parameters $\alpha$ and $R$ from \fct{MatClust.lppm} must be provided. \section{R preparations} Loading required packages and setting a \pkg{ggplot2} theme for images. % \begin{Schunk} \begin{Sinput} R> library("GET") R> library("spatstat") R> library("spatstat.linnet") R> library("spatstat.Knet") R> library("ggplot2") R> library("parallel") R> theme_set(theme_bw(base_size = 9)) \end{Sinput} \end{Schunk} % \section{Data} \citet{MrkvickaEtal2023} worked with the database of road crashes reported to the Police in the Czech Republic from 1 January 2016 to 31 December 2020. Here we show the methodology for a subpattern of this full data set. The \pkg{GET} package provides a data object \code{roadcrash} that has 7700 road crashes lying on a linear network with 269 vertices and 354 lines. Load the road crash data from \pkg{GET}: % \begin{Schunk} \begin{Sinput} R> data("roadcrash") R> win <- owin(xrange = roadcrash$xrange, yrange = roadcrash$yrange) R> X <- ppp(x = roadcrash$x, y = roadcrash$y, window = win) R> Vertices.pp <- ppp(x = roadcrash$Vertices.x, y = roadcrash$Vertices.y, window=win) R> L <- linnet(vertices=Vertices.pp, edges = roadcrash$Edges) R> PPfull <- lpp(X, L) R> roadcrash$Traffic <- im(roadcrash$Traffic, xrange = roadcrash$xrange, yrange = roadcrash$yrange) R> roadcrash$ForestDensity <- im(roadcrash$ForestDensity, xrange = roadcrash$xrange, yrange = roadcrash$yrange) R> roadcrash$BuildingDensity <- im(roadcrash$BuildingDensity, xrange = roadcrash$xrange, yrange = roadcrash$yrange) \end{Sinput} \end{Schunk} % A part of the analysis, as will be described below, uses a subset of the \code{roadcrash} data, because the computations of inhomogeneous $K$-function and density can be rather computational. Here we define the subwindow and plot the pattern living in the subwindow. % \begin{Schunk} \begin{Sinput} R> subwin <- owin(c(-760000, -740000), c(-1160000, -1140000)) R> plot(PPfull[, subwin], main="Road crashes: subpattern") \end{Sinput} \end{Schunk} \includegraphics{HotSpots-data_crashes_thin} % \citet{MrkvickaEtal2023} had a total of 9 spatially defined covariates. In our example here and available in \code{roadcrash} in \pkg{GET} are three covariates, namely average traffic volume (number of vehicles per 24 hours), forest density and building density in the cell. The following plots show these covariates in the subwindow defined above. % \begin{Schunk} \begin{Sinput} R> par(mfrow=c(1,3)) R> plot(roadcrash$Traffic[subwin], main="Traffic") R> plot(roadcrash$ForestDensity[subwin], main="Forest density") R> plot(roadcrash$BuildingDensity[subwin], main="Building density") \end{Sinput} \end{Schunk} \includegraphics{HotSpots-data_covariates_subwin} % \section{Non-parametric intensity estimate} A non-parametric density estimate of the point pattern on a linear network can be obtained using the function \fct{density.lpp} of the \pkg{spatstat} package. A parameter \code{sigma} must be provided in this function. It is the same parameter \code{sigma} that was already discussed above in Section 1.3, i.e., it determines the bandwidth of the kernel used in the inhomogeneous intensity estimation. The argument \code{distance} specifies what type of kernel to use in the linear network. In our hotspot detection, we use here a two-dimension kernel specified by \code{distance="euclidean"} because computation of the density with this kernel is relatively fast. We set the smoothing bandwidth \code{sigma} that small that two roads are very unlikely closer than two times \code{sigma} apart from each other. Thus, the intensity estimate at a certain location on the linear network is computed merely from the crashes at that location. % \begin{Schunk} \begin{Sinput} R> densi <- density.lpp(PPfull, sigma = 250, distance="euclidean") R> densi2 <- density.lpp(PPfull[, subwin], sigma = 250, distance="euclidean") R> par(mfrow=c(1,3)) R> plot(densi, main="Intensity of crashes: full window") R> plot(densi2, main="Intensity of crashes: subwindow") \end{Sinput} \end{Schunk} \includegraphics{HotSpots-data_density} % \section{Fitting the inhomogeneous Poisson process} The simplest point process model for road crashes is the (inhomogeneous) Poisson process with intensity \begin{equation} \rho_\beta(u)=\kappa \exp (z(u) \beta^T), \ u \in L, \label{loglin} \end{equation} where $L$ is a linear network, $z=(z_1, \ldots , z_k)$ is a vector of covariates and $\beta=(\beta_1, \ldots , \beta_k)$ is a regression parameter. This process can be fitted using the \pkg{spatstat} package. We fit the model using the full \code{roadcrash} data. The function \fct{pois.lppm} both fits the model for the intensity as well as provides predicted point process intensity and the inhomogeneous $K$-function estimated from a given point pattern \code{PP} using the estimated intensity surface. (This is rather fast, taking a bit more than 10 seconds on a normal laptop; system.time is used to take the time only.) % \begin{Schunk} \begin{Sinput} R> myformula <- PP ~ Traffic + ForestDensity + BuildingDensity R> system.time( Poi <- pois.lppm(PP=PPfull, formula=myformula, data=roadcrash) ) \end{Sinput} \begin{Soutput} user system elapsed 12.83 0.05 12.92 \end{Soutput} \end{Schunk} % Both the predicted point process intensity and the inhomogeneous $K$-function with theoretical Poisson $K$ function can be plotted: % \begin{Schunk} \begin{Sinput} R> par(mfrow=c(1,2)) R> plot(Poi$EIP, main="Predicted intensity") R> plot(Poi$secondorder, main="Inhomog. K function") \end{Sinput} \end{Schunk} \includegraphics{HotSpots-data_poisson_EIP} % Here the inhomogeneous $K$-function estimated from the data lies above the theoretical line for the Poisson process and suggests clustering of points. \section{Fitting the Matern cluster process on a linear network} \citet{MrkvickaEtal2023} considered instead of the Poisson process the Matern cluster point process with inhomogeneous cluster centers. This process is more suitable for clustered data. It can be estimated in two steps according to its construction following \citet{MrkvickaEtal2014}. In first step, the first order intensity function is estimated through Poisson likelihood. This was done above, i.e., the object \code{EIP} contains the estimated intensity. In second step, the second order interaction parameters $\alpha$ (mean number of points in a cluster) and $R$ (cluster radius) are estimated through minimum contrast method. Unfortunately, working with cluster processes on linear networks is rather consuming and therefore they are currently not covered by the \pkg{spatstat} package. Thus, we have used the inhomogeneous $K$-function and the minimum contrast and grid search methods to find the optimal parameters. We implemented functions for simulating the Matern cluster process on a linear network \code{LL} with pre-specified centers (function \code{rMatClustlpp}) and for fitting the Matern cluster process on a point pattern on a linear network (function \code{MatClust.lppm}). In the procedure, we estimate the parameters $R$ and $\alpha$ of the Matern cluster process using the inhomogeneous $K$-function (with the estimated Poisson process intensity). We consider a range of possible values of the parameters $R$ and $\alpha$. For each value of $R$ and $\alpha$, we compute the difference of the observed $K$-function from the "theoretical" $K$-function of the model, computed from the average of \code{nsim} (by default 10) simulation from the model. Simulations are used, because the theoretical $K$-function is not known for the Matern cluster process on the linear network. We then find out which of these possible values of parameters $R$ and $\alpha$ lead to the smallest difference between the observed and "theoretical" $K$-functions. Remark here that the first order structure is estimated below from the full pattern \code{PPfull} (provided to the \fct{MatClust.lppm} function in the argument \code{PP}), whereas the second order structure is estimated from the pattern observed in the subwindow that is provided in the argument \code{subwin}. The second order structure has a limited range; therefore, estimating it only from the subpattern is useful for time saving. % % % \begin{Schunk} \begin{Sinput} R> valpha <- seq(5, 30, by=5) R> vR <- seq(250, 2500, by=500) R> myformula <- PP ~ Traffic + ForestDensity + BuildingDensity R> system.time( # Took about 1,2 minutes on a laptop MatCl <- MatClust.lppm(PP=PPfull, formula=myformula, subwin=subwin, valpha=valpha, vR=vR, data=roadcrash, ncores = 1) ) \end{Sinput} \begin{Soutput} user system elapsed 65.30 1.67 67.21 \end{Soutput} \end{Schunk} % These results can be viewed, by plotting the observed $K$ (solid line) and theoretical Poisson line (dashed line) and adding the K-function of the estimated Matern cluster process estimated from \code{nsim} (here 10) simulations. % \begin{Schunk} \begin{Sinput} R> # The observed K, and theoretical Poisson line R> plot(MatCl$secondorder, main="Inhomog. K function") R> # The Matern Cluster process K from nsim (here 10) simulations R> # with chosen values of alpha and R R> lines(x=MatCl$secondorder$r, y=MatCl$MCsecondorder, col=3) \end{Sinput} \end{Schunk} \includegraphics{HotSpots-data_MatClust_param_sim} % The chosen parameter values are given in \code{alpha} and \code{R}. % \begin{Schunk} \begin{Sinput} R> MatCl$alpha \end{Sinput} \begin{Soutput} [1] 30 \end{Soutput} \begin{Sinput} R> MatCl$R \end{Sinput} \begin{Soutput} [1] 1750 \end{Soutput} \end{Schunk} % \section{False discovery rate envelopes} To find the hotspots of road crashes that are not explained by the covariates, we first generate \code{nsim} simulations from the fitted Mat{\'e}rn cluster process and estimate the intensity for each of the simulated patterns. We note that we estimate the intensity here similarly as above for the observed pattern. This computation takes a bit of time (using 4 cores on a normal laptop took about 25-30 minutes). % \begin{Schunk} \begin{Sinput} R> nsim <- 10000 R> system.time( res <- hotspots.MatClustlpp(PP=PPfull, formula=myformula, R=MatCl$R, alpha=MatCl$alpha, data = roadcrash, sigma=250, nsim=nsim, ncores=4) ) \end{Sinput} \end{Schunk} % % % The FDR envelope \citep{MrkvickaMyllymaki2023} is computed within \fct{hotspots.MatClustlpp} using the function \fct{fdr\_envelope} of the \pkg{GET} package. Because we are only interested in locations where the intensity is higher than expected, the test is done alternative to \code{"greater"}. % \begin{Schunk} \begin{Sinput} R> plot(res) + scale_radius(range = 0.5 * c(1, 6)) \end{Sinput} \end{Schunk} \includegraphics{HotSpots-fdrenvelope} % The size of the cluster is indicated by the circles. The circle radius is proportional to the size of the deviation of the observed intensity from the upper bound of the FDR envelope divided by the difference of the upper FDR envelope and the centre of the envelope. Thus, the size is a measure of relative exceedance. \section*{Acknowledgements} We thank Mikko Kuronen for helping to make the code faster. %% -- Bibliography ------------------------------------------------------------- \bibliography{GETbibfile} \end{document}