ParallelKDE

ParallelKDE is a package for flexible and efficient kernel density estimation (KDE), with a strong focus on parallel implementations. Its core estimator, the GradePro Estimator described here, supports CPU and GPU acceleration (threaded and CUDA), and its designed to scale with modern hardware. While the package is centered around grid-based KDEs, it also provides extensible infrastructure to support and implement other types of estimators.

The user interface is built around a modular design that separates concerns between grids, devices, density objects and estimation routines. This allows users to easily switch estimators, control execution targets (e.g., CPU or GPU), and prototype new estimation strategies without rewriting boilerplate code.

Typical usage involves:

Instantiating an estimation in a specific device (e.g., CPU, CUDA),
This also involves defining a grid for the estimation or using a default grid,
Estimating the density with a chosen estimator,
Accessing the resulting density.

Basic Example

For example, to estimate a density on a CPU with a default grid using the GradePro Estimator, you can use:

using ParallelKDE

data = randn(1, 10000) # 1-dimensional sample of 10000 points

density_estimation = initialize_estimation(
  data,
  grid=true, # default grid
  device=:cpu,
)
estimate_density!(
  density_estimation,
  :gradepro,
)

density_estimated = get_density(density_estimation)

We can evaluate the standard normal distribution for comparison:

using Distributions

grid_coordinates = get_coordinates(get_grid(density_estimation))[1, :]
density_true = pdf.(Normal(), grid_coordinates)

which would yield a plot like this:

using Plots

p = plot(grid_coordinates, density_true, label="True Density", color=:cornflowerblue, lw=2)
plot!(p, grid_coordinates, density_estimated, label="Estimated Density", color=:firebrick, lw=2)
plot!(p, xlabel="Random Variable", ylabel="Density")

Density Estimation for Conformational Samples

Here, we exemplify the use of ParallelKDE to estimate the conformational density of alanine dipeptide via dihedral angles.

We start by downloading and reading the dataset of dihedral angles obtained from molecular dynamics trajectories ^[1] ^[2].

using Downloads
using NPZ

url = "http://ftp.imp.fu-berlin.de/pub/cmb-data/alanine-dipeptide-3x250ns-backbone-dihedrals.npz"
dest = joinpath(pwd(), "ala_dipeptide_dihderals.npz")
Downloads.download(url, dest)

ala_npz = npzread(dest)
ala = vcat(values(ala_npz)...)
# afterwards one may need to subsample the dataset to obtain uncorrelated samples.

This time around we will define the grid that we want to use for the estimation instead of using the default grid:

using ParallelKDE

phi_range, psi_range = fill(range(-π, π, length=250), 2)
dihedral_grid = initialize_grid([phi_range, psi_range], device=:cpu) # or device=:cuda

density_estimation = initialize_estimation(
  ala', grid=dihedral_grid, device=:cpu # or device=:cuda
)
estimate_density!(density_estimation, :gradepro)
estimated_density = get_density(density_estimation)

Finally, we can create a contour plot of the estimated density:

using Plots
using LaTeXStrings

p = contourf(phi_range, psi_range, estimated_density')
plot!(p, xlabel=L"$\phi$", ylabel=L"$\psi$", colorbar_title="Estimated Density")

GradePro vs Rules of Thumb

ParallelKDE also provides a Rules of Thumb Estimator for bandwidth selection based on widely used rule-of-thumb heuristics. As with the GradePro Estimator, this estimator is available in both serial and CUDA variants. In practice, the Rules of Thumb approach can be extremely fast and yields competitive accuracy when the sample distribution is close to Gaussian; however, its performance can degrade rapidly as the underlying distribution departs from Gaussianity. Now, we illustrate this trade-off by comparing results produced by both estimators on samples drawn from a bimodal distribution.

We first initialize the distribution and obtain samples from it:

using Distributions

distro = MixtureModel(
  Normal[
    Normal(-6.0, 0.2),
    Normal(0.0, 2),
  ]
)
samples = rand(distro, 1, 10000) # 1-dimensional sample of 10000 points

We can again initialize a grid and the estimations for both methods with:

using ParallelKDE

xs = range(-10, 10, length=250)

grid_bimodal = initialize_grid([xs], device=:cpu) # or device=:cuda

estimation_gradepro = initialize_estimation(
  samples, grid=grid_bimodal, device=:cpu # or device=:cuda
)
estimation_rot = initialize_estimation(
  samples, grid=grid_bimodal, device=:cpu # or device=:cuda
)

Now we can execute both estimations and calculate the true density with:

estimate_density!(estimation_gradepro, :gradepro)
density_gradepro = get_density(estimation_gradepro)

estimate_density!(estimation_rot, :rot)
density_rot = get_density(estimation_rot)

density_true = pdf.(distro, xs)

Finally, we can visualize the results with:

using Plots

p = plot(xs, density_true, lw=3, ls=:dash, label="True Distribution")
plot!(p, xs, density_gradepro, lw=2, label="GradePro Estimation")
plot!(p, xs, density_rot, lw=2, label="Rule of Thumb Estimation")
plot!(p, xlabel="Random Variable", ylabel="Density")

Usage Summary

As is exemplified above, it is possible to initialize an estimation using

ParallelKDE.initialize_estimation — Function

initialize_estimation(data; kwargs...)

Initialize a density estimation object based on the provided data.

Arguments

data::Union{AbstractMatrix,AbstractVector{<:AbstractVector}}: The data to be used for density estimation.
grid::Union{Bool,G<:AbstractGrid}=false: Whether to create a grid for the density estimation.

If true, a grid will be created based on the data ranges. A grid can also be provided directly.

grid_ranges=nothing: The ranges for the grid coordinates if grid is true.

This has priority over other grid parameters.

dims=nothing: The dimensions of the grid if grid is true.
grid_bounds=nothing: The bounds for the grid if grid is true.
grid_padding=nothing: Padding for the grid if grid is true.
device=:cpu: The device to use for the density estimation. It should be compatible with the estimator to be used.

Examples

data = randn(1, 1000);
density_estimation = initialize_estimation(data; grid=true, grid_ranges=-5.0:0.1:5.0, device=:cpu);

Then, the density can be estimated with a chosen estimator and its settings using

ParallelKDE.estimate_density! — Function

estimate_density!(density_estimation::DensityEstimation, estimation_method::Symbol; kwargs...)

Estimate the density using the specified method and update the DensityEstimation object.

For a list of available estimation methods and their keywords, see the documentation for the specific estimator.

Finally, the estimated density can be accessed using

ParallelKDE.KDEs.get_density — Method

get_density(density_estimation::DensityEstimation; normalize=false, dx=nothing)

Obtain the estimated density from a DensityEstimation object.

If the normalize argument is set to true, the density will be normalized. If density_estimation has a grid, its spacing will be used for normalization. Otherwise, dx must be provided to normalize the density.

More details regarding the currently implemented estimators as well as further information about the package can be found throughout the documentation.