Why a post about HMC on this blog?
To illustrate the range of things you can do with TensorFlow Probability (TFP), we had begun showing examples of how to fit hierarchical models, using one of TFP’s joint distribution classes together with Hamiltonian Monte Carlo. Those posts could stand on their own technically, but we deliberately left the “math part” out of the discussion.
A short blog post cannot give a comprehensive introduction to Bayesian modeling and Markov Chain Monte Carlo, and there are plenty of excellent texts that already do (see, e.g., Kruschke 2010; MacKay 2002), so we assume some prior familiarity. Instead, we concentrate on the recent workhorses of Bayesian inference, Hamiltonian Monte Carlo and the No-U-Turn Sampler (NUTS), with the goal of making the core concepts accessible rather than exhaustive.
Think of this post as a glossary of the key terms, each with a bit of story attached.
So what’s it for?
Sampling, or Monte Carlo, methods are used when we want to generate samples from a distribution that we cannot write down in closed form. Mostly, we are not interested in the samples themselves; what we want are summaries computed from them, such as the mean and variance, that characterize the underlying distribution.
Which distribution? In the applications we are talking about, it is a joint distribution: a probabilistic model meant to capture some underlying reality. Let’s start with the most basic case imaginable.
There, the joint distribution consists of just a single Poisson distribution, used to model, for example, the number of comments in a code review. We also have information, namely data: the comment counts actually observed.
Which parameter of the Poisson should we choose, so as to maximize the likelihood of this information? Framed like that, with no prior distribution over the parameter in question, our reasoning is not Bayesian at all. To stay Bayesian, we equip the model’s parameters with (informative) prior distributions.
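To make this concrete, here is a minimal sketch of such a joint distribution – prior plus likelihood – in TFP’s Python API. The Gamma prior and its parameter values are illustrative assumptions, not taken from the post.

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# A joint distribution in its simplest form: a prior over the Poisson rate,
# and a Poisson likelihood for an observed comment count.
# (The Gamma prior and its parameters are illustrative choices.)
model = tfd.JointDistributionSequential([
    tfd.Gamma(concentration=2.0, rate=0.5),   # prior on the rate parameter
    lambda rate: tfd.Poisson(rate=rate),      # likelihood of a count, given the rate
])

rate_sample, count_sample = model.sample()    # draw a (rate, count) pair from the joint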
Now, this being a joint distribution – and in realistic models there will be several parameters, μ, σ, and ρ, say – what we want to determine are those parameter values, in light of the data.
More precisely, what really interests us is the posterior distribution of the parameters, given the data.
For distributions of real-world complexity, that posterior cannot be computed in closed form; sampling methods are how we get at it anyway. What we would like to point out instead is the following: in the upcoming discussions of sampling, HMC & co., it is very easy to lose sight of what is being sampled. Always keep in mind that what we sample are not the data, but parameters – draws from the posterior distribution of the parameters we are interested in.
Sampling
Sampling procedures generally consist of two steps: generating a candidate (the “proposal”) and deciding whether to keep or discard it. Intuitively, given data we have measured and a model that is supposed to explain them, the decision part sounds like the easier one: we want to assess how likely the data are under the hypothetical parameter values. To get us started, though, consider the following.
Conceptually simple strategies exist for generating samples from a distribution not known in closed form, provided its unnormalized probabilities can be evaluated and the dimensionality of the problem is low. Concise portraits of such strategies – uniform sampling, importance sampling, and rejection sampling – can be found in MacKay (2002); they are hardly ever used in MCMC software, though, being inefficient and unsuited to high dimensions. Before HMC came to dominate that kind of software, two other algorithms were the usual choices: the Metropolis(-Hastings) algorithm and Gibbs sampling. Both are well and understandably explained in the standard references, so we point readers there. Compared to them, HMC has been shown to be far more efficient: those algorithms exhibit random-walk behavior, with each proposal built on the current state, which leads to highly correlated samples and slow exploration of the state space.
HMC
Compared to the traditional, random-walk-based methods, Hamiltonian Monte Carlo (HMC) is a lot more efficient. Unfortunately, it is also a lot harder to grasp. There are at least three languages one can use to describe an algorithm: math; code (or pseudo-code), which may or may not stay close to the math; and a third one that spans the range from very abstract to very concrete, even visual: analogy. For me personally, the physical analogies around HMC, captivating as they are, yielded less intuition than the equations and the code did. For readers well versed in physics, statistical mechanics, or differential geometry, the experience will probably be quite different.
Still, physical analogies make for the nicest start, so that is where we begin.
Physical analogies
The classic physical analogy is the one given in the reference article by Radford Neal, “MCMC using Hamiltonian dynamics”, and it goes like this.
There is a “thing” we want to maximize: the log probability of the data under the model’s parameters – equivalently, we want to minimize the negative log probability. That quantity to be optimized can be pictured as an object sliding over a landscape of hills and valleys, and, just as with gradient descent in deep learning, we would like it to end up deep down in some valley.
In Neal’s own words:
In two dimensions, we can visualize the dynamics as that of a frictionless puck that slides over a surface of varying height. The state of this system consists of the position of the puck, given by a 2D vector q, and the momentum of the puck (its mass times its velocity), given by a 2D vector p.
When you hear “momentum” – especially in the context of deep learning – it probably sounds familiar, even though the analogies do not quite align. In deep learning, momentum is commonly credited with avoiding useless oscillations in imbalanced optimization landscapes, thus allowing the loss surface to be explored more efficiently.
With HMC, however, the focus lies on a different concept: that of energy.
The probability of the system being in any given state is inversely, and exponentially, related to that state’s energy. The temperature that enters this relationship will shortly be set to one, simplifying the calculations that follow.
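In symbols – a standard way of writing this relationship, with E denoting the energy of a state and T the temperature:

$$P(\textrm{state}) \propto e^{-\,E(\textrm{state})/T}$$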
From school physics, most of us remember that energy comes in two forms: potential energy and kinetic energy. For a sliding object, its potential energy is determined by its position (its height, in the landscape picture), while its kinetic energy depends on its momentum, the product of its mass and velocity. The dynamics relating the two are governed by Newtonian mechanics.
Without kinetic energy, the object would just slide downhill until the pull of gravity is balanced by the rising terrain, and there it would stop. With momentum, though, it can keep going uphill for a while – just as, when you are speeding along on a bike, momentum lets you crest a short hill without pedaling.
So that is kinetic energy. The other component, potential energy, corresponds to the very thing we want to know: the negative log probability of the parameters we are after.
The trick of HMC is to augment the state of interest – the vector of posterior parameters – by a momentum vector, which makes the exploration more efficient; once sampling is done, the momenta are simply discarded. This step is nicely explained by Ben Lambert in his video on the topic.
In his notation and exposition, the total energy of a state, comprising the potential and kinetic contributions, reads as follows.
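A sketch of that formula (our reconstruction in Lambert’s spirit, not a verbatim quote; θ stands for the parameters, m for the momentum, D for the data, and “mass” for the mass of the fictitious particle):

$$E(\theta, m) = U(\theta) + K(m), \qquad U(\theta) = -\log\big[\,p(D \mid \theta)\, p(\theta)\,\big], \qquad K(m) = \frac{m^T m}{2 \cdot \textrm{mass}}$$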
By the relationship stated above, the corresponding probability then is the negative exponential of this energy, divided by the temperature.
Setting the temperature T to one and assuming unit mass, this simplifies nicely.
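Concretely – again our transcription rather than a quote – the joint probability factorizes into the unnormalized posterior and a standard-normal kernel in the momentum:

$$P(\theta, m) \propto p(D \mid \theta)\, p(\theta)\, e^{-\frac{m^T m}{2}}$$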
In this form, the momentum is seen to be standard-normally distributed, independently of the parameters. That means we can marginalize it out in the end, and the samples we keep – those of the parameters – follow the posterior distribution we are interested in.
Keeping that in mind, here is how the overall procedure looks. At each step, we
- sample a new momentum from its marginal distribution – which, the two being independent, is the same as its conditional distribution given the parameters – and
- determine the particle’s new position, that is, propose new parameter values. This is where Hamiltonian dynamics comes into play.
Hamilton’s equations (equations of motion)
To avoid confusion further down, let us now follow the paper and switch to Radford Neal’s notation.
Hamiltonian dynamics operates on a d-dimensional position vector, q, and a d-dimensional momentum vector, p. The state space is described by the Hamiltonian, a function of q and p:
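Following the definitions above (the typesetting is ours; unit mass assumed for the kinetic term):

$$H(q, p) = U(q) + K(p), \qquad K(p) = \frac{p^T p}{2}$$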
Here, U(q) is the potential energy (called U above as well), and K(p) is the kinetic energy, now written as a function of the momentum p.
How q and p evolve over time is determined by the partial derivatives of the Hamiltonian, via Hamilton’s equations:
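For each component i = 1, …, d, these take the standard form

$$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}$$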
How can we solve this system of differential equations? The basic workhorse of numerical integration is Euler’s method: time, the independent variable, advances in discrete steps, and a new value of each dependent variable is computed by adding its derivative, scaled by the step size, to the current value. For the Hamiltonian system, doing this one equation after the other looks like a two-step procedure.
First, a new momentum is computed from the current position and the step size; then, a new position is computed, again using the step size, from the current momentum.
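Written out for a single step of size ε and unit mass (our transcription, in Neal’s notation):

$$p(t + \epsilon) = p(t) - \epsilon\, \frac{\partial U}{\partial q}\big(q(t)\big), \qquad q(t + \epsilon) = q(t) + \epsilon\, p(t)$$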
This procedure could be improved by already using, in the second step, the momentum newly computed in the first; but we may as well skip ahead to the method modern software actually relies on: the leapfrog method.
Leapfrog algorithm
With that, we have arrived at the second magic word. Unlike with “Hamiltonian”, though, there is a lot less mystique here: the leapfrog method is simply a more stable and efficient way of performing the numerical integration.
It rearranges the procedure above, splitting the momentum update of step one into two halves, one performed before and one performed after the position update.
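In Neal’s notation (again our transcription, unit mass assumed), a single leapfrog step of size ε reads:

$$p(t + \tfrac{\epsilon}{2}) = p(t) - \tfrac{\epsilon}{2}\, \frac{\partial U}{\partial q}\big(q(t)\big)$$

$$q(t + \epsilon) = q(t) + \epsilon\, p(t + \tfrac{\epsilon}{2})$$

$$p(t + \epsilon) = p(t + \tfrac{\epsilon}{2}) - \tfrac{\epsilon}{2}\, \frac{\partial U}{\partial q}\big(q(t + \epsilon)\big)$$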
Each update thus uses the most recently computed value of the variable it depends on. In practice, a number of leapfrog steps are executed before a proposal is made; in that case, step 3 of one iteration and step 1 of the next can be combined into a single full-step momentum update.
Proposal – that keyword brings us back to the overall plan. Hamilton’s equations and the leapfrog integrator have given us a proposal for a new parameter value, which now has to be accepted or rejected. The way that decision is made is the Metropolis algorithm mentioned above; it is not specific to HMC, so we cover it only briefly.
Acceptance: Metropolis algorithm
Under the Metropolis algorithm, proposed new values for q and p are accepted with the following probability.
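With q*, p* denoting the proposed state and q, p the current one, the standard form of this acceptance probability, written in terms of the Hamiltonian, is

$$\min\Big[\,1,\ \exp\big(-H(q^*, p^*) + H(q, p)\big)\Big]$$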
That is, if the proposed parameters yield a higher probability (a lower energy), they are always accepted; otherwise, they are accepted with a probability given by the ratio of new to old probabilities.
In theory, energy being conserved under Hamiltonian dynamics, every proposal would be accepted; in practice, the error introduced by numerical integration means that acceptance rates can fall below one.
HMC in a few lines of code
With all these concepts and formulas, it is easy to lose sight of the overall flow of the algorithm. Helpfully, Radford Neal’s paper contains actual code as well, which lays out the complete procedure in one place.
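To convey that flow here, the following is a minimal Python/NumPy sketch of a single HMC iteration in the same spirit – leapfrog integration followed by a Metropolis accept/reject step. It is our own illustration, not a reproduction of Neal’s listing; the names U, grad_U (negative log posterior and its gradient), epsilon, and n_leapfrog are assumptions of this sketch.

```python
import numpy as np

def hmc_step(U, grad_U, epsilon, n_leapfrog, current_q, rng=None):
    """One HMC iteration: leapfrog proposal, then Metropolis accept/reject."""
    rng = rng or np.random.default_rng()
    q = current_q.copy()
    p = rng.standard_normal(q.shape)        # fresh momentum, standard normal
    current_p = p.copy()

    # Leapfrog integration: half step for momentum, alternating full steps
    # for position and momentum, final half step for momentum.
    p = p - 0.5 * epsilon * grad_U(q)
    for i in range(n_leapfrog):
        q = q + epsilon * p                 # full step for the position
        if i < n_leapfrog - 1:
            p = p - epsilon * grad_U(q)     # full step for the momentum
    p = p - 0.5 * epsilon * grad_U(q)
    p = -p                                  # negate momentum to make the proposal symmetric

    # Metropolis step: accept with probability min(1, exp(H_current - H_proposed)).
    current_H = U(current_q) + 0.5 * np.sum(current_p ** 2)
    proposed_H = U(q) + 0.5 * np.sum(p ** 2)
    if rng.uniform() < np.exp(current_H - proposed_H):
        return q                            # accept: move to the proposed position
    return current_q                        # reject: stay where we are
```

Calling hmc_step repeatedly, each time feeding the returned position back in, yields the chain of parameter samples.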
Hopefully, seeing the whole loop in one place makes the procedure more tangible. One magic word, though, has not come up yet: NUTS. What, or who, is NUTS?
NUTS
NUTS is a fairly recent algorithm designed to address one of the main practical difficulties in applying Hamiltonian Monte Carlo: choosing the number of leapfrog steps to execute before making a proposal. The acronym stands for No-U-Turn Sampler, referring to its avoidance of the U-turn-shaped trajectories that arise in the landscape when that number is chosen too high.
The reference paper by Hoffman & Gelman additionally describes a solution to a related difficulty: choosing the step size ε. The respective algorithm, dual averaging, adapts the step size during a warmup phase.
Whether these names quite match their usual meanings in computer science is a nuance we will let stand, leaving readers who want the details to the referenced paper.
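Tying this back to TFP: in TFP’s Python API, NUTS and step-size adaptation are available as MCMC kernels. Below is a rough sketch of how they might be wired together; the stand-in target density, the number of steps, and the initial step size are illustrative assumptions, not values from the post.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Stand-in target: the log posterior of the parameters we want to sample
# (here simply a standard normal, for illustration).
def target_log_prob_fn(theta):
    return -0.5 * tf.reduce_sum(theta ** 2)

nuts = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=target_log_prob_fn,
    step_size=0.1)

# Dual averaging, as described by Hoffman & Gelman, adapts the step size
# during burn-in. The getter/setter functions tell the adapter where the
# NUTS kernel keeps its step size and acceptance statistics.
adaptive_nuts = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=nuts,
    num_adaptation_steps=400,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
    step_size_getter_fn=lambda pkr: pkr.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio)

samples, _ = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=tf.zeros([4]),
    kernel=adaptive_nuts)
```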
Let’s wrap up with one more conceptual analogy: the satellite that Michael Betancourt, in his paper, successfully launches into orbit.
How to avoid a crash
Betancourt’s article is an exemplary piece of writing; the passages quoted below are meant as teasers, whetting the appetite for reading the paper in full.
In real-world applications, a primary challenge is the high dimensionality of parameter space, and we need a strategy for dealing with it. In high dimensions, the density is largest at the mode; but there is hardly any volume around it – think of k-nearest neighbors: the more dimensions you add, the farther away your nearest neighbor ends up.
What counts, though – in physics as well as here – is mass, the product of volume and density; and in higher-dimensional spaces, that probability mass is found ever farther away from the mode, spread over a region of its own.
This region, the typical set, is hard to find and hard to stay in. As noted above, HMC uses gradient information to move; but if it followed the gradient of the log probability alone, it would leave the typical set and end up stuck at the mode – a maximum, not the region we want to explore.
This is where momentum comes in: it counteracts the pull of the gradient and keeps the Markov chain within the typical set. Here is the satellite analogy, in Betancourt’s own words:
Instead of reasoning about a mode, a gradient, and a typical set, we can equivalently reason about a planet, a gravitational field, and an orbit (Figure 14). The probabilistic endeavor of exploring the typical set then becomes the physical endeavor of placing a satellite in a stable orbit around the hypothetical planet. Because these are just two different perspectives on the same mathematical system, they will suffer from the same pathologies. Indeed, if we place a satellite at rest out in space, it will fall into the gravitational field and crash into the surface of the planet, just as naive gradient-driven trajectories crash into the mode. From either the probabilistic or the physical perspective, we are left with a catastrophic outcome.
The physical picture, however, suggests an immediate solution: although objects at rest will crash into the planet, we can maintain a stable orbit by endowing our satellite with enough momentum to counteract the gravitational attraction. We have to be careful, however, in how exactly we add momentum to our satellite. If we add too little momentum transverse to the gravitational field, then the gravitational attraction will be too strong and the satellite will still crash into the planet. If, on the other hand, we add too much momentum, then the gravitational attraction will be too weak to capture the satellite at all, and it will instead fly off into the depths of space.
And here is the promised image: Figure 16 from the paper.
And with this, we conclude. Hopefully you will have taken something away from this post – unless you were already familiar with all of it, in which case you probably would not have read this far. 😊
Thanks for reading!
Kruschke, John K. 2010. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. 1st ed. Orlando, FL: Academic Press, Inc.
MacKay, David J. C. 2002. Information Theory, Inference, and Learning Algorithms. New York, NY: Cambridge University Press.