Blog

A repository/blog for some of my data analysis and coding pursuits.




Baking Planets From Scratch: DCGAN - 4/01/2020


I find the recent detections of exoplanetary bodies truly fascinating. Unfortunately, with current technology, direct imaging that resolves what these planets look like is not possible. Given that we cannot actually see these exoplanets, I sometimes find myself imagining just that. That curiosity is the motivation for this blog post. Here I have used a Deep Convolutional Generative Adversarial Network (DCGAN) to ingest a data set of planetary images (well, mostly planetary!) and generate sets of entirely new planets from the features the model learns from the input data. The DCGAN consists of two separate neural networks: a generator that produces new images, and a discriminator that attempts to distinguish the real training images from the fake generated ones. As training proceeds, the generator becomes more proficient at creating convincing images and the discriminator at discerning the differences between real and fake instances. The model is successful when the generator produces images that the discriminator struggles to differentiate from the real input set.
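
To make the two-network setup concrete, here is a minimal PyTorch sketch of the standard DCGAN generator and discriminator for 64x64 RGB images. It is not necessarily the exact architecture behind the results below, just a representative layout: the generator upsamples a random latent vector with transposed convolutions, while the discriminator downsamples an image to a single real/fake score.

```python
import torch
import torch.nn as nn

latent_dim = 100  # length of the random noise vector fed to the generator

# Generator: upsample a latent vector to a 3x64x64 image with transposed convolutions
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),  # -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),         # -> 8x8
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),         # -> 16x16
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),          # -> 32x32
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),            # -> 64x64
    nn.Tanh(),  # pixel values scaled to [-1, 1], matching normalized training images
)

# Discriminator: downsample a 3x64x64 image to a single "probability of being real"
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),   # -> 32x32
    nn.Conv2d(64, 128, 4, 2, 1, bias=False),
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),                      # -> 16x16
    nn.Conv2d(128, 256, 4, 2, 1, bias=False),
    nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),                      # -> 8x8
    nn.Conv2d(256, 512, 4, 2, 1, bias=False),
    nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),                      # -> 4x4
    nn.Conv2d(512, 1, 4, 1, 0, bias=False), nn.Sigmoid(),                      # -> 1x1
)

# One round trip: generate a batch of fake planets and score them
noise = torch.randn(16, latent_dim, 1, 1)
fake_images = generator(noise)                 # shape (16, 3, 64, 64)
scores = discriminator(fake_images).view(-1)   # 16 real/fake probabilities
```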


The base model is configured to work with images at 64x64 pixel resolution, though I also modified it to work at 128x128. The input data set consists of a relatively small sample of ~350 manually downloaded planetary images (with a few interlopers thrown in for fun). Given this limited number of training examples, I expanded the training set by creating numerous copies of each image, each with a random rotation, zoom, and addition of Gaussian white noise.
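
A rough sketch of this augmentation step using Pillow and NumPy is shown below. The file name, rotation/zoom ranges, and noise level are illustrative placeholders rather than the exact values used for the results that follow.

```python
import numpy as np
from PIL import Image

def augment(img, rng):
    """Return one randomly rotated, zoomed, and noise-perturbed copy of img."""
    # Random rotation (corners are filled with black)
    img = img.rotate(rng.uniform(0, 360))

    # Random zoom: crop a central box and resize back to the original size
    w, h = img.size
    zoom = rng.uniform(0.8, 1.0)
    cw, ch = int(w * zoom), int(h * zoom)
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch)).resize((w, h))

    # Additive Gaussian white noise
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0.0, 8.0, size=arr.shape)  # noise std of ~8 grey levels (illustrative)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

rng = np.random.default_rng(0)
original = Image.open("planet_0001.jpg").convert("RGB")   # placeholder file name
copies = [augment(original, rng) for _ in range(20)]      # 20 augmented duplicates per image
```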


Below are the results achieved after 40 training epochs for both the 64x64 and 128x128 resolution models, which required a full week of computing time on my antiquated PC.


64x64 Model Results:

64x64 Model Training Movie/GIF:

128x128 Model Results:

128x128 Model Training Movie/GIF:






Interactive Heat Conduction Visualization - 9/29/2019

*All data used and code generated as part of this blog entry are available to download from my GitHub repository.

Ever wonder what heat looks like as it moves through a solid material? This interactive visualization tool, which I created with JavaScript, will show you! Just pick a material, the size of the area to be heated, and a relative temperature from the drop-down menus below, then click away on the plot. The plot will show how the heat you apply propagates through a block of your chosen material, 1 m on a side, sped up by a factor of ~30. The temperature along the boundaries of the block is held constant at a lower value, simulating heat loss to the surroundings. Note how the applied heat interacts with the boundaries and with other patches of heated material.



(Temperature color scale: COOL to HOT)


For the more mathematically/coding inclined, the plotted region represents a numerical solution of the 2-D heat conduction equation assuming a homogeneous thermal diffusivity, subject to constant-temperature Dirichlet boundary conditions. The equation is solved using a basic, fully explicit finite difference scheme on a 1 m² region, with a spatial step of 0.01 m between grid nodes and a time step equal to one half of the von Neumann stability limit.
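
The visualization itself runs in JavaScript, but the update it performs is easy to sketch in Python/NumPy. In the sketch below the thermal diffusivity value and the heated patch are illustrative placeholders; the grid spacing and the time step (half of the 2-D von Neumann stability limit, dt <= dx^2/(4*alpha)) follow the description above.

```python
import numpy as np

alpha = 1.0e-4        # thermal diffusivity (m^2/s); placeholder value
L, dx = 1.0, 0.01     # 1 m x 1 m block with 0.01 m grid spacing
n = int(L / dx) + 1

dt = 0.5 * dx**2 / (4.0 * alpha)   # half of the von Neumann stability limit

T = np.zeros((n, n))               # temperature field; edges held at 0 (Dirichlet boundaries)
T[45:56, 45:56] = 1.0              # a heated patch applied by a "click" (relative temperature)

def step(T):
    """Advance one fully explicit (FTCS) time step; boundary rows/columns stay fixed."""
    Tn = T.copy()
    Tn[1:-1, 1:-1] = T[1:-1, 1:-1] + alpha * dt / dx**2 * (
        T[2:, 1:-1] + T[:-2, 1:-1] + T[1:-1, 2:] + T[1:-1, :-2] - 4.0 * T[1:-1, 1:-1]
    )
    return Tn

for _ in range(1000):   # each iteration advances the solution by dt seconds
    T = step(T)
```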


To generate the contour plot I used the d3-contour package for D3.js (https://observablehq.com/collection/@d3/d3-contour). While the package produces clean, appealing contour plots, it does require a coordinate-system transformation to map the finite difference grid onto the plotted coordinates. Otherwise, it is a relatively straightforward, no-frills way to create contour plots.







How odd are hot Jupiters? A binary logistic regression machine learning exercise - 4/28/2019

*All data used and code generated as part of this blog entry are available to download from my GitHub repository.

According to classical theories of solar system evolution, gas/ice giant planets are generally thought to form at greater orbital distances from their host stars than terrestrial planets. This is attributed to elevated temperatures and strong stellar photodissociation of atmospheric and nebular gases near the star, which prevent the condensation of gases and ices. However, the recent revolution in exoplanet detection (facilitated by the Kepler mission and other technologies) has revealed so-called “hot Jupiter” gas giants in orbits remarkably close to their host stars (well within the orbit of Mercury). These observations have been accounted for by invoking gas giant formation at a distance from the host star followed by inward orbital migration (e.g., the Nice and Grand Tack models). Given the expanding wealth of data collected by current exoplanet detection missions, in this exercise I have attempted to evaluate the orbital distribution of gas giant and non-gas giant planets and to build a predictive model quantifying the probability of finding “hot Jupiter” gas giants in close-in orbits.


To perform this analysis I used exoplanet data obtained from the NASA Exoplanet Archive (http://exoplanetarchive.ipac.caltech.edu). I began by developing a binary logistic regression model for a simple set of arbitrary test data with two classes (0 and 1) as a function of an independent variable (X), as shown in the plot below:

Figure 1: Arbitrary data generated to develop and test the binary logistic regression model. The data is divided between two classes (0 and 1) as a function of the independent variable (X). In this case, the data shows a tendency toward class 1 at greater values of X.
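
For readers who want to reproduce something like Figure 1, a minimal way to generate such a test set in Python is shown below; the particular distribution and parameter values are arbitrary choices, not the exact data plotted above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Independent variable and a hidden logistic relationship used to assign classes
n = 200
X = rng.uniform(0.0, 10.0, size=n)
p_true = 1.0 / (1.0 + np.exp(-1.5 * (X - 5.0)))    # class 1 becomes more likely at larger X
y = (rng.uniform(size=n) < p_true).astype(int)      # observed classes (0 or 1)
```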


To fit the logistic function to the test data, I wrote and applied three different gradient descent algorithms: batch, stochastic, and mini-batch. All three delivered comparable fits, and I opted to use the mini-batch algorithm for the final results. The short movie clip below demonstrates the process of model optimization and fitting via gradient descent:

Figure 2: (Left panel) Fit of the logistic function to the test data as a function of gradient descent iteration number. (Center panel) Logistic function parameters as a function of gradient descent iteration number. (Right panel) Log-loss/Cross-entropy of the logistic regression fit as a function of gradient descent iteration number, used to help identify an appropriate model learning rate.
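
A sketch of the mini-batch variant for the two-parameter logistic model p(class = 1 | X) = 1/(1 + e^-(b0 + b1*X)) is given below, reusing the X and y arrays from the test-data sketch above. The learning rate, batch size, and iteration count are placeholders, not the tuned values behind the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean cross-entropy of predicted probabilities p against labels y."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_minibatch(X, y, lr=0.1, batch_size=32, n_iter=5000, seed=0):
    """Fit p(class=1|x) = sigmoid(b0 + b1*x) by mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        err = sigmoid(b0 + b1 * X[idx]) - y[idx]   # gradient factor of the cross-entropy
        b0 -= lr * err.mean()
        b1 -= lr * (err * X[idx]).mean()
    return b0, b1

b0, b1 = fit_minibatch(X, y)   # X, y from the test-data sketch above
print("intercept:", b0, "slope:", b1)
print("final log-loss:", log_loss(y, sigmoid(b0 + b1 * X)))
```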

Results of one run of the binary logistic regression analysis and the fitted logistic function for the test data are shown in the figures below. The final fit produced by my model is shown alongside that obtained with the built-in scikit-learn Python function for comparison (note the close agreement at the chosen model tolerance).


Figure 3: (Left panel) Final converged fit of the logistic function to the test data compared to that obtained from the built-in scikit-learn logistic regression function. (Center panel) Logistic function parameters as a function of gradient descent iteration number illustrating convergence to the optimal values. (Right panel) Log-loss/Cross-entropy of the logistic regression fit as a function of gradient descent iteration number illustrating the error minimization process.
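
For reference, the scikit-learn side of the comparison amounts to a few lines; a large C value roughly disables its default L2 regularization so that the two fits are comparable.

```python
from sklearn.linear_model import LogisticRegression

skl = LogisticRegression(C=1e6)          # large C ~ effectively no regularization
skl.fit(X.reshape(-1, 1), y)             # X, y from the test-data sketch above

print("scikit-learn intercept:", skl.intercept_[0])
print("scikit-learn slope:    ", skl.coef_[0, 0])
print("my fit:                ", b0, b1)  # from the mini-batch sketch above
```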

To assess the accuracy of the fitted binary logistic regression model I employed a Monte Carlo cross-validation technique: randomly exclude 10% of the test data set, fit the model to the remaining 90%, then evaluate the model's accuracy in predicting the classes of the excluded 10%. Prediction accuracy was assessed by assuming a decision surface at P(class = 1|X) = 0.5, with all data points yielding probabilities exceeding this value assigned to class 1. This operation was repeated over 25 iterations to yield a mean cross-validation score. For the test data, the mean Monte Carlo cross-validation score was 0.89 with a standard deviation of 0.09.
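
In code, the Monte Carlo cross-validation loop looks roughly like this (reusing the fitting helpers from the sketches above):

```python
import numpy as np

def monte_carlo_cv(X, y, n_splits=25, holdout_frac=0.10, seed=1):
    """Repeatedly hold out 10% of the data, fit on the rest, and score the holdout."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        n_hold = int(holdout_frac * len(X))
        test, train = idx[:n_hold], idx[n_hold:]
        b0, b1 = fit_minibatch(X[train], y[train])                 # from the mini-batch sketch above
        y_pred = (sigmoid(b0 + b1 * X[test]) > 0.5).astype(int)    # decision surface at P = 0.5
        scores.append(np.mean(y_pred == y[test]))
    return np.mean(scores), np.std(scores)

mean_score, std_score = monte_carlo_cv(X, y)
print(f"Monte Carlo CV score: {mean_score:.2f} +/- {std_score:.2f}")
```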


Having developed and validated the binary logistic regression model, I next applied it to the exoplanet data of interest. I began by exploring and filtering the exoplanet data set. In particular, I removed data points exhibiting any of the following characteristics: (1) unrealistically high densities (>10,000 kg/m³) unattainable for common planetary building materials; (2) unrealistically low densities (<100 kg/m³) requiring unreasonably high porosities or diffuse planetary bodies; (3) recorded density values that did not match those computed from the measured radius and mass. Filtering the data according to these criteria excluded all data points above the maximum and below the minimum density lines shown in the plot below, as well as those with mismatched recorded densities.

Figure 4: Log-log plot of exoplanet radius vs. exoplanet mass colored by density. This plot illustrates a portion of the data exploration and filtration process. All data points above the maximum density line and below the minimum density line exhibiting unrealistic densities were removed prior to data analysis. Additionally, all data points with recorded density values different from those computed from the measured mass and radius were excluded (e.g., the blue data points above the maximum density line).
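
The filtering step can be written as a short pandas operation, sketched below. The file name, the column names (mass_earth, radius_earth, density), and the tolerance used to flag mismatched densities are placeholders, not the actual NASA Exoplanet Archive column names or the exact thresholds used here.

```python
import numpy as np
import pandas as pd

M_EARTH = 5.972e24   # kg
R_EARTH = 6.371e6    # m

df = pd.read_csv("exoplanets.csv")   # placeholder export from the NASA Exoplanet Archive

# Density recomputed from the measured mass and radius (kg/m^3)
mass_kg = df["mass_earth"] * M_EARTH
radius_m = df["radius_earth"] * R_EARTH
df["density_calc"] = mass_kg / (4.0 / 3.0 * np.pi * radius_m**3)

keep = (
    (df["density_calc"] < 10_000)                               # drop unrealistically dense planets
    & (df["density_calc"] > 100)                                # drop unrealistically diffuse planets
    & np.isclose(df["density_calc"], df["density"], rtol=0.25)  # recorded density must roughly match
)
filtered = df[keep]
```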

The filtered exoplanet data were then classified as gas giant or non-gas giant according to the following criteria (a short code sketch of this rule follows):


Gas giant if: Planet mass > 15 Earth masses & Planet radius > 3 Earth radii & Mean planet density < 3,000 kg/m³

Else: Not gas giant
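
Expressed against the placeholder columns from the filtering sketch above, this rule is a single boolean mask:

```python
is_gas_giant = (
    (filtered["mass_earth"] > 15)           # > 15 Earth masses
    & (filtered["radius_earth"] > 3)        # > 3 Earth radii
    & (filtered["density_calc"] < 3_000)    # mean density < 3,000 kg/m^3
)
filtered = filtered.assign(gas_giant=is_gas_giant.astype(int))   # class 1 = gas giant
```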


Following the classification of the data, fitting the binary logistic regression model yielded the results shown below:


Figure 5: (Left panel) Final converged fit of the logistic function to the classified exoplanet data (where class = 1 corresponds to gas giant planets) compared to that obtained from the built-in scikit-learn logistic regression function. (Center panel) Logistic function parameters as a function of gradient descent iteration number illustrating convergence to the optimal values. (Right panel) Log-loss/Cross-entropy of the logistic regression fit as a function of gradient descent iteration number illustrating the error minimization process.

Monte Carlo cross-validation returned a mean score of 0.63 and a standard deviation of 0.03. Given the classification criteria applied above and the current catalogue of exoplanet data, the model predicts that it is not unusual for gas giant planets to be found at close orbital distances from their host stars.


However, accounting for the large errors known to plague the exoplanet data by eliminating the density criterion, and instead classifying as a gas giant any planet exceeding the mass and radius criteria outlined above, yields the following results:


Figure 6: (Left panel) Final converged fit of the logistic function to the classified exoplanet data (where class = 1 corresponds to gas giant planets) compared to that obtained from the built-in scikit-learn logistic regression function. (Center panel) Logistic function parameters as a function of gradient descent iteration number illustrating convergence to the optimal values. (Right panel) Log-loss/Cross-entropy of the logistic regression fit as a function of gradient descent iteration number illustrating the error minimization process.

For this data set, Monte Carlo cross-validation returned a mean score of 0.71 and a standard deviation of 0.02. This clearly different trend indicates a preferential occurrence of more massive and voluminous planets at greater orbital distances. It is worth noting, however, the influence of detection bias: with current technologies, large planets are much more readily detected at large semi-major axis values than small planets. Therefore, to obtain more representative results, it will be important to repeat this analysis as technologies improve and the current catalogue of exoplanet data is expanded and refined.


An important final point is the significant difference in performance/efficiency between the binary logistic model developed in this exercise and the built-in scikit-learn Python logistic regression function. Applying the scikit-learn function to any given data set is several orders of magnitude faster. Thus, for the purpose of rapidly fitting a model and developing a classification function, the scikit-learn function is clearly preferable. However, the intention of the model developed in this blog entry is not to provide a hyper-efficient logistic regression function. Rather, it is intended as a teaching tool providing a clear, step-by-step presentation of each mathematical component of a complete logistic regression model translated into intuitive code.


A potentially interesting avenue of future work with this data set could involve applying clustering algorithms to predict planetary mass or radius when one or the other is known (along with the semi-major axis distance). This potential is depicted in the plot below, which illustrates the clustered relationships between planetary mass, radius, and semi-major axis distance. Of particular use would be the ability to predict planetary mass given radius and semi-major axis distance, as direct measurement of planetary mass is considerably more challenging.

Figure 7: Log-log plot of exoplanet semi-major axis distance vs. mass, colored by radius. This plot illustrates potential clustering relationships that could be used to predict parameters such as exoplanet mass when radius and semi-major axis distance are known.







Compound Interest Calculator - 2/13/2019

*The complete code for this calculator can be found on my newly created GitHub repository.

A calculator to predict the value of an investment over time given an initial principal, a time horizon, regular contributions or withdrawals, and a range of possible interest rates.

I wrote this nifty calculator using JavaScript and the Chart.js library, and I feel it provides more functionality and insight than others I have seen online. In an initial version I attempted to plot the calculated data by drawing directly onto an HTML canvas, but found that approach a bit clunky. I quickly found the streamlined nature and simple API of Chart.js far preferable!
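
The live calculator runs in JavaScript, but the underlying recurrence can be sketched in a few lines of Python. The convention below (annual compounding with the contribution or withdrawal applied at the end of each year) is one common choice and an assumption on my part, not necessarily the exact scheme used by the calculator.

```python
def investment_value(principal, years, annual_contribution, rate_pct):
    """Year-by-year balance assuming annual compounding and year-end contributions."""
    balance = principal
    history = [balance]
    for _ in range(years):
        balance *= 1.0 + rate_pct / 100.0   # one year of interest
        balance += annual_contribution      # contribution (+) or withdrawal (-)
        history.append(balance)
    return history

# One curve per interest rate, mirroring the range of rates the calculator plots
for rate in (2.0, 4.0, 6.0, 8.0):
    print(rate, round(investment_value(10_000, 30, 1_200, rate)[-1], 2))
```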


Input data:

Initial Principal ($):
Time Extent (yr.):
Contributions/Withdrawals (+-$/yr.):
Min. Interest Rate (%/yr.):
Max. Interest Rate (%/yr.):