Math and Linear Algebra: Optimization

Mathematical optimization is good for a lot of things, among them finding the best variables for a model, e.g. a rotation R and a translation t for two point clouds. The way is a bit long, so we start with the 1D case, and go to 2D and then 3D.

We illustrate everything with plots from Python, more specifically Jupyter QtConsole, which you can install by sudo apt-get install python3-qtconsole jupyter-core and then start with jupyter qtconsole, where we configure SVG with %config InlineBackend.figure_format = 'svg'. You might also need to install numpy and matplotlib [doc] if not yet present.

1D Optimization

A function like x2 is 1D, and "optimization" means going to the minimum, but without knowing the function! We only know the value of the function at any place we ask.

import numpy as np; import matplotlib.pyplot as plt
f = lambda x : x ** 2.  # some "unknown" function
x = 1.5                 # some place on that function
X = np.linspace(-2, +2) # a lot of places
plt.plot(x, f(x), 'ro', X, f(X), 'b-')


So the blue function is only known very locally, e.g. at the red dot. Now we can try to walk to the minimum at (0, 0) which is luckily possible because the function is convex (in convex optimization, hammering a function until it becomes convex is a major goal). Our walk is called gradient descent, where we take an initial solution (above x = 1.5) and go "against the slope" aka the 1st derivative.

dx  = 0.1                                 # tiny variation in x
Px  = np.array([x - dx / 2, x + dx / 2])  # x-coords around x
Pfx = f(Px)                               # y-coords around f(x)
gx  = (Pfx[1] - Pfx[0]) / (Px[1] - Px[0]) # gradient around (x, f(x))

At this step, Python tells you that the slope gx is around +3, so x going right means going up, and going left means going down. We visualize the gradient/slope:

def gradient(f, x, dx): return (f(x + dx / 2) - f(x - dx / 2)) / dx
def slope(f, x, gx, r=.5): sR = np.linspace (-r, +r); sX = sR + x; sY = sR * gx + f(x); return sX, sY
plt.plot(x, f(x), 'ro', X, f(X), 'b-', *slope(f, x, gx), 'r-')


Ok, so time to update our "best solution" x with something better, by subtracting some portion of the slope.

s = 0.3        # choose a "good" step size
x = x - gx * s # go against the slope
plt.plot(x, f(x), 'ro', X, f(X), 'b-', *slope(f, x, gradient(f, x, dx)), 'r-')


So what did we just do? We assumed that f can be approximated by a linear function (the slope), then just followed that slope in reverse direction (down). This is called linearization. Finding a good step size is crucial and the subject of Levenberg-Marquardt and other algorithms. And formally, what we just did is compute argmin_x f(x), meaning "find the argument x that minimizes f(x)".

More: Calculus intuition (video series), Newton 2nd order optimization (12min video), explanatory Levenberg-Marquardt intuition.

2D Optimization

Now we enter the world of linear algebra (3Blue1Brown video series).

So much for background, now we start with a 2D function f(x,y)=x2+5*y2 and a point (x,y) that we want to move towards the minimum.

import numpy as np; import matplotlib.pyplot as plt
f = lambda x,y: (x**2) + 5*(y**2)
x, y = +2.0, -1.5          # point (x,y)
Xr = np.linspace(-3, +3)   # range of x
Yr = np.linspace(-2, +2)   # range of y
X, Y = np.meshgrid(Xr, Yr) # all x/y combinations
fig = plt.figure(); fig.clear()
ax  = fig.subplots(); ax.axis('equal')
def contour(ax, X, Y, f):
    ax.clear(); cs = ax.contour(X, Y, f(X,Y))
    ax.clabel(cs, inline=True, fontsize=8)
contour(ax, X, Y, f); ax.plot(x, y, 'ro')

... and if for some reason, the figure wont show, just type fig to redraw it.


This time, we can only determine partial derivatives:

def partial_gradient(f, x, y, dx, dy): # dx OR dy = 0
    return (f(x+dx/2, y+dy/2) - f(x-dx/2, y-dy/2)) / (dx+dy)
d = 0.1 # tiny variation in x or y
gx = partial_gradient(f, x, y, d, 0) # vary x
gy = partial_gradient(f, x, y, 0, d) # vary y

Python tells us that gx is +4 and thus right means higher; and gy is -15 and thus down means higher. Again, for minimization we should go into the opposite direction, so we show -gx, -gy and then the vector [-gx, -gy]T which indicates the direction to go.

def arrow(x, y, dx, dy, text, color, zoom=.1):
    zx, zy = dx*zoom, dy*zoom # arrow end point offset
    if not (dx == 0 and dy == 0):
        ax.arrow(x, y, zx, zy, color=color, head_width=.1, \
    if text is not None:
        sx, sy = np.sign(dx), np.sign(dy) # [-1, 0, +1]
        tx, ty = sx * 5, sy * 5 # text offset
        ha = ['right', 'center', 'left'][int(sx+1)]
        va = ['top', 'center', 'bottom'][int(sy+1)]
        ax.annotate(text, (x+zx, y+zy), (tx, ty), ha=ha, va=va, \
            textcoords='offset points', color=color, \
            bbox=dict(fc='w', ec='w', alpha=0.9, pad=1.))
contour(ax, X, Y, f); ax.plot(x, y, 'ro')
arrow(x, y, -gx, 0, '-gx', 'green')
arrow(x, y, 0, -gy, '-gy', 'blue' )
arrow(x, y, -gx, -gy, '-g', 'violet')


Note that -g does not point directly at the minimum because the dimensions are not uniformly scaled (i.e. ellipse not circle), this happens a lot with more natural parameters too. But still, -g points in the right direction, and again we go down with a "good" step size, this time in vector notation:

step_size = .08          # advanced magic
x_ = np.array([x, y])    # vector x at t=0
g_ = np.array([gx, gy])  # gradient vector
x_ = x_ - step_size * g_ # vector x at t=1

And compute the next step from here.

def partial_gradients(f, x_, d):
    n = len(x_)            # here: 2
    I = np.identity(n) * d # indexer
    return np.array([partial_gradient(f, *x_, *I[r]) for r in range(n)])
g_ = partial_gradients(f, x_, d)
contour(ax, X, Y, f); ax.plot(*x_, 'ro')
arrow(*x_, *-g_, '-g', 'violet')


... and again, we get a gradient direction that is sort-of-right in solving argmin_xy f(x, y) for us, read as "find arguments x and y that minimize f(x,y); or in vector version argmin_x_ f(x_).

2.1D Optimization, multiple Objectives

So far, we had only one objective, i.e. functions f:ℝn→ℝ1 where the target energy to minimize was exactly one value. In reality, we often have many data points that we wish to minimize against some model. Here, consider the contour plot from the last section lowered by 4, so that we can try to bring the vertexes of some triangle to the 0 level.

f = lambda x,y: (x**2) + 5*(y**2) - 4    # cuts z=0
P = np.array([[0, 0], [1, 1.2], [2, 0]]) # 3 triangle vertexes (red)
def point(x, y, text, color, marker='o'):
    ax.plot(x, y, marker=marker, color=color)
    if text is not None:
        ax.annotate(text, (x, y), (0, 5), ha='center', va='bottom', \
            textcoords='offset points', color=color, \
            bbox=dict(fc='w', ec='w', alpha=0.9, pad=1.))
def tri(P):
    ax.clear(); ax.axis('equal')
    contour(ax, X, Y, f)
    [point(*p_, "%.1f" % f(*p_), 'red') for p_ in P]
    ax.fill(*P.T, ec='r', ls='--', alpha=.5, fill=False)


One way to go is a model t=[tx, ty], i.e. a translation that shifts the triangle such that all three red vertexes (points) have a z value of 0 or near it. So our objective is argmin_t e(t, P, f) for a translation t on triangle points P evaluated with a distance function f. The energy to be minimized is e(t, P, f)=∑_{p∈P} f(p+t)2, i.e we go through all points in P, add the currently best translation t, and evaluate with f how far away that is from 0 (squared so negative f results turn into positive energy). Then we sum up all the (squared) distances; the smaller the better.

t_ = np.array([0, 0]) # initial solution
r = lambda t_, P, f: f(*(P + t_).T) # residual
e = lambda t_, P, f: np.sum(r(t_, P, f)**2) # energy
r0 = r(t_, P, f)
e0 = e(t_, P, f)

Initially e0 is roughly 33.6, which is plausible: the red vertexes have left-to-right z values of -4, +4.2 and 0 (see r0). So where to send t now? If the points p on the contour f themselves were the objective, we would compute the Jacobian matrix [ros] J (all partial derivatives) similar to before:

J = np.array([partial_gradients(f, p_, d) for p_ in P]) # df/dp; wrong objective

But we are on t (not on p anymore), minimizing the residual r, so it is not ∂f/∂p (above) but rather ∂r/∂t (below). This means we vary tx and ty, and see which effect this has on r, following Gauss-Newton with n=2 (number of translation parameters) and m=3 (number of observations = triangle vertexes); we expect J to be a 3x2 (m*n) matrix.

    ⌈ ∂r(p0)/∂tx  ∂r(p0)/∂ty ⌉
J = | ∂r(p1)/∂tx  ∂r(p1)/∂ty |
    ⌊ ∂r(p2)/∂tx  ∂r(p2)/∂ty ⌋

gr = lambda t_, d_, P, r, f : (r(t_ + d_/2, P, f) - r(t_ - d_/2, P, f)) / np.sum(d_, axis=-1)
d = 0.1 # tiny variation in x or y
gr_tx = gr(t_, np.array([d, 0]), P, r, f) # derive tx
gr_ty = gr(t_, np.array([0, d]), P, r, f) # derive ty

At this point, gr_tx (gradient of r in direction of tx) is [0, 2, 4] which is plausible because the three points P0, P1, P2 plus tx are progressively more "right on the graph"; and gr_ty (gradient of r in direction of ty) is [0, 12, 0] which is also plausible because only P1 plus ty is non-zero. Now for the Jacobian in one go:

def grs(t_, d, P, r, f):
    n, m = len(t_), len(P) # n=2 params, m=3 points
    T2 = np.tile(t_, [n*m, 1])                 # 6x2
    I2 = np.tile((np.identity(n) * d), [m, 1]) # 6x2
    P2 = P.repeat(n, axis=0)                   # 6x2
    return gr(T2, I2, P2, r, f).reshape(m, n)  # Jacobian
J = grs(t_, d, P, r, f)

    ⌈ 0  0 ⌉
J = | 2 12 |
    ⌊ 4  0 ⌋

With gradient descent (GD), we would now go:

step_size = .15 # advanced magic
g_ = np.average(J, axis=0) # gradient vector
t_ = t_ - step_size * g_   # go down
e1 = e(t_, P, f) # improved energy (GD)
Pa = np.average(P, axis=0) # triangle center
tri(P + t_)
arrow(*Pa, *t_, 'gd', 'gray', 1.)


... where e1 is 7.8 which is a lot better than the previous 33.6 energy. We averaged J over the points P, so each point had the same say in it; numerically equivalent to JT‧[⅓ ⅓ ⅓]T. Alternatively we can use the residual r as weight by going JT‧r(t_,P,f); so e.g. the happy P2 (0.0) has no influence, the unhappy P0 (-4.0) reverses, and the unhappy P0 (+4.2) plows ahead.

step_size = .01 # advanced magic
t_ = np.array([0, 0]) # reset
g_ =, P, f)) # weighted gradient
t_ = t_ - step_size * g_  # go down
e1 = e(t_, P, f) # improved energy (GD²)
tri(P + t_)
arrow(*Pa, *t_, 'gd2', 'gray', 1.)


... and e1 is 8.8 which is also much better than the previous 33.6 energy. The effective gradient g_ seems to prioritize ty more, which is somewhat plausible because of the steeper rise in y-direction. There seems to be no official name for this method (here: GD²), though with the r factor coming from deriving ½[f(x)]² [chain rule] instead of f(x) alone, we can at minimum say that least squares (instead of total variation) is involved in some capacity.

Gauss-Newton (GN) proposes another alternative step, namely (JTJ)-1JT‧r(t_,P,f) where (JTJ)-1 is supposed to approximate the Hessian (2nd derivatives) to include local curvature in some way that I am really unclear on. Also, apparently the three combined J are supposed to be the pseudo-inverse of J, whatever that implies.

step_size = 1.6 # advanced magic
t_ = np.array([0, 0]) # reset
JI = np.linalg.inv( # pseudo-inverse of J
g_ =, P, f)) # curved gradient
t_ = t_ - step_size * g_ # go down
e1 = e(t_, P, f) # improved energy (GN)
tri(P + t_)
arrow(*Pa, *t_, 'gn', 'gray', 1.)


Under Gauss-Newton, e1 is 9.3 which is again a lot better than the previous 33.6 energy, and ty seems to be favored even more. Like gradient descent, the crux is of course the step size, but regardless of optimization method there is no way around it. The full algorithm can look like this:

def gauss_newton(t_, P, e, r, f, iter=[]):
    step_sizes = 2 ** np.arange(10, 0, -1) / 100 # [5.12..0.01]
    t0 = np.copy(t_); iter.append(t0)
    e0 = e(t0, P, f)
    improvement = 999.
    # can we improve t?
    while improvement > 0.1:
        J = grs(t0, d, P, r, f)
        JI = np.linalg.inv(
        g_ =, P, f))
        # best step size?
        t1 = t0 - step_sizes[0] * g_
        e1 = e(t1, P, f)
        for step in step_sizes[1:]:
            t2 = t0 - step * g_
            e2 = e(t2, P, f)
            if e2 > e1: # getting worse
                t1, e1 = t2, e2
        # anyone better?
        improvement = e0 - e1
        if improvement > 0:
            t0, e0 = t1, e1
    return t0
iter0 = []
t_ = np.array([0, 0]) # reset
t_ = gauss_newton(t_, P, e, r, f, iter0)
e1 = e(t_, P, f) # improved energy (GN)
tri(P + t_)
[arrow(*(Pa+iter0[i-1]), *(iter0[i]-iter0[i-1]), None, 'gray', 1.) for i in range(1,len(iter0))]


With e1 as 7.6 the resulting energy is better than ever before, even better than what I would have expected manually:

t3 = np.array([-1, -.6])
e3 = e(t3, P, f)
arrow(*Pa, *t3, 'manual', 'gray', 1.)


Here, e3 is 7.7 which is indeed slightly worse than the Gauss-Newton energy.

One last thing: We evaluated r numerically for partial derivatives, which is the right thing for most problems; but sometimes explicit derivation is also possible:

∂r/∂tx = ∂f(P+t)/∂tx = [(Px+tx)2+5(Py+ty)2 - 4]/∂tx = (Px2+2Pxtx+tx2)/∂tx = 2Px+2tx
∂r/∂ty = ∂f(P+t)/∂ty = [(Px+tx)2+5(Py+ty)2 - 4]/∂ty = (5Py2+10Pyty+5ty2)/∂ty = 10Py+10ty

... so in these friendly cases, evaluating r can at least be simplified.

More: Further improved variant Levenberg-Marquardt (LM), with (JTJ+λ‧diag(JTJ))-1JT‧r(t_,P,f) instead of (JTJ)-1JT‧r(t_,P,f) with a "dampening factor" (or relative weighting) λ combining Gauss-Newton (GN) and gradient descent (GD); Jacobian in robotics with J relating joints to end-effector, instead of translation t to points P.

3D Optimization

We now move away from continuous functions and concentrate fully on multiple objectives, namely two pyramids that we want to overlay in 3D; one is fixed and the other one is first rotated by a 3x3 matrix R and then translated by a 3D vector t. We use mplot3d (a submodule of matplotlib) for visualization, starting with the first pyramid:

import numpy as np; import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d # for 3d plots
V0 = np.array([[-1, -1, 0], [+1, -1, 0], [+1, +1, 0], [-1, +1, 0], [0, 0, 2], [0, 0, 1]]) # vertexes
Ti = np.array([[0, 1, 3], [3, 1, 2], [0, 1, 4], [1, 2, 4], [2, 3, 4], [3, 0, 4]]) # triangles
fig = plt.figure(); fig.clear()
ax = fig.add_subplot(111, projection='3d')
def ax3lim(x, y, z, r):
    ax.set_xlim3d(x-r, x+r); ax.set_xlabel('x')
    ax.set_ylim3d(y-r, y+r); ax.set_ylabel('y')
    ax.set_zlim3d(z-r, z+r); ax.set_zlabel('z')
def wire(V, Ti, color, line='-', a=.5):
    Ti2 = Ti if 2 == Ti.ndim else np.expand_dims(Ti, axis=0)
    Li1 = np.roll(Ti2.repeat(2, axis=1), -1, axis=1)
    Li2 = np.sort(Li1.reshape((-1, 2)), axis=1)
    Li3 = np.unique(Li2, axis=0)
    [ax.plot(*V[li].T, c=color, alpha=a, ls=line) for li in Li3]
    ax3lim (0, 0, 1, 1.5)
wire(V0, Ti, 'blue')


... and create a second pyramid by rotating and translating the first; our optimization will then strive to get the inverse of that R and t (or more formally, 6 parameters [a, b, c, tx, ty, tz] that describe them). Building the rotation matrix from angles a, b, c is not entirely trivial.

cos a cos b cos a sin b sin c - sin a cos c cos a sin b cos c + sin a sin c
R = | sin a cos b sin a sin b sin c + cos a cos c sin a sin b cos c - cos a sin c |
- sin b cos b sin c cos b cos c

def rotmat(a, b, c): # rz, ry, rx
    sa, sb, sc = np.sin([a, b, c])
    ca, cb, cc = np.cos([a, b, c])
    row0 = [ca*cb, ca*sb*sc - sa*cc, ca*sb*cc + sa*sc]
    row1 = [sa*cb, sa*sb*sc + ca*cc, sa*sb*cc - ca*sc]
    row2 = [-sb, cb*sc, cb*cc]
    return np.array([row0, row1, row2])
angles0 = np.array([5, 10, 15]) # rz, ry, rx
R0 = rotmat(*np.radians(angles0))
t0_ = np.array([0.6, 0.4, 0.2])
V1 = + t0_
wire(V1, Ti, 'red')


When the two pyramids are not too far apart, this is the classical Iterative Closest Point (ICP) case in the field of point cloud registration. The 2020 variants mostly use "point-to-plane", i.e. some vertexes on the red pyramid are pulled to the closest triangle of the blue pyramid. Since the red vertexes are probably not the best choice to be pulled (they are all on corners, since we have no tessellation or subdivision whatsoever), we replace them with vertexes centered on the triangles.

P1 = np.average(V1[Ti], axis=1)
wire(V0, Ti, 'blue')
wire(V1, Ti, 'red')
ax.plot(*P1.T, 'rx')


The red crosses need to be tested for distance to the corresponding triangle, using point-to-plane distance [wolfram]. Finding the correct or best "partner triangle" is a major goal in surface or point registration; here we cheat because we already know the point-to-triangle correspondence.

def triangle_distance(p_, V, ti_):
    tri = V[ti_]; assert tri.shape == (3, 3) # 3 vertexes
    n = np.cross(tri[1] - tri[0], tri[2] - tri[0]) # plane normal
    n = n / np.linalg.norm(n, axis=0) # normalize
    plane_dist = - tri[0])
    pp_ = p_ - plane_dist * n
    return np.abs(plane_dist), pp_
triangle_distance(P1[0], V0, Ti[0]) # testing
def any_triangle_distance(p_, V, Ti): # unused
    dist_list = [triangle_distance(p_, V, Ti[i]) + (i,) for i in range(len(Ti))]
    min_i = np.argmin([a[0] for a in dist_list])
    return dist_list[min_i]
any_triangle_distance(P1[3], V0, Ti) # testing
def partner_triangle_distance(P1, P1_i, V, Ti, pti_=None):
    p_ = P1 if 1 == P1.ndim else P1[P1_i]
    ti = P1_i if pti_ is None else pti_[P1_i]
    return triangle_distance(p_, V, Ti[ti]) + (ti,)
dist, pp_, ti = partner_triangle_distance(P1, 0, V0, Ti)

So point P1[0] = (0.3, 0.05, 0.2), in the middle of the red left floor triangle, is projected onto nearest triangle ti= 0, the blue left floor triangle. The intersection point pp_ = (0.3, 0.05, 0.0) has point-to-plane distance dist = 0.17.

def arrow3(p0_, p1_, color='gray'):
    P0 = p0_ if 2 == p0_.ndim else np.expand_dims(p0_, 0)
    P1 = p1_ if 2 == p1_.ndim else np.expand_dims(p1_, 0)
    Q = P1 - P0
    ax.quiver(*P0.T, *Q.T, color=color, alpha=0.8)
    return P0, Q
def wirepyr(V0, V1, Ti):
    wire(V0, Ti, 'blue', ':')
    wire(V1, Ti, 'red', ':')
def wiretri(P1, P1_i, pp_, V0, V1, Ti, V0_i):
    wirepyr(V0, V1, Ti)
    ax.plot(*P1.T, 'rx')
    ax.plot(*P1[P1_i:P1_i+1].T, 'gx')
    arrow3(P1[P1_i], pp_)
    wire(V1, Ti[P1_i], 'green', (0, (5, 5)), a=0.8)
    wire(V0, Ti[V0_i], 'green', (0, (5, 5)), a=0.8)
wiretri(P1, 0, pp_, V0, V1, Ti, ti)


We can see on the ground plate that the miniscule distance (down the z-axis) is correct, but it is not the most interesting triangle: This would be the right-most one, with point and triangle index 3.

dist, pp_, ti = partner_triangle_distance(P1, 3, V0, Ti)
wiretri(P1, 3, pp_, V0, V1, Ti, ti)


With the point-to-triangle distances, we now write the (somewhat complex) residual r for one point, as well as the energy function e which sums up over all points.

rt1 = np.zeros(6) # desired a, b, c, tx, ty, tz
def r(rt_, P, i, V, Ti, pti_=None):
    R, t_ = rotmat(*rt_[:3]), rt_[3:]
    q_ =[i]) + t_ # near blue
    dist_pp_ti = partner_triangle_distance(q_, i, V, Ti, pti_)
    return dist_pp_ti[0]
def e(rt_, P, V, Ti, pti_=None):
    residuals = np.array([r(rt_, P, i, V, Ti, pti_) for i in range(len(P))])
    return np.sum(residuals ** 2)
r1 = r(rt1, P1, 0, V0, Ti)
e1 = e(rt1, P1, V0, Ti)

For the first point in P1, the distance to the triangle plane r1 is 0.17 and the least squares energy e1 (over all points in P1) is 0.98, so we seem to be not too far away. Now comes the hardest part, determining the Jacobian J.

    ⌈ ∂r(p0)/∂a  ∂r(p0)/∂b  ∂r(p0)/∂c  ∂r(p0)/∂tx  ∂r(p0)/∂ty  ∂r(p0)/∂tz ⌉
J = | ∂r(p1)/∂a  ∂r(p1)/∂b  ∂r(p1)/∂c  ∂r(p1)/∂tx  ∂r(p1)/∂ty  ∂r(p1)/∂tz |
    ⌊    ...        ...        ...        ...         ...         ...     ⌋

As in 2D, we vary tiny distances d for all in [a, b, c, tx, ty, tz].

def gr3(rt_, d_, r, P, i, V, Ti, pti_=None):
    assert 2 == P.ndim
    assert 3 == P.shape[-1]
    assert rt_.shape == d_.shape
    rt0, rt1 = rt_ + d_/2, rt_ - d_/2
    dr0 = r(rt0, P, i, V, Ti, pti_)
    dr1 = r(rt1, P, i, V, Ti, pti_)
    return (dr0 - dr1) / np.sum(d_)
gr3(rt1, np.array([0, 0, 0, 1e-8, 0, 0]), r, P1, 3, V0, Ti) # testing

... and with a gradient of +0.89 on tx, it seems that the vertex P1[3] on the rightmost triangle can benefit from going left (against the gradient, as usual). Now on to all partial derivatives:

def grs3(rt_, d, r, P, i, V, Ti, pti_=None):
    I = np.identity(len(rt_)) * d
    L = [gr3(rt_, d_, r, P, i, V, Ti, pti_) for d_ in I]
    return np.array(L)
np.set_printoptions(precision=2, linewidth=100)
grs3(rt1, 1e-8, r, P1, 3, V0, Ti)

... and with a gradient (-0.26, +0.03, +0.13) for rotation and (+0.89, 0, +0.45) for translation, it seems that for P1[3] rotation around z- and x-axis are more meaningful than around y-axis; and that shifts in x- and z-direction make more sense than shifting in y-direction. Both kinda make sense if you look at the most recent image above.

def jac3(rt_, r, P, V, Ti, pti_=None):
    L = [grs3(rt_, 1e-8, r, P, i, V, Ti, pti_) for i in range(len(P))]
    return np.array(L)
J1 = jac3(rt1, r, P1, V0, Ti)

        a      b      c      tx     ty     tz
    ⌈ +0.00  -0.29  +0.05  +0.00  +0.00  +1.00 ⌉
J = | +0.00  -0.91  +0.75  +0.00  +0.00  +1.00 |
    ⌊  ...    ...    ...    ...    ...    ...  ⌋

For the first two lines (representing the two bottom triangles), a being 0 means that both triangles are indifferent to rotations around the z-axis; tx and ty being 0 indicates a similar indifference to shifts in the xy-plane. So let us try Gauss-Newton again:

def gn3(rt_, r, e, P, V, Ti, pti_=None, iter=[], d=1e-8):
    step_sizes = 2 ** np.arange(10, 0, -1) / 100 # [5.12..0.01]
    t0 = np.copy(rt_); iter.append(t0)
    e0 = e(t0, P, V, Ti, pti_)
    improvement = 999.
    # can we improve rt?
    while improvement > 0.1:
        J = jac3(t0, r, P, V, Ti, pti_)
        JI = np.linalg.inv(
        r_ = np.array([r(t0, P, i, V, Ti, pti_) for i in range(len(P))])
        g_ =
        # best step size?
        t1 = t0 - step_sizes[0] * g_
        e1 = e(t1, P, V, Ti, pti_)
        for step in step_sizes[1:]:
            t2 = t0 - step * g_
            e2 = e(t2, P, V, Ti, pti_)
            if e2 > e1: # getting worse
                t1, e1 = t2, e2
        # anyone better?
        improvement = e0 - e1
        if improvement > 0:
            t0, e0 = t1, e1
    return t0
pti = np.arange(0, len(P1)) # point-triangle indexer, 1:1
iter3 = [] # step-wise rt_
rt1 = np.zeros(6) # reset
rt2 = gn3(rt1, r, e, P1, V0, Ti, pti, iter3)
e2 = e(rt2, P1, V0, Ti) # improved energy (GN)

After optimizing, e2 is 0.003 which is a lot better than the 0.98 from the beginning. Now let us see where the optimized rotation/translation parameters rt2, at [0.01, -0.06, -0.13, -0.6, -0.37, -0.17] with a big shift in x, took the pyramid:

def proj3(rt_, V):
    R, t_ = rotmat(*rt_[:3]), rt_[3:]
    return + t_
def arrowproj3(iter, p_):
    P = np.array([proj3(rt_, p_) for rt_ in iter])
    Q = np.roll(P, -1, axis=0) - P
    ax.quiver(*P[:-1].T, *Q[:-1].T, color=['gray', 'green'],
              arrow_length_ratio=.1, alpha=.5)
def wireproj3(rt_, V0, V1, Ti, iter):
    V2 = proj3(rt_, V1)
    wirepyr(V0, V2, Ti)
    if iter is not None:
        arrowproj3(iter, V1[-1])
wireproj3(rt2, V0, V1, Ti, iter3)


While somewhat close, it also shows that our sampling was obviously not optimal. A closer look:

P2 = proj3(rt2, P1)
dist_pp_ti2 = [partner_triangle_distance(P2, i, V0, Ti) for i in range(len(P2))]
PP2 = np.array([a[1] for a in dist_pp_ti2])
ax.plot(*P2.T, 'rx')
[arrow3(P2[i], PP2[i]) for i in range(len(P2))]


... and the point-to-plane arrows are barely visible, confirming that Gauss-Newton did it right (verification); but we did not do the right thing (validation). Maybe adding more samples can help: One on the ground plate to fix z, and two on the sides to fix xy.

p_floor_ = V1[1] + 0.2 * (V1[3] - V1[1])
p_side1_ = V1[1] + 0.2 * ((V1[0] + V1[4])/2 - V1[1])
p_side2_ = V1[1] + 0.2 * ((V1[2] + V1[4])/2 - V1[1])
P3 = np.vstack([P1, p_floor_, p_side1_, p_side2_])
pti3 = np.hstack([pti, 0, 2, 3]) # point-triangle matches
partner_triangle_distance(P3, 8, V0, Ti, pti3) # testing
iter3 = []
rt3 = gn3(rt1, r, e, P3, V0, Ti, pti3, iter3)
e3 = e(rt3, P3, V0, Ti, pti3) # improved energy (GN)
wireproj3(rt3, V0, V1, Ti, iter3)
ax.plot(*proj3(rt3, P3).T, 'rx', alpha=.5)


And they are finally matched. Even though we did not normalize the energy (i.e. divide by number of points), e3 with 0.0007 is much better than the previous 0.0030. And so we learned that choosing samples has quite an influence!


And that concludes our introduction to mathematical optimization. Though a bit lengthy, I hope you were able to get some intuition for the field, something that I found to be quite a challenge. Thanks for leaving at least two brain cells here, and have a fun day!

EOF (Mar:2021)