The algorithm for LU factorization is given in
Figure 1.4 and the graphical representation of the
algorithm is given in Figure 1.5, as it would look part
way through the computation. The black square represents the current
pivot element. The horizontal shaded rectangle is a row from the upper
triangle of the matrix. The vertical shaded rectangle is a column from
the lower triangle of the matrix. The L and U labeled regions are
the portions of the matrix that have already been updated by the
algorithm. The algorithm has not yet reached the region labeled A'.
Figure 1.4:
LU factorization pseudocode.
 find maximum element in column(i) (starting below the diagonal)
 swap the row of maximum element with row(i)
 scale column(i) by 1/A(i,i)
 let A' = A(i:N,i:N)
 A' < A' + x y^T (rank one update)

Figure 1.5:
Diagram for LU factorization.

We create an upper and lower triangle view of the original matrix
A. We use typedefs to create the specific triangular
matrix type from the MTL class triangle_view. We then create the U and
L matrix objects and pass the matrix A to their
constructors. The L and U objects are just handles,
so their creation is inexpensive (constant time complexity). Every
component and algorithm in MTL is inside the mtl:: namesapce.
If you prefer not to have to specify the namespace each time you can
do using namespace mtl;.
typedef mtl::row_type<Matrix>::type RowMatrix;
typedef mtl::column_type<Matrix>::type ColumnMatrix;
typedef mtl::triangle_view<RowMatrix, unit_upper>::type Unit_Upper;
typedef mtl::triangle_view<ColumnMatrix, unit_lower>::type Unit_Lower;
Unit_Upper U(A);
Unit_Lower L(A);
We can access the rows and columns from L and U
through their TwoD iterators, which are created with the
begin() method. Dereferencing the TwoD iterators gives a row
or column.
Unit_Upper::iterator rowi = U.begin();
Unit_Lower::iterator columni = L.begin();
The first operation in the LU factorization is to find the maximum
element in the column. When we search for the maximum element, we want
to include the element in the pivot row, so we cannot use the
L triangle object defined above since it is unit
diagonal. Instead we create another lower triangle D. We can
then simply apply the max_index() function to the
dcolumn object.
typedef mtl::triangle_view<Matrix, lower>::type D(A);
Lower::iterator dcoli = D.begin();
jp = mtl::max_index(*dcoli);
The next operation in the LU factorization is to swap the current row
with the row that has the maximum element. The following code snippet
shows how to call the swap() function to interchange the
rows. The pivot variable is the row index for the current
pivot element. Note how rows from the matrix A are being
used as arguments in the swap() algorithm. In MTL, rows
and columns are first class vector objects (with some caveats for
sparse rows and columns). Similarly, submatrices in MTL are first
class matrix objects, and can be used in the same way as a matrix.
if (jp != j)
mtl::swap(rows(A)[j], rows(A)[jp]);
The third operation in the LU factorization is to scale the column
under the pivot by 1/A(i,i), which can be
performed with the scale() MTL
algorithm, which we apply to the column dereferenced by
*columni.
if (j < M  1)
mtl::scale(*columni, T(1) / A(j,j));
The final operation in the LU factorization is to update the
trailing submatrix according to A' = A' + x y^T A' =
A' + x y^{T}.
We first create the submatrix A' from matrix A using
the sub_matrix() method. The algorithm
rank_one_update() can then be applied to the submatrix. We
do not use *rowi and *columni directly because their
indexing is in terms of A and not subA. A future
release of MTL will provide a better way of adjusting the indices than
this method of copying.
subA = A.sub_matrix(j+1, M, j+1, N);
copy(*columni, c); copy(*rowi, r);
mtl::rank_one_update(subA, scaled(c, T(1)), r);
The complete implementation of this example LU factorization can be
found in lu.h.
The execution time of many linear algebra operations on modern
computer architectures can be decreased dramatically through blocking
to increase cache utilization [6,7].
In algorithms where repeated matrixvector operations are done ( rank_one_update() for instance), it is beneficial to convert
the algorithm to use matrixmatrix operations to introduces more
opportunities for blocking.
The LU factorization algorithm can be reformulated to use more
matrixmatrix operations [16]. First we
split matrix A into four submatrices, using a blocking factor of
r (i.e., A_{11} is r x r).
A = [  ]
[ A_11  A_12 ] r
[]
[ A_21  A_22 ] n  r
[  ]
r n  r
Then we formulate A = LU in terms of the blocks.
[ A_11 A_12 ] = [ L_11 0 ] [ U_11 U_12 ]
[ A_21 A_22 ] [ L_21 L_22 ] [ 0 U_22 ]
From this we can derive the following equations for the submatrices of
A. The matrix on the right shows the values that should occupy Aafter one step of the blocked LU factorization.
We find L_{11}, U_{11},
and L_{21} by applying the pointwise LU factorization
to the combined region of A_{11} and
A_{21}. We then calculate U_{12} with a
triangular solve applied to A_{12}. Finally we
calculate A*_{22} with a matrix product of
L_{21} and U_{12}. The algorithm is then
applied recursively to A*_{22} .
In the implementation of block LU factorization MTL can be used to
create a partitioning of the matrix into submatrices. The use of the
submatrix objects throughout the algorithm removes the redundant
indexing that a programmer would typically have to do to specify the
regions for each submatrix for each operation. The partition() method of the MTL
matrix cuts the original matrix along particular rows and columns. The
partitioning results in a new matrix whose entries are the
submatrices. In the code below we create the partitioning that
corresponds to Figure 1.7
with the matrix object Ap. The partitioning for
Figure 1.8 is created with
matrix object As.
int row_sep1[] = { j, j + jb };
int col_sep1[] = { j };
Matrix Ap = A.partition(array_to_vec(row_sep1),
array_to_vec(col_sep1));
int row_sep2[] = { j, j + jb };
int col_sep2[] = { j, j + jb };
Matrix As = A.partition(array_to_vec(row_sep2),
array_to_vec(col_sep2));
triangle_view<Matrix, unit_lower>::type L_11(As(1,1));
Figure 1.7 depicts the block factorization part
way through the computation. The matrix is divided up for the
pointwise factorization step. The region including A_{11} and
A_{21} is labeled A_{1}. Since there is pivoting involved, the rows
in the regions labeled A_{0} and A_{2} must be swapped according to
the pivots used in A_{1}.
Figure 1.7:
Pointwise step in block LU factorization.

The implementation of this step in MTL is very concise. The Ap(1,1)
submatrix object is passed to the lu_factorize() algorithm. We
then use the multi_swap() function on Ap(1,0) and Ap(1,2)
to pivot their rows to match Ap(1,1).
Pvector pivots(jb);
lu_factorize(Ap(1,1), pivots);
multi_swap(Ap(1,0), pivots);
multi_swap(Ap(1,2), pivots);
Once A_{1} has been factored, the A_{12} and A_{22} regions must
be updated. The submatrices involved are depicted in
Figure 1.8. The A_{12} region needs to be
updated with a triangular solve. This version of triangular solve can
be found in the matmat namespace of the
MTL.
Figure 1.8:
Update steps in block LU factorization.

To apply the tri_solve() algorithm to L_{11} and A_{12}, we
merely call the MTL routine and pass in the L_11 and As(1,2) matrix objects.
mtl::tri_solve(L_11, As(1,2), mtl::left_side());
The last step is to calculate A*_{22} with a
matrixmatrix multiply according to A*_{22} <
A_{22}  L_{21}U_{12}.
To implement this operation we use the As(1,2) and
As(2,1) matrix objects, which have been overwritten in the
previous steps with U_{12} and L_{21},
and make a call to the mult() function.
mtl::mult(scaled(As(2,1), T(1)), As(1,2), As(2,2));
The complete version of the MTL block LU factorization algorithm
is given in blocked_lu.h.
