2. Principal Component Analysis
PCA Algorithm for Feature Extraction
The following are the six steps of the principal component analysis (PCA) algorithm:
1. Standardize the dataset: Standardizing / normalizing the dataset is the first step before performing PCA. PCA calculates a new projection of the given dataset, and the new axes are based on the standard deviation of the features. A feature / variable with a high standard deviation would therefore carry a higher weight in the calculation of an axis than a feature / variable with a low standard deviation. Once the data is normalized / standardized, the standard deviations of all features / variables are measured on the same scale, so all variables carry the same weight and PCA calculates the relevant axes appropriately. Note that the data is standardized / normalized after creating the training / test split. Python's sklearn.preprocessing StandardScaler class can be used for standardizing the dataset.
2. Construct the covariance matrix: Once the data is standardized, the next step is to create an n × n covariance matrix, where n is the number of dimensions (features) in the dataset. The covariance matrix stores the pairwise covariances between the different features. Note that a positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. Python's NumPy cov method can be used to create the covariance matrix.
3. Perform eigendecomposition of the covariance matrix: The next step is to decompose the covariance matrix into its eigenvectors and eigenvalues. The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues define their magnitude. NumPy's linalg.eig or linalg.eigh can be used to decompose the covariance matrix into eigenvectors and eigenvalues.
4. Select the most important eigenvectors / eigenvalues: Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d). One can use the concept of explained variance to select the k most important eigenvectors.
5. Create the projection matrix from the important eigenvectors: Construct a projection matrix, W, from the top k eigenvectors.
6. Transform the training / test datasets: Finally, transform the d-dimensional input training and test datasets using the projection matrix W to obtain the new k-dimensional feature subspace, as shown in the sketch after this list.
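The sketch below puts all six steps together using StandardScaler, NumPy's cov, and linalg.eigh as mentioned above. The Iris dataset, the 70/30 split, and the choice of k = 2 are illustrative assumptions made here for a self-contained example, not part of the algorithm itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative dataset; any numeric feature matrix works the same way
X, y = load_iris(return_X_y=True)

# Step 1: split first, then standardize (fit the scaler on training data only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Step 2: construct the n x n covariance matrix
# (rowvar=False treats columns as features)
cov_mat = np.cov(X_train_std, rowvar=False)

# Step 3: eigendecomposition; eigh is suited to symmetric matrices
# such as a covariance matrix
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# Step 4: sort eigenvalues in decreasing order and rank the eigenvectors;
# the explained-variance ratios guide the choice of k
order = np.argsort(eigen_vals)[::-1]
eigen_vals, eigen_vecs = eigen_vals[order], eigen_vecs[:, order]
explained_variance_ratio = eigen_vals / eigen_vals.sum()
k = 2  # illustrative choice, e.g. based on the explained-variance ratios

# Step 5: build the d x k projection matrix W from the top k eigenvectors
W = eigen_vecs[:, :k]

# Step 6: project the d-dimensional data onto the k-dimensional subspace
X_train_pca = X_train_std @ W
X_test_pca = X_test_std @ W

print(explained_variance_ratio)
print(X_train_pca.shape)  # (n_samples, k)
```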
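For comparison, scikit-learn's PCA class bundles steps 2 through 6 into a single transformer. Applied to the same standardized data, it should yield the same projection up to the sign of each eigenvector:

```python
from sklearn.decomposition import PCA

# Equivalent transformation using scikit-learn's built-in PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print(pca.explained_variance_ratio_)
```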