Python PCA
Principal Component Analysis (PCA) is a statistical technique that converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. Each principal component is chosen to capture as much of the remaining variance as possible, and all principal components are orthogonal to one another. The first principal component therefore accounts for the largest share of the variance.
PCA’s applications include:
- It is used to find the relationships between variables in a dataset.
- It is a tool for interpreting and visualising data.
- By reducing the number of variables, it makes further analysis easier.
- It is frequently used to visualise genetic distance and relatedness between populations.
PCA is typically carried out on a square symmetric matrix. This can be a plain sums-of-squares-and-cross-products (SSCP) matrix, a covariance matrix, or a correlation matrix. A correlation matrix is used when the variances of the individual variables differ greatly.
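As a small illustration of the difference between these matrices (a sketch on synthetic data, not the tutorial's dataset), NumPy can build both directly:

import numpy as np

# synthetic data: 100 observations of 3 variables on very different scales
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# covariance matrix: dominated by the variable with the largest variance
cov_matrix = np.cov(data, rowvar=False)

# correlation matrix: every variable is standardised to unit variance
corr_matrix = np.corrcoef(data, rowvar=False)

print(np.round(cov_matrix, 2))
print(np.round(corr_matrix, 2))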
PCA’s Goals and Objectives
- It is a non-dependent procedure (it makes no distinction between dependent and independent variables) that reduces a large attribute space to a smaller number of factors.
- Although PCA is essentially a dimension-reduction method, there is no guarantee that the resulting dimensions will be interpretable.
- The main goal of PCA is to select a subset of variables from a larger set, based on which of the original variables have the highest correlation with the principal components.
Principal Axis Method
PCA finds the linear combination of variables that extracts the maximum variance. Once that combination is found, its variance is removed and the method searches for a second linear combination that explains the largest proportion of the remaining variance, and so on; this yields orthogonal factors. The technique works with the total variance.
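As a rough illustration of this behaviour (a sketch on synthetic data, not part of the walkthrough below), the components returned by scikit-learn's PCA are mutually orthogonal and each one explains a smaller share of the variance than the one before it:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))          # synthetic data, 4 variables

pca_demo = PCA().fit(X_demo)

# the components are orthonormal: this product is (numerically) the identity matrix
print(np.round(pca_demo.components_ @ pca_demo.components_.T, 6))

# each successive component explains a decreasing share of the total variance
print(pca_demo.explained_variance_ratio_)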
Eigenvector:
An eigenvector is a non-zero vector that remains parallel to itself after multiplication by the matrix. If Mx is parallel to x, where M is an r × r matrix, then x is an eigenvector of M. The eigenvectors and eigenvalues are found by solving Mx = λx, where both x and λ are unknown.
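A minimal NumPy check of this definition, using an arbitrary 2 × 2 symmetric matrix as the example:

import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigenvalues (lambda) and eigenvectors (columns of vecs) of M
vals, vecs = np.linalg.eig(M)

# for every pair, M @ x equals lambda * x, i.e. M @ x stays parallel to x
for lam, x in zip(vals, vecs.T):
    print(np.allclose(M @ x, lam * x))    # prints True for each eigenpair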
Principal components, defined by the eigenvectors, capture both shared and unique variance of the variables. PCA is a variance-focused approach that seeks to reproduce the total variance and the correlations across all components. The principal components are essentially linear combinations of the original variables, weighted by their contribution to explaining the variance in a particular orthogonal dimension.
Eigenvalues
Eigenvalues are also called characteristic roots. An eigenvalue measures the variance in all the variables that is accounted for by a factor. The ratio of the eigenvalues indicates the explanatory power of the factors relative to the variables: a factor with a low eigenvalue contributes little to explaining the variables. In short, an eigenvalue measures the proportion of the variance in the dataset that the factor accounts for. A factor's eigenvalue can be computed as the sum of its squared factor loadings over all variables.
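To make the link between eigenvalues and explained variance concrete, here is a small hedged sketch on synthetic data: for standardised variables the eigenvalues of the correlation matrix sum to the number of variables, and each eigenvalue divided by that total gives the proportion of variance the corresponding component accounts for.

import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(150, 3))     # synthetic data, 3 variables
data[:, 1] += data[:, 0]             # introduce some correlation

corr = np.corrcoef(data, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first

# proportion of total variance explained by each principal component
print(eigvals / eigvals.sum())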
1st Step: Importing Libraries
# importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
2nd Step: Importing the Dataset
Load the dataset and split it into the feature matrix X and the target vector y.
# importing or loading the dataset
dataset = pd.read_csv('wines.csv')

# distributing the dataset into two components, X and y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
3rd Step: Creating the Training and Test sets from the dataset
# Splitting X and y into the training set and the test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
4th Step: Scaling of Features
As part of the pre-processing, fit the StandardScaler on the training set and apply the same transformation to the test set.
# performing the preprocessing step
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
5th Step: Using the PCA function
Apply PCA to the training and test sets, keeping the first two principal components.
# Applying PCA to the training and test sets
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
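Two components are kept here so the results can be plotted in 2-D; in practice you would normally inspect the explained variance before fixing n_components. With the objects already defined above:

# share of variance captured by each retained component,
# and the cumulative share they explain together
print(explained_variance)
print(explained_variance.cumsum())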
6th Step: Fitting Logistic Regression to the Training set
# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
7th Step: Predicting the outcome of the test set
# Predicting the test set results using the
# predict method of LogisticRegression
y_pred = classifier.predict(X_test)
8th Step: Creating the confusion matrix
# making the confusion matrix between the
# test set of y and the predicted values
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
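As an optional check (not in the original steps), the overall accuracy can be read off the confusion matrix, or computed directly with scikit-learn:

from sklearn.metrics import accuracy_score

# accuracy = correctly classified samples / total samples
print(cm.trace() / cm.sum())
print(accuracy_score(y_test, y_pred))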
9th Step: Visualising the Training set results
# Visualising the training set
# results through a scatter plot
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)

plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()       # to show legend

# show scatter plot
plt.show()
10th Step: Creating a visual representation of the test set results
from matplotlib.colors import ListedColormap

X_set, y_set = X_test, y_test

X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('yellow', 'white', 'aquamarine')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)

# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')  # for Xlabel
plt.ylabel('PC2')  # for Ylabel
plt.legend()

# show scatter plot
plt.show()
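As an optional wrap-up (not part of the original walkthrough), the same scale → PCA → classify chain can be expressed as a single scikit-learn Pipeline. This sketch assumes it is fitted on the raw, untransformed split from the 3rd step, i.e. before the in-place StandardScaler and PCA transformations above were applied:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# X_train_raw / X_test_raw stand for the untransformed arrays returned by
# train_test_split in the 3rd step (hypothetical names used only in this sketch)
model = make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      LogisticRegression(random_state=0))
model.fit(X_train_raw, y_train)
print(model.score(X_test_raw, y_test))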