Python PCA – Principal Component Analysis in Python


Principal Component Analysis (PCA) is a statistical technique that converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The principal components are orthogonal to one another, the first principal component captures the largest possible variance, and each successive component is chosen to explain as much of the remaining variance as possible.

PCA’s applications include:

  • It’s used to figure out how variables in a dataset are related.
  • It’s a tool for interpreting and visualising data.
  • As the number of variables decreases, further analysis becomes easier.
  • It’s frequently used to visualise genetic distance and relatedness between populations.

PCA is typically carried out on a square symmetric matrix. This can be a plain sums-of-squares-and-cross-products (SSCP) matrix, a covariance matrix, or a correlation matrix. When the variances of the individual variables differ greatly, the correlation matrix is used so that no single variable dominates.
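To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not part of the original walkthrough): the covariance matrix depends on the scale of each variable, while the correlation matrix is scale-free, which is why the latter is preferred when variances differ widely.

# covariance vs. correlation matrix of a toy data matrix
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] *= 100                         # inflate the variance of the third variable

cov = np.cov(X, rowvar=False)          # scale-dependent: dominated by the third variable
corr = np.corrcoef(X, rowvar=False)    # scale-free: every variable treated equally
print(np.round(cov, 2))
print(np.round(corr, 2))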

PCA’s Goals and Objectives

  • It’s a variable-reduction method that does not rely on a dependent variable; it reduces a large attribute space to a smaller number of factors.
  • Although PCA is essentially a dimension-reduction method, there is no guarantee that the resulting dimensions will be interpretable.
  • Its main goal is to select a subset of variables from a larger set, based on which of the original variables are most strongly correlated with the principal components.

Method of the Principal Axis

PCA finds the linear combination of variables that extracts the maximum variance. Once that combination is found, the variance it explains is removed and the procedure searches for a second linear combination that explains the largest proportion of the remaining variance, and so on; this is what makes the resulting factors orthogonal. The technique therefore analyses total variance.
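The idea can be sketched in a few lines of NumPy (an illustration of the principle, not code from the original article): the eigendecomposition of the covariance matrix gives the directions of maximum variance, and removing ("deflating") the variance along the first direction leaves the second direction as the new maximiser.

# principal-axis idea: eigendecomposition of the covariance matrix plus deflation
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                      # centre the data

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(vals)[::-1]               # sort axes by the variance they explain
vals, vecs = vals[order], vecs[:, order]

pc1 = vecs[:, 0]                             # first principal axis: maximum-variance direction
X_deflated = Xc - np.outer(Xc @ pc1, pc1)    # remove the variance along pc1
# the maximum-variance direction of X_deflated is the second principal axis, vecs[:, 1]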

Eigenvector:

An eigenvector is a non-zero vector that stays parallel to itself after matrix multiplication: if Mx is parallel to x, then x is an eigenvector of the r×r matrix M. The eigenvectors and eigenvalues are obtained by solving Mx = λx, where both the vector x and the scalar λ are unknown.

Principal components, defined by the eigenvectors, reflect both the common and the unique variance of the variables. PCA is a variance-focused approach that seeks to reproduce the total variance and the correlations among all components. The principal components are linear combinations of the original variables, weighted by their contribution to explaining the variance in a particular orthogonal dimension.
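As a quick, illustrative check of the definition above (not part of the original text), NumPy's np.linalg.eig returns the eigenvalues and eigenvectors of a square matrix, and each eigenvector indeed satisfies Mx = λx:

# verifying Mx = λx for a small symmetric matrix
import numpy as np

M = np.array([[4.0, 2.0],
              [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(M)

x, lam = eigvecs[:, 0], eigvals[0]    # first eigenvector and its eigenvalue
print(np.allclose(M @ x, lam * x))    # True: Mx is just x scaled by λ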

Eigen Values

Eigenvalues are sometimes referred to as characteristic roots. An eigenvalue measures how much of the variance in all the variables is accounted for by the corresponding factor. The ratio of an eigenvalue to the sum of all eigenvalues therefore measures that factor’s explanatory power relative to the variables: a factor with a small eigenvalue contributes little to explaining them. In plain terms, it gives the proportion of the variance in the dataset that the factor accounts for. A factor’s eigenvalue can also be computed as the sum of its squared factor loadings over all variables.
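For a concrete (illustrative) check of this, the eigenvalues of the covariance matrix, divided by their sum, match the explained_variance_ratio_ that scikit-learn's PCA reports:

# eigenvalue ratio = proportion of total variance explained by each component
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # eigenvalues, largest first
print(eigvals / eigvals.sum())                                # share of total variance
print(PCA().fit(X).explained_variance_ratio_)                 # same numbers from sklearn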

1st Step: Importing the Libraries

# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2nd Step: Importing the Dataset

Import the dataset and split it into the feature matrix X (the first 13 columns) and the target vector y (the class label in the last column).

# loading the dataset
dataset = pd.read_csv('wines.csv')
# distributing the dataset into two components X and Y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values
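If the wines.csv file is not available locally, a dataset with the same shape (13 features, 3 wine classes) ships with scikit-learn and can be substituted; this alternative is not part of the original article:

# alternative: load the built-in wine dataset instead of wines.csv
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)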

3rd Step: Creating the Training and Test sets from the dataset

# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

4th Step: Scaling of Features

As part of the pre-processing, fit a StandardScaler on the training set and apply the same fitted transformation to the test set, so that no information from the test set leaks into the scaling.

# feature scaling: fit on the training set, apply to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

5th Step: Using the PCA function

Apply PCA to the training set and transform the test set with the same fitted model, keeping only the first two principal components.

# applying PCA to the training and test sets
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
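The explained_variance array holds the fraction of the total variance captured by each of the two retained components; printing it (an optional check, not in the original listing) shows how much information survives the 2-D projection:

# optional check: fraction of total variance explained by PC1 and PC2
print(explained_variance)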

6th Step: Fitting Logistic Regression to the Training Set

# fitting a logistic regression classifier on the 2-D PCA features
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

7th Step: Predicting the outcome of the test set

# Predicting the test set result using
# predict function under LogisticRegression
y_pred = classifier.predict(X_test)

8th Step: Creating the Confusion Matrix

# making confusion matrix between
# test set of Y and predicted value.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
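To inspect the result (an optional addition, not in the original listing), print the matrix and compute the test-set accuracy from the same predictions:

# optional: inspect the confusion matrix and the accuracy on the test set
from sklearn.metrics import accuracy_score
print(cm)
print(accuracy_score(y_test, y_pred))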

9th Step: Visualising the Training Set Results

# visualising the training set results:
# decision regions and scatter plot in the PC1 vs PC2 plane
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
					stop = X_set[:, 0].max() + 1, step = 0.01),
					np.arange(start = X_set[:, 1].min() - 1,
					stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
			X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
			cmap = ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
	plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
				c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend() # to show legend
# show scatter plot
plt.show()
Figure: Python PCA – Logistic Regression decision regions on the training set (PC1 vs PC2).

10th Step: Visualising the Test Set Results

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
					stop = X_set[:, 0].max() + 1, step = 0.01),
					np.arange(start = X_set[:, 1].min() - 1,
					stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(),
			X2.ravel()]).T).reshape(X1.shape), alpha = 0.75,
			cmap = ListedColormap(('yellow', 'white', 'aquamarine')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
	plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
				c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
# title for scatter plot
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1') # for Xlabel
plt.ylabel('PC2') # for Ylabel
plt.legend()
# show scatter plot
plt.show()
Figure: Python PCA – Logistic Regression decision regions on the test set (PC1 vs PC2).
