Data Exploration Using Pig and Hive
Some of the questions we asked of our dataset:
Q1. Number of males and females whose loans were approved.
Q2. Maximum loan amount term among approved loans.
Q3. Which property type has the maximum loan amount?
Q4. What is the minimum applicant and/or co-applicant income at which any loan was approved?
Q5. Ratio between the minimum loan amount and the associated income.
Q6. Ratio between the maximum loan amount and the associated income.
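As a rough illustration of what Q1 and Q2 compute, here is a minimal sketch in plain Python rather than Pig or Hive. The records and field names (gender, term, approved, and so on) are hypothetical toy values invented for this example, not rows or columns from the actual dataset.

```python
from collections import Counter

# Hypothetical toy records; field names and values are assumptions for
# illustration only and do not come from the real dataset.
loans = [
    {"gender": "Male", "loan_amount": 120, "term": 360, "approved": True},
    {"gender": "Female", "loan_amount": 80, "term": 180, "approved": True},
    {"gender": "Male", "loan_amount": 200, "term": 480, "approved": False},
    {"gender": "Male", "loan_amount": 95, "term": 360, "approved": True},
]

# Keep only the approved loans, since Q1 and Q2 both filter on approval.
approved = [row for row in loans if row["approved"]]

# Q1: number of approved loans per gender (a GROUP BY + COUNT in Hive terms).
approved_by_gender = Counter(row["gender"] for row in approved)

# Q2: longest loan term among approved loans (a MAX aggregate in Hive terms).
max_approved_term = max(row["term"] for row in approved)

print(dict(approved_by_gender))
print(max_approved_term)
```

The same filter-then-aggregate shape applies to the remaining questions; only the grouping key and the aggregate function change.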
Logistic regression is intended for binary (two-class) classification problems. It predicts the probability that an instance belongs to the default class, and that probability can then be thresholded into a 0 or 1 classification.
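The probability-then-threshold step can be sketched as follows. This is a minimal illustration, assuming the weights and bias have already been learned; the function names and the 0.5 default threshold are choices made for this example, not values taken from the project.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that the instance belongs to the default (positive) class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(weights, bias, features, threshold=0.5):
    """Convert the probability into a 0 or 1 label using a threshold (assumed 0.5)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical one-feature model, purely for illustration.
print(predict_proba([1.0], 0.0, [2.0]))
print(predict_class([1.0], 0.0, [2.0]))
```

Raising or lowering the threshold trades false approvals against false rejections, so 0.5 is only a common default.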
A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into progressively smaller subsets while incrementally developing the associated decision tree. The final result is a tree with decision nodes and leaf nodes.
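To make the idea of repeatedly splitting a dataset concrete, here is a small sketch of a classification tree in plain Python. It assumes numeric features and uses Gini impurity to choose splits; that criterion, the max_depth limit, and all function names are assumptions made for this example and may differ from what the project's library does.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split the data; stop at pure subsets or at max_depth."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        # Leaf node: predict the majority class of this subset.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    # Decision node: stores the test and the two subtrees.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root through decision nodes until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Hypothetical toy data: one feature, two classes.
toy_rows = [[1.0], [2.0], [3.0], [4.0]]
toy_labels = [0, 0, 1, 1]
toy_tree = build_tree(toy_rows, toy_labels)
print(predict(toy_tree, [1.5]), predict(toy_tree, [3.5]))
```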
Random Forest Classifier: Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or the mean of their predictions (regression).
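The aggregation step described above can be sketched in a few lines. This example covers only how the individual trees' outputs are combined; training each tree on a random sample of rows and features is omitted, and the function names and vote values are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Classification: return the mode (most common class) of the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs from three already-trained trees.
print(forest_classify(["Y", "N", "Y"]))
print(forest_regress([100.0, 120.0, 140.0]))
```

Because each tree sees a different sample of the data, their individual errors tend to differ, and voting or averaging reduces the variance of the combined prediction.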
Gradient Boosting Classifier: Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.
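The stage-wise idea can be illustrated with a small sketch. For simplicity it assumes a single numeric feature and squared-error loss (a regression setting, not the classifier's log loss), where the negative gradient is just the residual; each stage fits a one-split stump to the current residuals. The function names, the learning rate, and the number of stages are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimises squared error."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Build an additive model, one weak learner per stage."""
    base = sum(ys) / len(ys)  # stage 0: a constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_stages):
        # For squared-error loss the negative gradient equals the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Hypothetical toy data with a single step between x = 2 and x = 3.
boosted = gradient_boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
print(round(boosted(1), 3), round(boosted(4), 3))
```

Swapping in a different differentiable loss changes only how the residual-like targets are computed at each stage, which is the generalization the paragraph above refers to.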