If you have any doubts about anything below, contact us by dropping a mail to the Kung Fu Panda. We will get back to you very soon.

- one of the most widely used machine learning techniques.
- the model/learning is essentially a tree structure.
- converts decision making into logical tree structures which are easy to understand.
- internal nodes denote conditions on features, branches denote the outcomes/values of those conditions, and leaves denote the predictions.
- uses a divide and conquer approach to classification.

- the flowchart/tree is in a human-readable format, so what the system has learned can actually be inspected and understood.
- very widely used across a large number of applications.

- should not be used when you have a large number of nominal features with many levels, or a large number of numeric features.
- either of the above can lead to a complex, overfitted tree.
- biased towards splitting on features having a large number of levels.

- a measure of disorder in a system.
- entropy = sum(-p * log2(p)) over all classes
- where p is the proportion of the data belonging to a class.

- eg if a set has 5 males and 5 females, the entropy will be
- -0.5*log2(0.5) - 0.5*log2(0.5) = -log2(0.5) = log2(2) = 1

- eg if a set has 5 males and 10 females (p = 1/3 and 2/3), the entropy will be
- -0.33*log2(0.33) - 0.67*log2(0.67) = -0.33*(-1.599) - 0.67*(-0.577) = 0.528 + 0.387 ≈ 0.92 (≈ 0.918 using the exact fractions 1/3 and 2/3)

- eg if a set has 5 males and 0 females, the entropy will be
- -1*log2(1) - 0*log2(0) = 0 (by convention, 0*log2(0) = 0)
- there is no disorder as all the records are of one type (male)

- for a two-class problem, the value of entropy goes from 0 to 1.
- if the data is pure, like either all males or all females, the entropy (disorder) is 0.
- if the data is equally distributed between the classes, disorder is maximal (we cannot make good guesses on it), so the entropy is 1.
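The three worked examples above can be reproduced with a short function (a minimal sketch; the function name is made up, and base-2 logs are used as in the examples):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c == 0:
            continue  # convention: 0 * log2(0) = 0
        p = c / total
        result -= p * math.log2(p)
    return result

print(entropy([5, 5]))   # 5 males, 5 females  -> 1.0
print(entropy([5, 10]))  # 5 males, 10 females -> ~0.918
print(entropy([5, 0]))   # all one class       -> 0.0
```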

- the measure of the reduction of entropy in the system caused by a split.
- information_gain = initial_entropy - final_entropy, where final_entropy is the size-weighted average entropy of the subsets produced by the split.

- recursively split the data based on conditions on the features.
- which feature to split the data on?
- whichever feature predicts the target class in the best way!!!
- whichever feature provides the best "information gain"
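The split-selection step above can be sketched in a few lines: compute the information gain of each feature and split on the best one (the toy "tennis" data, feature names, and function names are all made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Initial entropy minus the size-weighted entropy of the
    subsets produced by splitting on `feature`."""
    initial = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    final = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return initial - final

# hypothetical toy data: predict whether someone plays tennis
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "no", "yes", "no"]

# 'windy' predicts the labels perfectly, so it has the highest gain
best = max(["outlook", "windy"], key=lambda f: information_gain(rows, labels, f))
print(best)  # windy
```

A tree-building algorithm applies this choice recursively: split on `best`, then repeat on each subset until the subsets are pure or some stopping rule fires.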

- an implementation of the decision tree algorithm.

- performs well on a large number of problems.
- can exclude unimportant features.
- the resulting tree is easily understandable.
- more efficient than a number of other models.
- can handle numeric or nominal features or missing data.
- can also report the counts of each type of result (false positives, false negatives, etc.) so that costly mistakes can be minimized.

- small changes in the training data can lead to a completely different tree, and hence a completely different logical understanding.
- biased towards splitting on features having a large number of levels.
- decision trees mostly use axis-parallel splits, i.e. each node is a condition on one and only one feature.
- sometimes overfit when there are a large number of numeric features, or nominal features with a large number of levels.

- reducing the size/height of the tree by removing some nodes
- done to avoid overfitting
- pre-pruning
- stopping the tree from growing fully, while it is being built

- post-pruning
- pruning after the tree has grown fully
- considered better because we analyse all the data and then prune
- is obviously slower than pre-pruning.

- evaluating the model can be done using a confusion matrix
- a confusion matrix shows the counts of each type of result: true positive, true negative, false positive, false negative.
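The four counts can be tallied with a few counters (a minimal sketch; the function name and the toy label vectors are made up):

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count the four outcome types for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive: tp += 1  # predicted positive, was positive
            else:             fp += 1  # predicted positive, was negative
        else:
            if a == positive: fn += 1  # predicted negative, was positive
            else:             tn += 1  # predicted negative, was negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```

If false negatives are costlier than false positives (or vice versa), these counts let you weight the mistakes accordingly.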

- the predictions of a decision tree can be improved by boosting.

- boosting creates a strong learner from many weak learners.
- boosting is enabled by specifying a 'trials' factor in the algorithm.
- 'trials' is the number of decision trees to be used for creating the strong learner.
- normally 'trials' is set to 10.
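As a language-neutral illustration of how many weak learners are combined into a strong one, here is an AdaBoost-style sketch using one-level decision stumps as the weak learners (all names and the toy data are made up; this is one classic boosting scheme, not necessarily the exact one the notes' algorithm uses, but 'trials' plays the same role):

```python
import math

def train_stump(xs, ys, w):
    """Best weighted-error stump (threshold, direction) on 1-D data."""
    best = None
    for thr in xs:
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, trials=10):
    """Build `trials` weak stumps, reweighting the data after each one."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(trials):
        err, thr, sign = train_stump(xs, ys, w)
        err = max(err, 1e-10)                       # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this stump
        ensemble.append((alpha, thr, sign))
        # increase the weight of misclassified points, decrease the rest
        w = [wi * math.exp(-alpha * y * (sign if x >= thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    def predict(x):
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict

# toy 1-D data that no single stump can separate (labels are +1/-1)
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, trials=10)
print([model(x) for x in xs])
```

A single stump gets at most 4 of the 6 points right here; the weighted vote of 10 stumps recovers the full +,+,-,-,+,+ pattern.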

- more time-consuming because we need to build a number of trees.
- does not work well with noisy data.

- similar to decision trees
- use a separate and conquer strategy