grouping customer behaviour for targeted advertising.
prediction of election or match outcomes.
prediction of areas with likely criminal activity.
discovery of genetic sequences linked to diseases.
to solve a problem, we generalize experiences into a concept: you rarely get the exact same problem/experience you saw earlier, only a slightly different one, so you try to learn the underlying concept.
The concept is what is called the model in ML.
the process of fitting a model to a dataset, possibly tuning the model to get the best coefficients.
it is called training and not learning, because machines cannot learn and decide on their own; they have to be trained.
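The "fitting a model / tuning coefficients" idea above can be sketched with a tiny least-squares fit of a one-feature linear model. The data here is made up for illustration:

```python
# Minimal sketch: "training" a one-feature linear model y ≈ w*x + b
# by ordinary least squares (hypothetical toy data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates of slope and intercept.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(round(w, 2), round(b, 2))  # slope close to 2, intercept close to 0
```

Training here is exactly "finding the best coefficients" w and b for the given dataset; more complex models do the same with more parameters.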
store data so that it can be analysed.
making models for data representation
models can be represented by
diagrams like trees
logical if else rules
grouping of data - clusters
generating real knowledge
improving the accuracy of the model
heuristics/educated guesses to improve the model and make it useful.
How model performs against unknown data
models cannot perform perfectly against unknown data because of noise.
Noise could be because of
complicated use case, like where feelings are involved.
data quality issues(too many null, unknown values)
Trying to model noise leads to overfitting.
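A quick sketch of how modelling noise leads to overfitting, using made-up data: a degree-9 polynomial fits the 10 noisy training points almost exactly, but does worse on clean test points than a simple straight line.

```python
import numpy as np

# Sketch: trying to model noise leads to overfitting (assumed toy data).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, 10)  # true signal + noise
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test                             # noise-free truth

errors = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    errors[degree] = (train_err, test_err)
    print(degree, round(train_err, 4), round(test_err, 4))
```

The degree-9 model has near-zero training error (it memorizes the noise) but a higher test error than the line, which is overfitting in miniature.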
The algo is said to have a bias if its conclusions are systematically erroneous, i.e. wrong in a predictable manner.
This is mostly due to faulty training data, or to new data features that were never observed in training.
Let's say a machine is trained to identify faces and is shown a large number of pictures to train on.
But in no case is a face with shades shown.
So, while testing, if asked whether a face with shades is a face, it will classify it as not a face.
This is because of 'bias' in the dataset, which never contained a face with shades.
Sometimes, Bias can also be considered 'useful' for machines.
Programs are coded such that they must choose one way over the other, and they show 'bias' when picking one way out of the countless possible ways.
Data exploration and preparation
Can be of following types
categorical (nominal) => having a fixed set of values
ordinal => categories falling in a sorted list, like small, medium, large or grades in an exam
nominal => categories with no inherent order, like male/female
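The nominal/ordinal distinction can be sketched in a few lines (hypothetical survey values): nominal categories cannot be ranked, ordinal ones can.

```python
# Sketch: nominal vs ordinal categorical values (made-up data).

# Nominal: fixed set of values with no order.
gender = ["male", "female", "female", "male"]

# Ordinal: categories with a meaningful order, so they can be ranked.
size_order = {"small": 0, "medium": 1, "large": 2}
shirt_sizes = ["medium", "small", "large", "small"]

ranked = sorted(shirt_sizes, key=size_order.get)
print(ranked)  # → ['small', 'small', 'medium', 'large']
```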
Types of ML algos
supervised learning algos
given some data, a supervised learning algo attempts to optimize a function (model) to find the combination of features that best predicts a target variable.
supervised learning is mostly used in classification.
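Supervised classification can be sketched with a from-scratch 1-nearest-neighbour classifier (the first algo in the table below); the points and labels here are assumed toy data.

```python
import math

# Minimal 1-nearest-neighbour classifier (hypothetical toy data):
# predict the label of the single closest labelled training point.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.3, 4.9), "B")]

def predict(point):
    # Find the training example with the smallest Euclidean distance.
    def dist(example):
        return math.dist(point, example[0])
    return min(train, key=dist)[1]

print(predict((1.1, 0.9)))  # nearest neighbours are class "A"
print(predict((5.1, 5.1)))  # nearest neighbours are class "B"
```

The "learning" here is trivially storing the labelled data; the model is the distance-based rule.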
unsupervised learning algos
done using clustering.
eg. identify items that are frequently bought together.
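The "frequently bought together" idea can be sketched by counting item pairs across baskets, the core counting step behind association rules. The transactions here are made up:

```python
from itertools import combinations
from collections import Counter

# Sketch: count item pairs that co-occur in baskets (hypothetical data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pairs = Counter()
for basket in baskets:
    # Count every unordered pair of items in this basket once.
    pairs.update(combinations(sorted(basket), 2))

print(pairs.most_common(1))  # ('bread', 'butter') co-occurs in 3 baskets
```

No labels or target variable are involved, which is what makes this unsupervised.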
meta-learning algos try to learn how to learn more effectively.
Supervised Learning Algos and usages
Nearest Neighbours => Classification
Naive Bayes => Classification
Decision Trees => Classification
Classification rule learner => Classification
Linear Regression => Regression
Regression Trees => Regression
Model Trees => Regression
Neural Networks => Classification and Regression
Support Vector Machines => Classification and Regression
Association Rules => Pattern detection
k-means clustering => Clustering
Meta Learning Algorithms
Bagging => Classification and Regression
Boosting => Classification and Regression
Random Forests => Classification and Regression
Bias => ‘how much, on average, are the predicted values different from the actual values.’
Variance => ‘how different will the predictions of the model be at the same point if different samples are taken from the same population’
the trade-off between overfitting and underfitting the training data is known as the bias-variance problem.
Overfitting model => good prediction on training data => Low Bias <=> High Variance
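Variance in the sense above can be estimated directly: refit a model on different samples from the same population and measure how its prediction at one fixed point varies. A sketch with assumed toy data, comparing a flexible degree-7 polynomial against a straight line:

```python
import numpy as np

# Sketch: variance = spread of predictions at the same point when the
# model is refit on different samples from the same population.
rng = np.random.default_rng(1)

def sample():
    # Draw a fresh noisy sample from the (hypothetical) population y = 2x.
    x = rng.uniform(0, 1, 15)
    y = 2 * x + rng.normal(0, 0.3, 15)
    return x, y

preds = {1: [], 7: []}
for _ in range(200):
    x, y = sample()
    for deg in preds:
        # Prediction of each refit model at the fixed point x = 0.5.
        preds[deg].append(np.polyval(np.polyfit(x, y, deg), 0.5))

for deg in preds:
    print(deg, round(float(np.var(preds[deg])), 4))
```

The flexible model's predictions at x = 0.5 spread out far more across samples: higher variance, the overfitting side of the trade-off.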
Bagging => technique to reduce the variance of the predictions by combining the results of various classifiers modelled on different sub-samples of the same dataset.
Boosting => class of learner algorithms to convert a weak learner to a strong learner.
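The bagging mechanism described above, fit many models on bootstrap resamples of the same dataset and combine their outputs, can be sketched with a deliberately tiny "model" (the sample mean); real bagging does the same with classifiers or trees:

```python
import random

# Sketch of the bagging mechanism (toy estimator = sample mean,
# made-up data): fit on bootstrap resamples, then average.
random.seed(0)
data = [2.0, 2.5, 1.8, 2.2, 2.7, 1.9, 2.4, 2.1]

def bootstrap_estimate(data):
    # Resample with replacement (a bootstrap sub-sample of same size).
    resample = [random.choice(data) for _ in data]
    return sum(resample) / len(resample)

# Combine (average) the estimates from 100 resampled fits.
bagged = sum(bootstrap_estimate(data) for _ in range(100)) / 100
print(round(bagged, 2))  # close to the plain sample mean of 2.2
```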
Types of Results
True positive => algo was true in its prediction and gave the result as positive.
True negative => algo was true in its prediction and gave the result as negative.
False positive => algo was false in its prediction and gave the result as positive.
False negative => algo was false in its prediction and gave the result as negative.
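The four result types can be counted directly from actual vs predicted labels; the binary predictions here are hypothetical (1 = positive):

```python
# Sketch: counting the four result types (hypothetical binary labels).
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

print(tp, tn, fp, fn)  # → 3 3 1 1
```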
Normally an algo should try to get all true results(true predictions).
depending on the use case, a false positive may be more or less costly than a false negative.
like in case of cancer detection etc,
a false negative is very costly as it will fail to detect cancer in someone who has it.
a false positive will be caught later on when further tests/treatment happen.
in a spam email scenario
would you like some spam email to come to your inbox? (false negative)
would you like some normal mail to go to spam? (false positive)
Here false positive is much more costly than false negative.
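The cost asymmetry can be made concrete by weighting the two error counts; the filters, counts, and the 10x cost ratio below are all hypothetical, chosen to match the spam example where losing real mail hurts more than seeing spam:

```python
# Sketch: comparing classifiers under asymmetric error costs
# (hypothetical counts; assume a false positive costs 10x a false negative).
cost_fp, cost_fn = 10, 1

# (false positives, false negatives) for two hypothetical spam filters:
filter_a = (5, 2)   # aggressive: more real mail wrongly flagged as spam
filter_b = (1, 8)   # cautious: more spam reaches the inbox

def total_cost(fp, fn):
    return fp * cost_fp + fn * cost_fn

print(total_cost(*filter_a), total_cost(*filter_b))  # → 52 18
```

Raw accuracy would favour filter_a (7 errors vs 9), but under these costs the cautious filter_b is the better choice, which is the point of the note above.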