Evaluating the Basics of Machine Learning
Jul 24 · 7 min read
Machine learning (ML) is “an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.” ML algorithms are used to find patterns in data that generate insight and help make data-driven decisions and predictions. These types of algorithms are employed every day to make critical decisions in medical diagnosis, stock trading, transportation, legal matters and much more. It is therefore easy to see why data scientists place ML on such a high pedestal: it provides a means of making high-priority decisions, ones that can guide better business and smarter actions, in real time and without human intervention.
Now, ML models do not necessarily ‘learn’ the way humans learn. Rather, these algorithms use computational methods to extract information directly from data without relying on a predetermined equation as a model. To do this, an algorithm is made to determine a pattern in the data and develop a target function that best maps an input variable, x, to a target variable, y. It must be noted here that the true form of the target function is usually unknown; if the function were known, ML would not be needed.
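As a rough sketch of this idea (the notation below is my own shorthand, not anything the article prescribes), the learning problem amounts to estimating an unknown function f that relates the input to the target, with some irreducible noise left over:

```latex
y = f(x) + \varepsilon , \qquad \hat{f} \approx f , \qquad \hat{y} = \hat{f}(x)
```

Here \hat{f} is the estimate the algorithm produces from the sample data, and \varepsilon is random noise that no model can explain away.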
Therefore, the idea is to determine the best estimate of this target function by conducting sound inference on the sample data and then applying and optimizing the ML technique most appropriate for the situation at hand. Different situations require that different assumptions be made about the form of the function being estimated. Additionally, different ML algorithms make different assumptions about the shape of the function and, thus, about how it should be optimized. Understandably, it is easy to get overwhelmed by how much there is to learn in ML. So, in this post, I discuss two important topics in ML that every data scientist should know.
1. The Type of Learning
ML algorithms are often categorized as either supervised or unsupervised, and this broadly refers to whether or not the dataset being used is labelled. Supervised ML algorithms apply what has been learned in the past to new data, using labelled examples to predict future outcomes. Essentially, the correct answer is known for these types of problems, and the estimated model’s performance is judged on whether or not its predicted output is correct. In contrast, unsupervised ML algorithms are those developed when the information used to train the model is neither classified nor labelled. These algorithms attempt to make sense of the data by extracting features and patterns that can be found within the sample.
Semi-supervised learning also exists, and it takes the middle ground between supervised and unsupervised learning: a small portion of the data is labelled, and the remainder is not.
Supervised learning is useful when the task at hand is a classification or regression problem. Classification problems involve grouping observations or input data into discrete ‘classes’ based on criteria developed by the model. A typical example is predicting whether an email is spam or non-spam. The model would be developed and trained on a dataset containing both spam and non-spam emails, where each observation is appropriately labelled.
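As a minimal illustration of the spam example (a sketch only: the choice of Python, the scikit-learn API, and the toy emails are mine, none of which are specified in the article):

```python
# A minimal supervised classification sketch: spam vs non-spam.
# Assumes scikit-learn is installed; the tiny example dataset is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "win a free prize now",                # spam
    "claim your free reward",              # spam
    "meeting rescheduled to noon",         # not spam
    "please review the attached report",   # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (the known 'correct answers')

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # turn text into word-count features

model = LogisticRegression()
model.fit(X, labels)                   # learn from the labelled examples

# Predict the class of a new, unseen email
new_email = vectorizer.transform(["free prize waiting for you"])
print(model.predict(new_email))        # e.g. [1] -> flagged as spam
```

Because the labels are supplied, the model’s predictions can be checked directly against the known answers.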
Regression problems, on the other hand, involve accepting a set of input data and determining a continuous quantity as the output. A common example is predicting an individual’s income, given their education level, gender, and the total number of hours worked.
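A similar sketch for the income example, again assuming scikit-learn and entirely made-up numbers, shows a regression model producing a continuous output rather than a class:

```python
# A minimal supervised regression sketch: predicting income from a few features.
# Assumes scikit-learn and numpy; the data and column meanings are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: years of education, gender (0/1 encoded), hours worked per week
X = np.array([
    [12, 0, 35],
    [16, 1, 40],
    [18, 0, 45],
    [14, 1, 38],
])
y = np.array([30000, 52000, 70000, 41000])  # annual income (continuous target)

model = LinearRegression()
model.fit(X, y)

# Predict income for a new, unseen individual
print(model.predict([[16, 0, 42]]))
```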
Unsupervised learning is most appropriate when the answer to a particular question is more or less unknown. These algorithms are mainly used for clustering and anomaly detection, because it is possible to detect similarities across observations without knowing exactly what each observation refers to. For example, one can look at the colour, size, and shape of various flowers and roughly separate them into groups without truly knowing the species of each flower. Or consider a credit card company monitoring consumer behaviour: it would be possible to detect fraudulent transactions by monitoring where transactions occur. If a card that is normally used in New York is, on a particular day, used in New York, Los Angeles and Hong Kong, that pattern could be flagged as an anomaly and the system should alert the relevant parties.
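The flower example can be sketched with a simple clustering algorithm; here I assume scikit-learn’s KMeans and fabricate a handful of petal measurements, since the article names no specific algorithm or data:

```python
# A minimal unsupervised sketch: grouping flowers by size/shape without labels.
# Assumes scikit-learn and numpy; the measurements are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Columns: petal length (cm), petal width (cm)
flowers = np.array([
    [1.4, 0.2], [1.3, 0.3], [1.5, 0.2],   # one apparent group
    [4.7, 1.4], [4.5, 1.5], [4.9, 1.5],   # another apparent group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(flowers)   # no species labels are ever provided
print(groups)   # e.g. [0 0 0 1 1 1] -- similar flowers end up together
```

No species labels are ever supplied; the algorithm groups the flowers purely by similarity, which is the essence of unsupervised learning.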
2. Model Fitting
Fitting a model refers to having an algorithm determine the relationship between the predictors and the outcome so that future values can be predicted. Recall that models are developed using training data, which is ideally a large random sample that accurately reflects the population. This necessary step comes with some very undesirable risks. Fully accurate models are difficult to estimate because sample data are subject to random noise. A very flexible model, one built with few restrictive assumptions, can end up learning this noise as if it were a genuine pattern; an overly rigid model, built with too many restrictive assumptions, may fail to learn enough from the data. These issues are known as overfitting and underfitting, and the goal is to strike an appropriate balance between simplicity and complexity.
Overfitting occurs when a model learns ‘too much’ from the training data, including its random noise. The model is then able to capture very intricate patterns within the training data, but this harms its performance on new data: the noise picked up during training does not apply to unseen data, and the model is unable to generalize the patterns it found. Certain ML models are more prone to overfitting than others, including nonlinear and nonparametric models. For these types of models, overfitting can be reduced by altering the model itself. Consider a nonlinear model with terms up to the 4th power: it may be possible to reduce overfitting by lowering the model to the 3rd power, provided acceptable results are still produced. Alternatively, overfitting can be limited by applying cross-validation or by regularizing the model parameters.
Underfitting, on the other hand, occurs when a model is unable to learn a sufficient amount of information from the training data. The model is then unable to determine suitable patterns within the data, which also hurts performance on new data. Since very little is learned, the model cannot apply much to unseen data and is unable to generalize observations for the research problem at hand. Commonly, underfitting is a result of model misspecification and can be fixed by using a more appropriate ML algorithm; for example, if a linear equation is used to estimate a nonlinear problem, underfitting will occur. That said, underfitting can also be addressed through cross-validation and parameter regularization.
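Both failure modes can be seen on a small synthetic example. The sketch below is my own construction (the article prescribes no code): it fits polynomials of increasing degree to noisy cubic data, where the 1st-degree model underfits the signal and the 12th-degree model fits the training noise more closely than the held-out data warrants.

```python
# A rough sketch of under- vs over-fitting on a nonlinear problem.
# Assumes scikit-learn and numpy; the data are synthetic and purely illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 3 + rng.normal(scale=2.0, size=60)   # cubic signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 12):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_err:8.2f}, test MSE {test_err:8.2f}")
```

Typically the degree-1 model shows high error on both sets (underfitting), while the degree-12 model shows a low training error but a noticeably higher test error (overfitting).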
Cross-validation is a technique used to evaluate how well a model generalizes: several models are trained on different subsets of the sample data, and each is then evaluated on the complementary, held-out subset.
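A minimal cross-validation sketch, again assuming scikit-learn and synthetic data of my own choosing, might look like this: each of five folds trains on four fifths of the sample and scores on the remaining fifth.

```python
# A minimal 5-fold cross-validation sketch; assumes scikit-learn is installed.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each fold trains on 4/5 of the data and is scored on the held-out 1/5.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores, scores.mean())
```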
Regularization refers to the process of adding information about the model parameters in order to combat poor model performance. This can be done by specifying that a parameter follows a particular distribution, such as a normal rather than a uniform distribution, or by restricting the range of values a parameter may take.
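One common concrete form of regularization is ridge regression, which penalizes large coefficients (roughly equivalent to assuming the coefficients follow a normal distribution centred at zero). The sketch below, once more using scikit-learn and synthetic data of my own choosing, compares the largest coefficient with and without the penalty:

```python
# A minimal regularization sketch using ridge regression (an L2 penalty that
# constrains how large the coefficients may grow). Assumes scikit-learn.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=50, n_features=20, noise=15.0, random_state=0)

plain = LinearRegression().fit(X, y)
ridged = Ridge(alpha=10.0).fit(X, y)   # alpha controls the penalty strength

# The penalty typically shrinks coefficients toward zero, curbing overfitting.
print(abs(plain.coef_).max(), abs(ridged.coef_).max())
```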
Machine learning models are extremely powerful, but with great power comes great responsibility. Developing the most appropriate ML model requires that the researcher adequately understand the problem at hand and which techniques are suitable given the circumstances. Understanding whether a problem is supervised or unsupervised provides insight into which type of ML algorithm to use, while understanding model fit can prevent poor performance once the model is deployed. Happy modelling!
References:
machinelearningmastery.com/how-machine-learning-algorithms-work/
Other Useful Material:
simplilearn.com/importance-of-machine-learning-for-data-scientists-article
towardsdatascience.com/important-topics-in-machine-learning-you-need-to-know-21ad02cc6be5