Solutions to the most common errors in machine learning


When you build your first model, avoid these 5 mistakes.


Data science and machine learning are becoming increasingly popular, and the number of people in the field is growing daily. This means that many data scientists are building their first machine learning models without extensive experience, and this is where mistakes can happen.


Recently, Agnis Liukis, a software architect, data scientist, and Kaggle guru, wrote an article in which he discusses solutions to some of the most common beginner mistakes in machine learning, so that beginners can understand and avoid them.


The following is the content of the article.


In the field of machine learning, beginners should avoid these 5 mistakes.


1. Not using data normalization where it's needed


It seems easy enough to take the data, build features, feed them into a model, and get predictions. However, in some cases the results of this simple approach can be disappointing, because it is missing a very important step.


Some models require data normalization, such as linear regression and classical neural networks. These models multiply feature values by trained weights. With non-normalized features, the possible range of one feature's values may differ greatly from the range of another's.


Suppose the values of one feature are in the range [0, 0.001] and the values of the other are in the range [100000, 200000]. For a model that treats both features as equally important, the weight of the first feature would have to be about 100 million times greater than the weight of the second. Such huge weights can cause serious problems for the model, for example when there are outliers. In addition, estimating the importance of the various features becomes difficult, because a large weight may mean the feature is important, but it may also just mean that its values are small.


After normalization, the values of all features are in the same range, usually [0, 1] or [-1, 1]. In this case, the weights will be in a similar range and correspond closely to the actual importance of each feature.


Using data normalization where needed will yield better and more accurate predictions.
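As a rough illustration, here is a minimal sketch (assuming scikit-learn and two made-up features on very different scales, as in the example above) of scaling the features before fitting a linear model:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(0, 0.001, 500),          # feature in [0, 0.001]
        rng.uniform(100_000, 200_000, 500),  # feature in [100000, 200000]
    ])
    y = X[:, 0] / 0.001 + X[:, 1] / 100_000 + rng.normal(0, 0.1, 500)

    # StandardScaler brings both features into a comparable range, so the
    # fitted weights end up in a similar range as well.
    model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
    print(model.named_steps["linearregression"].coef_)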


2. Thinking that more features are better


One might think it is a good idea to include all available features and let the model automatically select and use the best ones. In reality, this rarely works out well.


The more features a model has, the greater the risk of overfitting. Even in completely random data, the model can find some signals, sometimes weaker and sometimes stronger. Of course, there is no real signal in random noise. But if we have enough noise columns, the model can end up using some of them based on these false signals. When this happens, the quality of the model's predictions drops, because they are partly based on random noise.


There are many techniques to help with feature selection. But you should be able to explain every feature you include and why it will help your model.
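For instance, here is a small sketch (with made-up data) of how a model can latch onto purely random columns, which then show up with non-zero feature importances:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(42)
    real = rng.normal(size=(300, 2))    # two genuinely informative features
    noise = rng.normal(size=(300, 20))  # twenty pure-noise columns
    y = real[:, 0] + 0.5 * real[:, 1] + rng.normal(scale=0.1, size=300)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.hstack([real, noise]), y)

    # The noise columns collectively receive a noticeable share of the
    # importance, even though they carry no real signal.
    print(model.feature_importances_[:2].sum(), model.feature_importances_[2:].sum())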


3. Using tree-based models where extrapolation is needed


Tree-based models are easy to use and powerful, which is why they are popular. However, it may be wrong to use a tree-based model in some cases.


Tree-based models cannot extrapolate: they will never produce a prediction larger than the maximum target value seen in the training data, and they will never produce a prediction smaller than the training minimum.


In some tasks, the ability to extrapolate can be very important. For example, if the model predicts stock prices, future prices may be higher than they have ever been. In this case, tree-based models cannot be used directly, because their predictions could never exceed the highest price in the training history.


There are several solutions to this problem. One is to predict changes or differences instead of predicting values directly. Another is to use a different type of model for such tasks; linear regression or neural networks, for example, can extrapolate.
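A minimal sketch of this behaviour on toy data: a tree-based model caps its predictions near the training maximum, while linear regression continues the trend.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    X_train = np.arange(0, 100).reshape(-1, 1)
    y_train = 2 * X_train.ravel()                  # simple increasing trend

    tree = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)

    X_future = np.array([[150], [200]])            # beyond the training range
    print(tree.predict(X_future))    # stays near the training maximum (~198)
    print(linear.predict(X_future))  # continues the trend (~300, ~400)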


4. Using data normalization where it is not needed


The previous point discussed when data normalization is needed, but that is not always the case. Tree-based models do not require data normalization. Neural networks may not need explicit normalization either, because some networks already contain normalization layers internally, such as the BatchNormalization layer in the Keras library.


In some cases, even linear regression may not require data normalization: this holds when all features are already in a similar range of values and have the same meaning, for example when the model is applied to time series data and all features are historical values of the same parameter.
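As an example of the built-in normalization mentioned above, here is a minimal Keras sketch (assuming TensorFlow is installed; the layer sizes are arbitrary):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.BatchNormalization(),  # normalizes its inputs per batch
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")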


5. Leakage between the training and validation/test sets


It is easier than one might think to cause data leakage. Consider the following code snippet.

[Code snippet: example features for data leakage]
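The original snippet is not reproduced here; the following is a hypothetical sketch of the kind of features it describes, where sum_feature and diff_feature are built from a statistic computed over the full dataset, and the train/test split only happens afterwards:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({"value": range(100), "target": range(100)})

    # The whole-dataset mean mixes information from rows that will end up in
    # the test set into features used for training.
    df["sum_feature"] = df["value"] + df["value"].mean()
    df["diff_feature"] = df["value"] - df["value"].mean()

    train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)  # split happens too late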


Both features (sum_feature and diff_feature) leak information, because they are computed before the split into training and test sets, so the training data contains some information derived from the test data. This leads to higher validation scores, but worse performance when the model is applied to real data.


The correct approach is to split into training and test sets first and only then apply the feature generation functions. In general, it is good feature engineering practice to process the training and test sets separately.


In some cases, some information does need to be passed between the two. For example, we may want to use the same StandardScaler on both the training and test sets; in that case it should be fitted on the training set and then applied to the test set.
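A minimal sketch of that pattern (with a placeholder feature matrix): the scaler is fitted on the training set only and then reused on the test set.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(0).normal(size=(100, 3))  # placeholder feature matrix
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
    X_test_scaled = scaler.transform(X_test)        # the same fitted scaler is reused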


Overall, it is good to learn from your mistakes, and I hope the examples of mistakes provided above will help you avoid them.

