02.09.2020

Ranking quality metrics. Metrics in machine learning problems: which metric to choose for classification


UDC 519.816

S. V. SEMENIKHIN, L. A. DENISOVA

Omsk State Technical University

MACHINE LEARNING METHOD FOR RANKING

BASED ON A MODIFIED GENETIC ALGORITHM FOR THE NDCG METRIC

The paper considers the task of ranking documents on an information retrieval results page and questions of machine learning for ranking. An approach is proposed for optimizing the ranking function using the NDCG quality metric on the basis of a modified genetic algorithm. The developed algorithms have been studied on the LETOR test collections, and their effectiveness for machine learning in ranking has been shown.

Keywords: information retrieval, machine learning for ranking, relevance, optimization, genetic algorithms.

1. Introduction. In modern information retrieval systems (IRS), the volume of data the system operates on is so large that the key task is ranking the relevant documents returned in response to a user's search query. At the current stage of IRS development, machine learning (ML) for ranking is of the greatest interest. Existing ML approaches based on numerical methods (in particular, gradient methods) or on analytical calculations have a number of shortcomings that significantly affect the quality of information retrieval and the time required to rank relevant documents.

At the beginning of the research, listwise approaches to machine learning for ranking were considered, most of which use gradient descent. In these works, ML is reduced to optimizing search quality metrics (QM), but only metrics represented by continuous functions are used. This limitation often means that, as a result of optimization, the ranking function scores poorly on many important accepted indicators (DCG, nDCG, Graded Mean Reciprocal Rank, etc.), which are discrete functions. It has also been proposed to use genetic algorithms (GA) in learning to rank to minimize the Huber loss function, using expert relevance estimates as reference values, as well as an approach to ML based on optimizing discrete information retrieval quality metrics.

2. Statement of the machine learning ranking problem. In most modern information retrieval systems, the ranking function is built from n simple ranking functions (SRF) and can be written as

RF(q, d) = Σ_{i=1}^{n} WC_i · SRF_i(q, d),

where SRF_i is the i-th simple ranking function for document d and query q, WC_i is the weighting coefficient of the i-th simple ranking function, and n is the number of simple ranking functions in the ranking system.
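As an illustration only (the paper does not give code), the combined ranking function above is just a weighted sum of simple ranking function values; a minimal Python sketch, with made-up SRF values and weights, might look like this:

import numpy as np

def rank_score(srf_values, weights):
    # RF(q, d) = sum_i WC_i * SRF_i(q, d)
    return float(np.dot(weights, srf_values))

# Hypothetical example: five simple ranking functions (e.g. BM25, TF-IDF, ...)
# evaluated for one query-document pair, and a weight vector WC to be learned
srf = np.array([0.42, 0.10, 0.77, 0.05, 0.61])
wc = np.array([0.30, 0.10, 0.25, 0.15, 0.20])
print(rank_score(srf, wc))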

In the course of machine learning for ranking, a set of search documents D and queries Q from the LETOR test collection was used. For every query q ∈ Q, a pair is formed with each document d ∈ D. For each such pair, the IRS determines the relevance values used to rank the search results page. In order to evaluate ranking quality, the system needs reference relevance values E for each document-query pair (d, q); expert relevance assessments are used for this purpose.

For the study, an IRS was used in which the ranking is based on N = 5 simple ranking functions SRF_i(WC), i = 1, ..., N, which form the vector optimality criterion

QM(WC) = (SRF_1(WC), ..., SRF_N(WC)),

where WC ∈ {WC} is the vector of variable parameters, and {WC} and {QM} are the spaces of parameters and vector criteria, respectively.

The use of genetic algorithms for ML ranking makes it possible to maximize discrete quality metrics such as nDCG. The nDCG metric for ranking documents in a search engine is determined in accordance with the expression:

DCG@n = Σ_{p=0}^{n−1} (2^{grade(p)} − 1) / log2(2 + p),

where grade(p) is the average relevance score assigned by experts to the document at position p in the results list, and 1/log2(2 + p) is a coefficient depending on the position of the document (the first documents have more weight).

Then the normalized version, NDCG, is written in the form

NDCG@n = DCG@n / r,

where r is a normalization factor equal to the maximum possible value of DCG@n for the given query (i.e., equal to the DCG of the ideal ranking).
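To make the metric concrete, here is a minimal sketch of DCG@n and NDCG@n computed exactly as defined above (positions are numbered from zero, and the relevance grades are invented for illustration):

import math

def dcg_at_n(grades, n):
    # grades[p] is the expert relevance grade of the document at position p
    return sum((2 ** grades[p] - 1) / math.log2(2 + p) for p in range(min(n, len(grades))))

def ndcg_at_n(grades, n):
    # r = DCG of the ideal ranking (grades sorted in descending order)
    r = dcg_at_n(sorted(grades, reverse=True), n)
    return dcg_at_n(grades, n) / r if r > 0 else 0.0

# Hypothetical ranked list: grades of the first five returned documents
print(ndcg_at_n([2, 3, 0, 1, 2], n=5))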

Thus, in order to optimize (maximize) the NDCG metric, the objective function (JM) is written in the following form:

JM(WC) = NDCG@n(WC) → max, WC ∈ {WC}.

3. Quality metrics of search results ranking. When ranking documents in search results, quality metrics act as criteria. From the list of generally accepted metrics for assessing the quality of information retrieval systems, three main ones have been selected that evaluate the accuracy, relevance and completeness of information retrieval.

1. The accuracy (precision) criterion of information retrieval

P = a / (a + b),

where a is the number of relevant documents found and b is the number of documents mistakenly taken for relevant.

2. The Bpref criterion, which evaluates the relevance of information retrieval, is used for a task with R known relevant documents and is calculated by the formula

Bpref = (1/R) · Σ_r (1 − NonRelBefore(r) / R). (4)

Here r denotes a known relevant document, and NonRelBefore(r) is the number of known irrelevant documents ranked higher than r (only the first R judged irrelevant documents from the run are considered in the calculation).

3. The completeness (recall) criterion of search results

r = a / (a + c),

where a is the number of relevant documents found and c is the number of relevant documents not found.
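A rough sketch of the three criteria for a single ranked result list follows; the relevance judgments are invented, and Bpref is implemented in the simplified per-run form of formula (4):

def precision(a, b):
    # a - relevant documents found, b - documents mistakenly taken for relevant
    return a / (a + b)

def recall(a, c):
    # a - relevant documents found, c - relevant documents not found
    return a / (a + c)

def bpref(ranked_judgments, R):
    # ranked_judgments: True (relevant) / False (irrelevant) for judged documents, in ranked order;
    # only the first R judged irrelevant documents are counted
    score, nonrel_before = 0.0, 0
    for is_relevant in ranked_judgments:
        if is_relevant:
            score += 1 - nonrel_before / R
        elif nonrel_before < R:
            nonrel_before += 1
    return score / R

# Hypothetical run with R = 3 known relevant documents
print(precision(a=9, b=1), recall(a=9, c=3), bpref([True, False, True, False, True], R=3))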

4. Test collections. A machine learning ranking task requires a set of documents and queries with corresponding relevance scores determined by experts. These data are used both for machine learning of the ranking function and for assessing the quality with which the system ranks search results. In the ML process, test collections are used as the training set and therefore have a significant impact on the results. The LETOR collection of documents and queries, used in information retrieval research by Microsoft Research, was chosen for this study. Table 1 shows the characteristics of the LETOR test collections.

5. Modified genetic algorithm. To use genetic algorithms in machine learning for ranking, the problem must be formulated so that the solution is encoded as a vector (genotype), where each gene can be a bit, a number, or another object. Here the genotype is the vector of weights of the corresponding ranking factors. The condition for stopping the genetic algorithm is finding an optimal solution, exhausting the number of generations, or exhausting the time allotted for evolution.

It should be noted that GAs are most efficient at finding the region of the global extremum; however, they can be slow when a local minimum must be found within this region. The proposed way to avoid this shortcoming is to create a modified genetic algorithm (MGA) that switches to a local (fast) optimization algorithm after the region of the global optimum has been found by the base GA. The MGA proposed in the paper is a hybrid method based on the classical GA and the Nelder-Mead method (simplex algorithm). The Nelder-Mead method, a commonly used nonlinear optimization algorithm, is a numerical method for finding the minimum of an objective function in a multidimensional space. The hybrid MGA switches to the Nelder-Mead method after the stopping conditions of the GA are met. The block diagram of the MGA is shown in Fig. 1.

In the research, a limit was imposed on the number of objective function evaluations (Nrf = 16,000) when searching for the region of the global extremum, together with a condition for switching to the local optimization algorithm based on the Nelder-Mead method (after the basic genetic algorithm has used 75% of the Nrf evaluations).
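The authors' LTR-MGA implementation is not reproduced here, but the overall scheme described above - a basic GA for the global search stage, followed by a switch to the Nelder-Mead method for local refinement once 75% of the evaluation budget is spent - can be sketched as follows. The fitness function, population size and GA operators below are simplified placeholders, not the settings from the paper (in the paper the objective would be the negative NDCG of the ranking produced by the weight vector):

import numpy as np
from scipy.optimize import minimize

def fitness(wc):
    # Placeholder objective to be minimized; in the paper this would be -NDCG@n
    # of the ranking produced by the weight vector wc on the training queries.
    return float(np.sum((wc - 0.3) ** 2))

def hybrid_mga(n_weights=5, pop_size=40, budget=16000, switch_share=0.75, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_weights))
    evals = 0
    # Stage 1: basic genetic algorithm until 75% of the evaluation budget is used
    while evals < switch_share * budget:
        fit = np.array([fitness(ind) for ind in pop])
        evals += pop_size
        parents = pop[np.argsort(fit)[: pop_size // 2]]                          # truncation selection
        idx = rng.integers(0, len(parents), (pop_size, 2))
        alpha = rng.random((pop_size, 1))
        pop = alpha * parents[idx[:, 0]] + (1 - alpha) * parents[idx[:, 1]]      # arithmetic crossover
        pop += rng.normal(0.0, 0.05, pop.shape)                                  # mutation
    best = pop[np.argmin([fitness(ind) for ind in pop])]
    # Stage 2: local refinement with the Nelder-Mead simplex method
    result = minimize(fitness, best, method="Nelder-Mead",
                      options={"maxfev": int((1 - switch_share) * budget)})
    return result.x

print(hybrid_mga())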

6. Results. As a result of the research carried out with the developed LTR-MGA machine learning ranking algorithm, a vector of weight coefficients WC* was obtained for the ranking function.

Table 1. Number of documents and queries in the test collections

Test collection | Subsystem name | Number of queries | Number of documents
LETOR 4.0       | MQ2007         | 1692              | 69623
LETOR 4.0       | MQ2008         | 784               | 15211
LETOR 3.0       | OHSUMED        | 106               | 16140
LETOR 3.0       | Gov03td        | 50                | 49058
LETOR 3.0       | Gov03np        | 150               | 148657
LETOR 3.0       | Gov03hp        | 150               | 147606
LETOR 3.0       | Gov04td        | 75                | 74146
LETOR 3.0       | Gov04np        | 75                | 73834
LETOR 3.0       | Gov04hp        | 75                | 74409

Fig. 1. Block diagram of the hybrid machine learning ranking algorithm based on genetic algorithms and the Nelder-Mead method

Next, on the basis of the LETOR test collection data, the ranking quality was assessed, for which the quality metrics were calculated. The discrete ranking quality metric NDCG@n evaluates the quality of the first n documents of the system's response. The commonly reported values are NDCG@1, NDCG@5 and NDCG@10; however, to examine how the metric changes in more detail, NDCG@n was considered for all n from 1 to 10. To compare the efficiency of the developed algorithm with existing solutions, a comparative analysis was carried out using the ranking algorithms supplied with the LETOR 3.0 collections. The results of running the algorithms on the TD2003 and TD2004 test collections for the NDCG metric are shown in Fig. 2. The results show that the LTR-MGA algorithm outperforms the baseline algorithms, with the largest gain at

NDCG@1 (at the level of the first document). The superiority of the LTR-MGA algorithm is explained by the fact that, unlike the baseline ranking functions considered in the experiments, the proposed approach uses the NDCG metric itself as the objective function when optimizing the ranking function.

To evaluate the ranking quality obtained with the proposed LTR-MGA algorithm, the values of the quality metrics for ranking documents in search results were calculated (Fig. 3). A comparison of the ranking results (Table 2) using the basic ranking function, the basic LTR-GA algorithm, and the modified LTR-MGA algorithm indicates the advantage of the latter.

In addition, the study estimated the time required for machine learning of the ranking. This is necessary to confirm that the proposed LTR-MGA method is superior in this respect to the approach based on the traditional genetic algorithm (LTR-GA).

Fig. 2. Comparison of machine learning ranking algorithms by the NDCG metric on the test collections: left - the Gov03td dataset, right - the Gov04td dataset

Fig. 3. Evaluation of the ranking quality metrics for the basic ranking formula and the learning algorithms LTR-GA and LTR-MGA

Table 2. Ranking quality metrics for different machine learning ranking algorithms

Ranking quality metric        | Basic ranking function | LTR-GA | LTR-MGA | Metric increase, %
Accuracy (precision)          | 0.201                  | 0.251  | 0.267   | 26.81
NDCG@5 (first 5 documents)    | 0.149                  | 0.31   | 0.339   | 90.47
NDCG@10 (first 10 documents)  | 0.265                  | 0.342  | 0.362   | 29.14
Bpref                         | 0.303                  | 0.316  | 0.446   | 51.49
Completeness (recall)         | 0.524                  | 0.542  | 0.732   | 39.03

* The best values for the corresponding metric are highlighted in gray.

The results of comparing the execution times of the LTR-GA and LTR-MGA algorithms are given in Table 3.

7. Conclusion. The studies carried out have shown that with the proposed approach the values of the considered ranking metrics in the IRS increase (on average by 19.55% compared with the LTR-GA algorithm). This confirms that LTR-MGA works correctly and significantly improves the ranking function; in other words, it successfully solves the optimization problem. Thanks to the use of the local optimization method and the introduced restriction on the number of objective function evaluations, the modified algorithm also reduced the machine learning time (on average by 17.71% compared with the traditional genetic algorithm LTR-GA).

The developed machine learning ranking algorithm LTR-MGA can be used in IRSs that employ a ranking model based on a combination of simple ranking functions. However, some limitations of the proposed approach should be taken into account.

Table 3. Estimate of the machine learning execution time for ranking depending on the training sample size

Document collection size | LTR-GA runtime | LTR-MGA runtime | Decrease in execution time, %
Mean                     |                |                | 17.71

* The best values for the respective test collection size are highlighted in grey.

Based on the results obtained, it was revealed that after ML the greatest increase is observed in the ranking quality metric whose value was taken as the objective function. At the same time, the other metrics may not improve significantly, and in some cases their values even worsen. As one possible way to eliminate this shortcoming, it is proposed to solve the optimization problem as a multi-objective one: to improve several main ranking metrics of the search results uniformly instead of optimizing a single one. In further research, it is also planned to develop a methodology for constructing an objective function based on a linear convolution of the main ranking quality metrics in order to improve the information retrieval process.

Bibliographic list

1. Tie-Yan Liu. Learning to Rank for Information Retrieval // Foundations and Trends in Information Retrieval. 2009. Vol. 3, Issue 3. P. 225-331.

2. Burges C. J. C., Shaked T., Renshaw E. Learning to Rank using Gradient Descent // Proceedings of the 22nd International Conference on Machine Learning (ICML '05). 2005. P. 89-96.

3. Semenikhin S. V. Study of approaches to machine learning for ranking documents by a search engine based on genetic algorithms // Young Russia: Advanced Technologies for Industry. 2013. No. 2. P. 82-85.

4. Denisova L. A. Multicriteria optimization based on genetic algorithms in the synthesis of control systems: monograph. Omsk: OmGTU Publishing House, 2014. 170 p. ISBN 978-5-8149-1822-2.

5. Denisova L. A., Meshcheryakov V. A. Automation of the parametric synthesis of a control system using a genetic algorithm // Automation in Industry. 2012. No. 7. P. 34-38.

6. Huber P. J. Robust Estimation of a Location Parameter // Annals of Mathematical Statistics. 1964. Vol. 35, No. 1. P. 73-101.

7. Semenikhin S. V., Denisova L. A. Automation of information retrieval based on multicriteria optimization and genetic algorithms // Dynamics of Systems, Mechanisms and Machines. 2014. No. 3. P. 224-227.

8. Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, Hang Li. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval // SIGIR 2007 Workshop on Learning to Rank for Information Retrieval. 2007. P. 3-10.

9. Ageev M. S., Kuralenok I. E. Official metrics of ROMIP'2004 // Proceedings of the II Russian Seminar on Evaluation of Information Retrieval Methods (ROMIP 2004), Pushchino, 2004 / ed. I. S. Nekrestyanov. St. Petersburg: Research Institute of Chemistry, St. Petersburg State University. P. 142-150.

10. Nelder J. A., Mead R. A Simplex Method for Function Minimization // The Computer Journal. 1965. Vol. 7. P. 308-313.

SEMENIKHIN Svyatoslav Vitalievich, postgraduate student of the Department of Automated Information Processing and Control Systems.
DENISOVA Lyudmila Albertovna, Doctor of Technical Sciences, Associate Professor of the Department of Automated Information Processing and Control Systems.

In learning to rank, the training set consists of lists of items with a partial order defined on the elements within each list. The partial order is usually specified by assigning a score to each item (e.g., "relevant" or "not relevant"; more than two gradations are possible). The goal of a ranking model is to approximate and generalize, in some sense as well as possible, the way of ranking in the training set to new data.

Learning to rank is still a fairly young, rapidly developing field of research that arose in the 2000s, when the information retrieval community became interested in applying machine learning methods to ranking problems.

During the training of the ranking model and during its operation, each document-query pair is translated into a numerical vector of ranking features (also called ranking factors or signals) that characterize the properties of the document, the query, and their relationship. These features can be divided into three groups:

The following are some examples of ranking features used in the well-known LETOR dataset (a toy sketch of computing the first group of features follows the list):

• Values of the TF, TF-IDF, BM25 and language-model measures of the query's match against various document zones (title, URL, body text, anchor text);
• Lengths and IDF sums of the document zones;
• Document ranks obtained by various variants of link analysis algorithms such as PageRank and HITS.
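As a toy illustration of the first two groups of features, here is how zone-level TF, TF-IDF and length features could be computed for a query-document pair; the mini-corpus, the zones and the smoothing are invented for the sketch and are not the actual LETOR feature extractor:

import math
from collections import Counter

# Hypothetical mini-corpus: each document split into zones (title, body)
docs = [
    {"title": "learning to rank", "body": "machine learning methods for ranking documents"},
    {"title": "genetic algorithms", "body": "optimization with genetic algorithms"},
]
query = ["learning", "ranking"]

def zone_features(doc, zone, query, docs):
    terms = doc[zone].split()
    tf = Counter(terms)
    feats = {}
    for term in query:
        df = sum(term in d[zone].split() for d in docs)      # document frequency in this zone
        idf = math.log((len(docs) + 1) / (df + 1))           # smoothed IDF
        feats["tf(%s,%s)" % (term, zone)] = tf[term]
        feats["tfidf(%s,%s)" % (term, zone)] = tf[term] * idf
    feats["len(%s)" % zone] = len(terms)                     # zone length feature
    return feats

print(zone_features(docs[0], "body", query, docs))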

    Ranking quality metrics

There are several metrics that evaluate and compare the performance of ranking algorithms on a sample with expert relevance judgments. The parameters of the ranking model are usually tuned so as to maximize the value of one of these metrics.

    Examples of metrics:

    Classification of algorithms

In his article "Learning to Rank for Information Retrieval" and in talks at thematic conferences, Tie-Yan Liu of Microsoft Research Asia analyzed the methods available at the time for solving the learning-to-rank problem and proposed classifying them into three approaches, depending on the input data representation and the loss function used:

    Pointwise approach

    Notes

1. Tie-Yan Liu (2009). Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval: Vol. 3: No. 3, pp. 225-331. ISBN 978-1-60198-244-5, DOI 10.1561/1500000016. Slides from T.-Y. Liu's talk at the WWW 2009 conference are available.

    Hey Habr!

    In machine learning tasks, metrics are used to assess the quality of models and compare different algorithms, and their selection and analysis is an indispensable part of the work of a data scientist.

    In this article, we will look at some quality criteria in classification problems, discuss what is important when choosing a metric and what can go wrong.

    Metrics in classification problems

To demonstrate some useful sklearn features and to give a visual representation of the metrics, we will use our telecom operator's customer churn dataset, which we met in the first article of the course.

Load the necessary libraries and look at the data

import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# confusion_matrix and roc_curve are needed by the snippets further below
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             classification_report, roc_curve)

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)

    Data preprocessing

# Map the binary columns
# and dummy-encode the state (for simplicity, it is better not to do this for tree-based models)
d = {"Yes": 1, "No": 0}
df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])

ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
tmp = pd.DataFrame(encoded_state,
                   columns=["state " + str(i) for i in range(encoded_state.shape[1])])
df = pd.concat([df, tmp], axis=1)

    Accuracy, precision and recall

Before moving on to the metrics themselves, we need to introduce an important concept for describing these metrics in terms of classification errors: the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts the class of each object; then the classification error matrix looks like this:

              y = 1                  y = 0
ŷ = 1    True Positive (TP)     False Positive (FP)
ŷ = 0    False Negative (FN)    True Negative (TN)

Here ŷ is the algorithm's response on an object and y is the true class label of that object.
Thus, there are two types of classification errors: False Negative (FN) and False Positive (FP).

    Algorithm training and error matrix construction

X = df.drop("Churn", axis=1)
y = df["Churn"]

# Divide the sample into train and test; all metrics will be evaluated on the test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.33, random_state=42)

# Train a basic logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)


# Function for plotting the error matrix, taken from the sklearn documentation
def plot_confusion_matrix(cm, classes, normalize=False,
                          title="Confusion matrix", cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


font = {"size": 15}
plt.rc("font", **font)

cnf_matrix = confusion_matrix(y_test, lr.predict(X_test))
plt.figure(figsize=(10, 8))
plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"],
                      title="Confusion matrix")
plt.savefig("conf_matrix.png")
plt.show()

    Accuracy

An intuitive, obvious, yet almost unused metric is accuracy, the share of correct answers of the algorithm:

accuracy = (TP + TN) / (TP + TN + FP + FN).

This metric is useless in problems with unequal classes, which is easy to show with an example.

Let's say we want to evaluate the performance of an email spam filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy:

accuracy = (5 + 90) / (5 + 90 + 10 + 5) ≈ 0.864.

However, if we simply predict all emails as non-spam, we get a higher accuracy:

accuracy = (0 + 100) / (0 + 100 + 0 + 10) ≈ 0.909.

    At the same time, our model does not have any predictive power at all, since we initially wanted to identify spam emails. The transition from a common metric for all classes to individual indicators of class quality will help us overcome this.
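The same example in code, using sklearn's accuracy_score on labels constructed to reproduce the numbers above:

from sklearn.metrics import accuracy_score
import numpy as np

# 100 non-spam (0) and 10 spam (1) emails
y_true = np.array([0] * 100 + [1] * 10)
# the classifier from the example: TN = 90, FP = 10, TP = 5, FN = 5
y_pred = np.array([0] * 90 + [1] * 10 + [1] * 5 + [0] * 5)
print(accuracy_score(y_true, y_pred))                       # ~0.864
# the degenerate classifier: everything is non-spam
print(accuracy_score(y_true, np.zeros(110, dtype=int)))     # ~0.909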

    Precision, recall and F-measure

To assess the quality of the algorithm on each class separately, we introduce the metrics precision and recall:

precision = TP / (TP + FP),    recall = TP / (TP + FN).

Precision can be interpreted as the share of objects called positive by the classifier that are actually positive, while recall shows what share of all objects of the positive class the algorithm found.

    It is the introduction of precision that does not allow us to write all objects into one class, since in this case we get an increase in the False Positive level. Recall demonstrates the algorithm's ability to detect a given class at all, while precision demonstrates the ability to distinguish this class from other classes.

As noted earlier, there are two kinds of classification errors: False Positive and False Negative. In statistics, the first is called a type I error and the second a type II error. In our subscriber churn task, a type I error is mistaking a loyal subscriber for a churning one (our null hypothesis is that none of the subscribers churn, and we wrongly reject this hypothesis). Accordingly, a type II error is "missing" a churning subscriber and erroneously accepting the null hypothesis.

    Precision and recall do not depend, unlike accuracy, on the ratio of classes and therefore are applicable in conditions of unbalanced samples.
    Often in real practice, the task is to find the optimal (for the customer) balance between these two metrics. A classic example is the problem of determining the outflow of customers.
    It is clear that we cannot find all churning customers and only them. But having determined the strategy and resource for customer retention, we can select the necessary thresholds for precision and recall. For example, we can focus on retaining only high-margin customers or those who are more likely to churn, since we are limited by call center resources.

Usually, when optimizing an algorithm's hyperparameters (for example, when iterating over a grid with GridSearchCV), a single metric is used, whose improvement we expect to see on the test sample.
There are several different ways to combine precision and recall into an aggregate quality criterion. The F-measure (in the general case F_beta) is the harmonic mean of precision and recall:

F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall).

Here beta determines the weight of precision in the metric; for beta = 1 this is the ordinary harmonic mean (with a factor of 2, so that F = 1 when precision = 1 and recall = 1).
    The F-measure reaches its maximum at recall and precision equal to one and is close to zero if one of the arguments is close to zero.
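sklearn also exposes these metrics directly via precision_score, recall_score and fbeta_score; a short sketch on the churn model, assuming the lr, X_test and y_test objects defined in the snippets above:

from sklearn.metrics import precision_score, recall_score, fbeta_score

y_pred = lr.predict(X_test)
print(precision_score(y_test, y_pred))        # precision for the positive (churned) class
print(recall_score(y_test, y_pred))           # recall for the positive class
print(fbeta_score(y_test, y_pred, beta=1))    # F1 - the plain harmonic mean
print(fbeta_score(y_test, y_pred, beta=2))    # F2 - recall weighted more heavily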
sklearn has a handy classification_report function that returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test),
                               target_names=["Non-churned", "Churned"])
print(report)

             precision    recall  f1-score   support
Non-churned       0.88      0.97      0.93       941
Churned           0.60      0.25      0.35       159
avg / total       0.84      0.87      0.84      1100

It should be noted that in tasks with unbalanced classes, which prevail in real practice, it is often necessary to resort to techniques of artificial dataset modification to equalize the class ratio. There are many of them, and we will not touch on them here; you can look at some methods and choose the one that suits your task.

    AUC-ROC and AUC-PR

When converting the real-valued output of the algorithm (usually the probability of belonging to a class; see SVM separately) into a binary label, we must choose some threshold at which 0 becomes 1. A threshold of 0.5 seems natural and close at hand, but it does not always turn out to be optimal, for example, in the aforementioned absence of class balance.

One way to evaluate the model as a whole, without tying it to a specific threshold, is AUC-ROC (or ROC AUC), the Area Under the Receiver Operating Characteristic curve. This curve is a line from (0,0) to (1,1) in True Positive Rate (TPR) and False Positive Rate (FPR) coordinates:

TPR = TP / (TP + FN),    FPR = FP / (FP + TN).

We already know TPR: it is recall. FPR shows what share of objects of the negative class the algorithm predicted incorrectly. In the ideal case, when the classifier makes no errors (FPR = 0, TPR = 1), the area under the curve equals one; when the classifier outputs random class probabilities, AUC-ROC tends to 0.5, since the classifier produces the same share of TP and FP.
Each point on the plot corresponds to the choice of some threshold. The area under the curve shows the quality of the algorithm (more is better); in addition, the steepness of the curve itself matters - we want to maximize TPR while minimizing FPR, so the curve should ideally tend towards the point (0,1).

    ROC-curve drawing code

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])           # diagonal of the random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.savefig("ROC.png")
plt.show()

The AUC-ROC criterion is robust to unbalanced classes (spoiler: alas, not everything is so simple) and can be interpreted as the probability that a randomly chosen positive object will be ranked higher by the classifier (will receive a higher probability of being positive) than a randomly chosen negative object.
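Computing the area itself takes a single call (again assuming the lr model and the test split from the earlier snippets):

from sklearn.metrics import roc_auc_score

# probability of the positive (churned) class for every test object
proba = lr.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))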

Consider the following problem: we need to select 100 relevant documents out of 1 million documents. We have trained two algorithms:

• Algorithm 1 returns 100 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 10/999,900 ≈ 0.00001.
• Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 1910/999,900 ≈ 0.00191.

Most likely, we would choose the first algorithm, which produces very few False Positives compared with its competitor. But the difference in False Positive Rate between these two algorithms is extremely small - only about 0.0019. This is a consequence of the fact that AUC-ROC measures the share of False Positives relative to True Negatives, and in tasks where the second (larger) class is not as important to us, it may not give an entirely adequate picture when comparing algorithms.

In order to correct the situation, let's return to completeness and precision:

• Algorithm 1: precision = 90/100 = 0.9, recall = 90/100 = 0.9.
• Algorithm 2: precision = 90/2000 = 0.045, recall = 90/100 = 0.9.

Here there is already a significant difference between the two algorithms - 0.855 in precision!

Precision and recall are also used to plot a curve and, similarly to AUC-ROC, to find the area under it.

    It can be noted here that on small datasets, the area under the PR curve can be overly optimistic, because it is calculated using the trapezoid method, but usually there is enough data in such tasks. For details on the relationship between AUC-ROC and AUC-PR, see here.
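A minimal sketch of the PR curve values and the area under it, again on the churn model from the snippets above (average_precision_score is the step-wise alternative to the trapezoidal area):

from sklearn.metrics import precision_recall_curve, auc, average_precision_score

proba = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
print(auc(recall, precision))                  # trapezoidal AUC-PR
print(average_precision_score(y_test, proba))  # average precision (step-wise AUC-PR)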

    Logistic Loss

Standing apart is the logistic loss function, defined as

logloss = -(1/N) * Σ_{i=1}^{N} ( y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ),

where ŷ_i is the algorithm's response on the i-th object, y_i is the true class label of the i-th object, and N is the sample size.

More about the mathematical interpretation of the logistic loss function has already been written in the post about linear models.
This metric rarely appears in business requirements, but it often comes up in kaggle competitions.
Intuitively, minimizing logloss can be thought of as maximizing accuracy by penalizing wrong predictions. However, it should be noted that logloss heavily penalizes the classifier's confidence in a wrong answer.

    Consider an example:

def logloss_crutch(y_true, y_pred, eps=1e-15):
    # clip the predicted probability away from 0 and 1 so the logarithm stays finite
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


print("Logloss with uncertain classification %f" % logloss_crutch(1, 0.5))
# >> Logloss with uncertain classification 0.693147

print("Logloss with confident classification and correct answer %f" % logloss_crutch(1, 0.9))
# >> Logloss with confident classification and correct answer 0.105361

print("Logloss with confident classification and wrong answer %f" % logloss_crutch(1, 0.1))
# >> Logloss with confident classification and wrong answer 2.302585

Note how dramatically logloss grows with a confident but wrong answer!
Hence, the error on a single object can noticeably worsen the overall error on the sample. Such objects are often outliers, which must be remembered to filter out or to consider separately.
Everything falls into place if you draw a logloss plot:

    It can be seen that the closer to zero the answer of the algorithm for ground truth = 1, the higher the error value and the steeper the curve grows.
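The same metric on the whole test sample via sklearn, assuming the lr, X_test and y_test objects from the earlier snippets:

from sklearn.metrics import log_loss

print(log_loss(y_test, lr.predict_proba(X_test)))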

    Summing up:

• In the case of multi-class classification, carefully monitor the metrics of each class and follow the logic of solving the problem rather than blindly optimizing a metric
• In the case of unequal classes, select a class balance for training and a metric that correctly reflects classification quality
• The choice of metric should be made with a focus on the subject area, preprocessing the data and, possibly, segmenting it (as in the case of dividing customers into rich and poor)

Useful links

1. Course by Evgeny Sokolov: Seminar on model selection (there is information on metrics for regression problems)
2. Problems on AUC-ROC by A. G. Dyakonov
3. You can read more about other metrics on kaggle. A link to the competition where each metric was used has been added to its description
4. Presentation by Bogdan Melnyk aka ld86 about learning on unbalanced samples
