Learning to rank and evaluating machine learning models: quality metrics

In machine learning, metrics are used to assess model quality and to compare different algorithms, and their selection and analysis is an integral part of a data scientist's work.

In this article we look at various quality criteria for classification problems, discuss what matters when choosing a metric, and what can go wrong.

Metrics for classification

To demonstrate sklearn's helper functions and to give a visual presentation of the metrics, we will use the telecom operator customer churn dataset that we met in the first article of the course.

Import the necessary libraries and look at the data

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pylab import rc, plot
import seaborn as sns
from sklearn.metrics import precision_recall_curve, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)


Preprocessing the data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Map the binary columns
# and one-hot encode the State column (for simplicity; better not to do this for tree-based models)
d = {"Yes": 1, "No": 0}
df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])

ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
# the original snippet is truncated here; a plausible continuation:
tmp = pd.DataFrame(encoded_state)
df = pd.concat([df, tmp], axis=1)

Accuracy, precision and recall

Before moving on to the metrics themselves, we need to introduce an important concept used to describe these metrics in terms of classification errors: the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts, for each object, membership in one of the classes. Then the confusion matrix looks like this:

              y = 1                  y = 0
ŷ = 1   True Positive (TP)    False Positive (FP)
ŷ = 0   False Negative (FN)   True Negative (TN)

Here ŷ is the algorithm's answer on an object, and y is the true class label of that object.
Thus, there are two types of classification errors: False Negative (FN) and False Positive (FP).

Training the algorithm and plotting the confusion matrix

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X = df.drop("Churn", axis=1)
y = df["Churn"]

# Split the sample into train and test; all metrics are evaluated on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

# Train a naive logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)

# Function for drawing the confusion matrix
def plot_confusion_matrix(cm, classes, normalize=False, title="Confusion matrix", cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")
    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

font = {"size": 15}
plt.rc("font", **font)

cnf_matrix = confusion_matrix(y_test, lr.predict(X_test))
plt.figure(figsize=(10, 8))
plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"], title="Confusion matrix")
plt.savefig("conf_matrix.png")
plt.show()


Accuracy

An intuitive, obvious and almost never used metric is accuracy — the fraction of correct answers of the algorithm:

accuracy = (TP + TN) / (TP + TN + FP + FN)

This metric is useless in problems with unbalanced classes, and this is easy to show with an example.

Suppose we want to evaluate the quality of a spam filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy:

accuracy = (5 + 90) / (5 + 90 + 10 + 5) ≈ 86.4%

However, if we simply predict all emails as non-spam, we get a higher accuracy:

accuracy = (0 + 100) / (0 + 100 + 0 + 10) ≈ 90.9%

At the same time, our model has absolutely no predictive power, because we originally wanted to identify the spam emails. Moving from a metric common to all classes to separate indicators of per-class quality will help us overcome this.
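As a quick sanity check, here is a minimal sketch (with hard-coded labels for this hypothetical spam example, not taken from the article's dataset) that reproduces both accuracy values:

import numpy as np
from sklearn.metrics import accuracy_score

# 100 non-spam emails (label 0) followed by 10 spam emails (label 1)
y_true = np.array([0] * 100 + [1] * 10)

# Classifier A: 90 non-spam and 5 spam emails classified correctly
y_pred_a = np.array([0] * 90 + [1] * 10 + [1] * 5 + [0] * 5)

# Classifier B: everything predicted as non-spam
y_pred_b = np.zeros(110, dtype=int)

print(accuracy_score(y_true, y_pred_a))   # ~0.864
print(accuracy_score(y_true, y_pred_b))   # ~0.909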

Precision, recall and F-measure

To assess the quality of the algorithm on each class separately, we introduce the metrics precision and recall:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

Precision can be interpreted as the fraction of objects labeled positive by the classifier that are actually positive, while recall shows what fraction of all objects of the positive class the algorithm managed to find.

It is precision that prevents us from assigning all objects to a single class, because doing so inflates the number of False Positives. Recall demonstrates the algorithm's ability to detect the given class at all, while precision demonstrates its ability to distinguish this class from the others.
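As a rough sketch of how these metrics are obtained in code (assuming the lr model and the train/test split defined above; the variable names are those used earlier):

from sklearn.metrics import precision_score, recall_score

y_pred = lr.predict(X_test)
print("Precision: %.3f" % precision_score(y_test, y_pred))
print("Recall:    %.3f" % recall_score(y_test, y_pred))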

As we noted earlier, there are two types of classification errors: False Positive and False Negative. In statistics, the first is called a Type I error and the second a Type II error. In our task of predicting subscriber churn, a Type I error is mistaking a loyal subscriber for one who is leaving: our null hypothesis is that none of the subscribers is leaving, and here we wrongly reject it. Accordingly, a Type II error is "missing" a subscriber who is actually leaving, i.e. wrongly accepting the null hypothesis.

Unlike accuracy, precision and recall do not depend on the class ratio and are therefore applicable under unbalanced samples.
In real practice, the task is often to find the optimal (for the customer) balance between these two metrics. A classic example is the customer churn problem.
Obviously, we cannot find all churning clients and only them. But, having determined a strategy and a budget for customer retention, we can select suitable thresholds for precision and recall. For example, we can focus on retaining only high-margin clients, or on those most likely to churn, since we are limited by call-center resources.

Usually, when optimizing an algorithm's hyperparameters (for example, when iterating over a grid with GridSearchCV), a single metric is used, whose improvement we then expect to see on the test sample.
There are several ways to combine precision and recall into an aggregate quality criterion. The F-measure (in general, F_β) is the harmonic mean of precision and recall:

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

Here β determines the weight of precision in the metric; with β = 1 this is the harmonic mean (with a factor of 2, so that precision = 1 and recall = 1 give F_1 = 1).
The F-measure reaches its maximum when recall and precision both equal one, and is close to zero if one of the arguments is close to zero.
sklearn has a handy function, metrics.classification_report, which returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test), target_names=["Non-churned", "Churned"])
print(report)

              precision    recall  f1-score   support
 Non-churned       0.88      0.97      0.93       941
     Churned       0.60      0.25      0.35       159
   avg/total       0.84      0.87      0.84      1100

It should be noted here that in the case of unbalanced classes, which prevail in real practice, it is often necessary to resort to techniques of artificial modification of the dataset to equalize the class ratio. There are many of them and we will not dwell on them; you can look at several methods and choose the one that suits your task.
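As one hedged illustration (not part of the original article): many sklearn estimators can re-weight classes instead of resampling the dataset, e.g.:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# class_weight="balanced" re-weights objects inversely to class frequencies,
# one of the simplest ways to deal with class imbalance
lr_balanced = LogisticRegression(random_state=42, class_weight="balanced")
lr_balanced.fit(X_train, y_train)
print(classification_report(y_test, lr_balanced.predict(X_test),
                            target_names=["Non-churned", "Churned"]))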

AUC-ROC and AUC-PR

When converting the algorithm's real-valued output (usually a class-membership probability, or the distance to the separating hyperplane in the case of SVM) into a binary label, a threshold has to be chosen at which 0 becomes 1. The natural and familiar choice is a threshold of 0.5, but it is by no means always optimal, for example under the class imbalance mentioned above.

One way to evaluate the model as a whole, without tying it to a specific threshold, is AUC-ROC (or ROC AUC) — the Area Under the receiver operating characteristic Curve. This curve is a line from (0,0) to (1,1) in True Positive Rate (TPR) and False Positive Rate (FPR) coordinates:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

TPR is already familiar to us — it is recall — and FPR shows what fraction of objects of the negative class the algorithm predicted incorrectly. In the ideal case, when the classifier makes no errors (FPR = 0, TPR = 1), the area under the curve equals one; if the classifier outputs probabilities at random, AUC-ROC tends to 0.5, since the classifier produces TPs and FPs in equal proportion.
Each point on the plot corresponds to the choice of some threshold. The area under the curve indicates the quality of the algorithm (larger is better); in addition, the steepness of the curve itself matters — we want to maximize TPR while minimizing FPR, which means the curve should ideally approach the point (0,1).

Code for drawing the ROC curve

from sklearn.metrics import roc_curve

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])   # the original arguments were lost; the diagonal is the usual baseline
plt.xlim([0.0, 1.0])       # axis limits restored with standard values
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.savefig("ROC.png")
plt.show()


The AUC-ROC criterion is robust to unbalanced classes (spoiler: alas, not everything is so simple) and can be interpreted as the probability that a randomly chosen positive object will be ranked higher by the classifier (will receive a higher predicted probability of being positive) than a randomly chosen negative object.

Consider the following problem: we need to select 100 relevant documents out of 1 million documents. We have machine-learned two algorithms:

  • Algorithm 1 returns 100 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 10/999 900 ≈ 0.00001.
  • Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 1910/999 900 ≈ 0.0019.

Most likely, we would choose the first algorithm, which produces very few False Positives compared with its competitor. But the difference in False Positive Rate between the two algorithms is extremely small — only 0.0019. This is because AUC-ROC measures the fraction of False Positives relative to True Negatives, and in problems where the second (larger) class matters much less to us it may not give an entirely adequate picture when comparing algorithms.

To correct the situation, let us return to completeness and precision:

  • Algorithm 1: precision = 90/100 = 0.9, recall = 0.9
  • Algorithm 2: precision = 90/2000 = 0.045, recall = 0.9

Here a significant difference between the two algorithms is already noticeable — 0.855 in precision!

Precision and recall are also used to build a curve and, by analogy with AUC-ROC, to find the area under it.


It can be noted here that on small datasets the area under the PR curve may be overly optimistic if it is calculated by the trapezoid method, although in such tasks there is usually enough data. For details on the relationship between AUC-ROC and AUC-PR, see the link here.
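A minimal sketch of computing the area under the PR curve (again assuming lr, X_test and y_test from above); average_precision_score gives the step-wise, non-trapezoidal estimate:

from sklearn.metrics import precision_recall_curve, average_precision_score, auc

probs = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

print("AUC-PR (trapezoidal):     %.3f" % auc(recall, precision))
print("Average precision (step): %.3f" % average_precision_score(y_test, probs))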

Logistic Loss

The logistic loss function is defined as:

logloss = −(1/ℓ) · Σ_{i=1..ℓ} ( y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) )

Here ŷ_i is the algorithm's predicted probability on the i-th object, y_i is the true class label on that object, and ℓ is the sample size.

The mathematical interpretation of the logistic loss function has already been described in detail in the post about linear models.
This metric rarely appears in business requirements, but often in Kaggle competitions.
Intuitively, minimizing logloss can be thought of as maximizing accuracy by penalizing wrong predictions. However, it must be noted that logloss penalizes a classifier's confidence in a wrong answer extremely heavily.

Let's look at an example:

import numpy as np

def logloss_crutch(y_true, y_pred, eps=1e-15):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print("Logloss with an uncertain classification %f" % logloss_crutch(1, 0.5))
# >> Logloss with an uncertain classification 0.693147
print("Logloss with a confident classification and a correct answer %f" % logloss_crutch(1, 0.9))
# >> Logloss with a confident classification and a correct answer 0.105361
print("Logloss with a confident classification and a WRONG answer %f" % logloss_crutch(1, 0.1))
# >> Logloss with a confident classification and a WRONG answer 2.302585

Note how dramatically logloss grows for a confident but wrong answer!
Consequently, an error on a single object can noticeably degrade the overall loss on the sample. Such objects are often outliers, which must not be forgotten: filter them out or examine them separately.
Everything falls into place if you draw a logloss plot:


It can be seen that the closer the algorithm's answer gets to zero when the ground truth is 1, the larger the loss value and the steeper the curve grows.

To sum up:

  • In multi-class classification, you need to monitor the metrics of each class separately and follow the logic of the business problem rather than optimize the metric;
  • In the case of unequal classes, you need to select a class balance for training and a metric that correctly reflects the quality of the classification.

Thanks to mephistopheies and madrugado for their help in preparing this article.

While preparing a problem for the entrance test to the GoTo summer school, we discovered that there is practically no clear description of the main ranking metrics in Russian (the problem concerned a ranking task — building a recommendation algorithm). We at E-Contenta actively use various ranking metrics, so we decided to fix this misunderstanding by writing this article.

Ranking comes up everywhere: sorting web pages for a given search query, personalizing a news feed, recommending videos, products, music... In a word, the topic is hot. There is even a special branch of machine learning that studies ranking algorithms capable of self-learning — learning to rank. To choose the best one from the whole variety of algorithms and approaches, you need to be able to evaluate their quality quantitatively. The most common ranking quality metrics are discussed below.

Briefly about the ranking problem

Ranking is the problem of sorting a set of elements by their relevance. Most often, relevance is understood with respect to some object. For example, in information retrieval the object is a search query, the elements are all kinds of documents (links to them), and relevance is how well a document matches the query; in recommendations the object is a user, the elements are pieces of recommended content (products, videos, music), and relevance is the probability that the user will make use of (buy / like / view) this content.

Formally, consider N objects and M elements. The result of running a ranking algorithm for an object is a mapping that assigns each element a score characterizing its degree of relevance to the object (the higher the score, the more relevant the element). The set of scores, in turn, defines a permutation of the elements, obtained by sorting them in decreasing order of the score.

To evaluate ranking quality, we need some "reference" against which the algorithm's results can be compared. Let r_true denote the reference relevance, characterizing the "true" relevance of the elements for the given object (1 — the element fits perfectly, 0 — completely irrelevant), and let the corresponding reference permutation be obtained by sorting the elements by r_true.

There are two main ways to obtain r_true:
1. Based on historical data. For example, in content recommendations, one can take the user's views (likes, purchases) and assign 1 (relevant) to the viewed elements and 0 to everything else.
2. Based on expert judgement. For example, in the search task, one can hire a team of assessors who manually evaluate the relevance of documents to the query.

It is worth noting that when r_true takes only the extreme values 0 and 1, the reference permutation is usually not considered, and only the set of relevant elements with r_true = 1 is taken into account.

The purpose of ranking quality metrics is to determine how well the relevance scores produced by the algorithm, and the corresponding permutation, agree with the true relevance values. Let's look at the main metrics.

Mean average precision

Mean average precision at K (MAP@K) is one of the most frequently used ranking quality metrics. To understand how it works, let's start from the "basics".

Note: the "*precision" metrics are used for binary problems, where r_true takes only two values: 0 and 1.

Precision at K

Precision at K (P@K) — precision on K elements — is the basic ranking quality metric for a single object. Suppose our ranking algorithm has produced relevance scores for every element. Selecting the first K elements with the largest scores, we can compute the fraction of relevant ones among them. That is exactly what precision at K does:

P@K = (number of relevant elements among the top K) / K

Note: by the element at position k we mean the element that ends up at position k after the rearrangement: the element with the largest score is at position 1, the element with the second largest score at position 2, and so on.
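A minimal sketch of P@K (the helper name precision_at_k and the toy data are illustrative assumptions, not from the original text):

import numpy as np

def precision_at_k(r_true, ranking, k):
    """Fraction of relevant items among the top-k of the ranking."""
    top_k = ranking[:k]
    return np.sum(r_true[top_k]) / k

# toy example: 5 items, items 0 and 3 are relevant
r_true = np.array([1, 0, 0, 1, 0])
ranking = [3, 1, 0, 2, 4]   # items ordered by decreasing predicted score
print(precision_at_k(r_true, ranking, 2))   # 0.5: one of the top-2 items is relevant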

Average precision at K

Precision at K is easy to understand and implement, but it has an important drawback: it does not take into account the order of the elements within the "top". So if out of ten elements we guessed only one, it does not matter where it stands — first or last — the metric is the same. At the same time, it is obvious that the first option is much better.

This drawback is addressed by the ranking metric average precision at K (AP@K), which equals the sum of P@k over the indices k from 1 to K, taken only for the relevant elements, divided by K:

AP@K = (1/K) · Σ_{k=1..K} [the element at position k is relevant] · P@k

So, if out of three elements only the one in the last place turned out to be relevant and we found it, then AP@3 = (1/3)·(1/3) ≈ 0.11; if only the one in the first place, then AP@3 = 1/3 ≈ 0.33; and if all were relevant and found, then AP@3 = 1. A sketch of this computation is given below.
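A sketch of AP@K exactly as defined above, reusing the precision_at_k helper and toy data from the previous sketch:

def average_precision_at_k(r_true, ranking, k):
    """Sum of P@j over the positions j (1..k) that hold a relevant item, divided by k."""
    score = 0.0
    for j in range(1, k + 1):
        if r_true[ranking[j - 1]] == 1:   # position j is occupied by a relevant item
            score += precision_at_k(r_true, ranking, j)
    return score / k

print(average_precision_at_k(r_true, ranking, 2))   # item 3 sits at position 1 -> (1/1) / 2 = 0.5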

Now MAP@K is within our reach.

Mean average precision at K

Mean average precision at K (MAP@K) is one of the most frequently used ranking metrics. In P@K and AP@K the ranking quality is assessed for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of MAP@K is to compute AP@K for each object and average:

MAP@K = (1/N) · Σ_{i=1..N} AP@K(object i)

Note: this idea is quite logical under the assumption that all users are equally needed and equally important. If this is not the case, then instead of a simple average one can use a weighted one, multiplying each object's AP@K by a weight corresponding to its "importance".
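And a sketch of MAP@K on top of the previous helpers (the per-user containers are illustrative assumptions):

import numpy as np

def mean_average_precision_at_k(r_true_per_user, rankings_per_user, k):
    """MAP@K: AP@K averaged over all objects (users / queries)."""
    return np.mean([average_precision_at_k(r, rk, k)
                    for r, rk in zip(r_true_per_user, rankings_per_user)])

# two users sharing the same relevance vector but ranked differently
print(mean_average_precision_at_k([r_true, r_true], [ranking, [0, 3, 1, 2, 4]], 2))   # 0.75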

Normalized Discounted Cumulative Gain

Normalized discounted cumulative gain (nDCG) is another widespread ranking quality metric. As in the case of MAP@K, let's start with the basics.

Cumulative Gain at K

Let us again consider one object and the K elements with the largest scores. Cumulative gain at K (CG@K) is a basic ranking metric built on a simple idea: the more relevant elements there are in this top, the better:

CG@K = Σ_{k=1..K} r(k), where r(k) is the reference relevance of the element at position k.

This metric has obvious drawbacks: it is not normalized and does not take the positions of the relevant elements into account.

Note that, unlike P@K, CG@K (and the metrics below) can also work with non-binary values of the reference relevance.

Discounted Cumulative Gain at K

Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes the order of elements in the list into account, multiplying the gain of an element by a weight equal to the inverse logarithm of its position number:

DCG@K = Σ_{k=1..K} (2^{r(k)} − 1) / log₂(k + 1)

Note: since in our case r(k) takes only the values 0 and 1, 2^{r(k)} − 1 = r(k), and the formula looks simpler:

DCG@K = Σ_{k=1..K} r(k) / log₂(k + 1)

The use of a logarithm as a discount function can be explained by the following intuitive considerations: from the ranking point of view, positions at the beginning of the list differ much more strongly than positions at its end. For a search engine there is a whole gulf between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of results), while between positions 101 and 111 the difference hardly matters — few people ever reach them. These subjective considerations are nicely expressed with the help of the logarithm.

Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but aggravates the problem of the lack of normalization: when the relevance values vary, DCG@K takes values on a segment that is hard to interpret. To solve this problem, the following metric was introduced.

Normalized Discounted Cumulative Gain at K

As you can guess from the name, normalized discounted cumulative gain at K (nDCG@K) is nothing other than a normalized version of DCG@K:

nDCG@K = DCG@K / IDCG@K

where IDCG@K is the maximum (I — ideal) value of DCG@K, obtained on the ideal ordering of the elements by decreasing reference relevance.

In this way, the metric takes the positions of the elements in the list into account and takes values in the range from 0 to 1.

Note: by analogy with MAP@K, nDCG@K can be averaged over all objects.
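A minimal sketch of DCG@K and nDCG@K as defined above (it expects the relevance values listed in the order produced by the algorithm; for graded relevance one common convention applies the gain 2**r − 1 first, which coincides with plain r in the binary case):

import numpy as np

def dcg_at_k(gains, k):
    """DCG@K for relevance values ordered as the algorithm ranked them."""
    gains = np.asarray(gains, dtype=float)[:k]
    discounts = np.log2(np.arange(2, gains.size + 2))   # log2(k + 1) for k = 1..K
    return np.sum(gains / discounts)

def ndcg_at_k(gains, k):
    """nDCG@K: DCG@K divided by the DCG@K of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([0, 1, 1, 0, 1], k=5))   # ~0.71 for this toy ordering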

Mean reciprocal rank

Mean reciprocal rank (MRR) is another frequently used ranking quality metric. It is given by the formula:

MRR = (1/N) · Σ_{i=1..N} 1 / rank_i

where 1/rank_i is the reciprocal rank for the i-th object — a very simple quantity equal to the inverse of the position of the first correctly guessed (relevant) element.

Mean reciprocal rank varies in the range [0, 1] and takes the positions of elements into account. Unfortunately, it does so only for one element — the first correctly predicted one — paying no attention to all the following ones.
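A minimal sketch of MRR, assuming the same representation as above (a ranking of item indices per object and a binary r_true vector); if no relevant item appears in the list, the reciprocal rank is taken as 0 here — an assumption, since the article does not cover that case:

import numpy as np

def mean_reciprocal_rank(rankings_per_user, r_true_per_user):
    """Average of 1 / (rank of the first relevant item) over all objects."""
    rr = []
    for ranking, r_true in zip(rankings_per_user, r_true_per_user):
        rank = next((pos for pos, item in enumerate(ranking, start=1) if r_true[item] == 1), None)
        rr.append(1.0 / rank if rank is not None else 0.0)
    return np.mean(rr)

print(mean_reciprocal_rank([[1, 3, 0, 2, 4]], [np.array([1, 0, 0, 1, 0])]))   # first relevant at rank 2 -> 0.5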

Metrics based on rank correlation

Separately, it is worth highlighting ranking metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let's look at the two most common rank correlation coefficients: Spearman's and Kendall's.

Kendell rank correlation coefficient

The first of them is the Kendall correlation coefficient, which is based on counting concordant
(and discordant) pairs in the two permutations — pairs of elements to which the permutations assigned the same (respectively, different) order:

τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1)/2)

Spearman's rank correlation coefficient

The second is the Spearman rank correlation coefficient, which is in essence nothing other than the Pearson correlation computed on the rank values. There is a convenient formula expressing it through the ranks directly:

ρ = 1 − 6 · Σ_i d_i² / (n(n² − 1)),

where d_i is the difference between the ranks assigned to the i-th element by the two permutations, and the result coincides with the Pearson correlation coefficient of the ranks.

Metrics based on rank correlation have a drawback we have already discussed: they do not take the positions of the elements into account (even worse than P@K, since the correlation is computed over all elements, not just those with the highest rank). In practice they are therefore used quite rarely.
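In practice the rank correlation coefficients do not need to be implemented by hand; a minimal sketch with scipy (the toy score lists are illustrative):

from scipy.stats import kendalltau, spearmanr

reference_scores = [3, 1, 2, 5, 4]   # "true" relevance values
predicted_scores = [2, 1, 3, 5, 4]   # scores produced by the ranking algorithm

tau, _ = kendalltau(reference_scores, predicted_scores)
rho, _ = spearmanr(reference_scores, predicted_scores)
print("Kendall tau:  %.3f" % tau)
print("Spearman rho: %.3f" % rho)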

Metrics based on a cascade model of user behavior

Up to this point we have not gone into how the user (below we consider the particular case where the object is a user) examines the elements offered to him. In fact, we implicitly assumed that the viewing of each element is independent of the viewing of the other elements — a kind of "naivety". In practice, however, elements are often viewed one by one, and whether the user views the next element depends on his satisfaction with the previous ones. Consider an example: in response to a search query the ranking algorithm offered the user several documents. If the documents at positions 1 and 2 turned out to be extremely relevant, the probability that the user will look at the document at position 3 is small, because he is fully satisfied with the first two.

Such models of user behavior, in which the offered elements are examined sequentially and the probability of viewing an element depends on the relevance of the previous ones, are called cascade models.

Expected reciprocal rank

Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the formula:

ERR = Σ_{k=1..K} (1/k) · P(the user stops at position k),

where the positions are taken in decreasing order of the score. The most interesting thing about this metric is the probabilities. They are computed using the assumptions of the cascade model:

P(the user stops at position k) = p_k · Π_{j=1..k−1} (1 − p_j),

where p_k is the probability that the user is satisfied with the object at position k. These values are computed from r_true. Since in our case r_true ∈ {0, 1}, we can use the simple option:

p_k = r_true(element at position k),

which can be read as: the probability of satisfaction at position k simply equals the reference relevance of the element at that position. A minimal sketch of the ERR computation is given below.
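A minimal sketch of ERR under the cascade model described above; the prob argument maps a relevance grade to the satisfaction probability (the article uses p = r for binary relevance; for graded relevance a common choice is (2**r − 1) / 2**r_max):

def expected_reciprocal_rank(gains, prob=lambda r: r):
    """ERR: sum over positions of (1/k) * P(user stops at k) under the cascade model."""
    err, not_stopped = 0.0, 1.0
    for k, r in enumerate(gains, start=1):
        p = prob(r)
        err += not_stopped * p / k
        not_stopped *= (1.0 - p)
    return err

print(expected_reciprocal_rank([0, 1, 1, 0, 1]))   # first relevant item at position 2 -> 0.5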

UDC 519.816

S. V. SEMENIKHIN L. A. DENISOVA

Omsk State Technical University

A MACHINE LEARNING METHOD FOR RANKING

BASED ON A MODIFIED GENETIC ALGORITHM FOR THE NDCG METRIC

The problem of ranking documents on a search results page and the questions of machine-learned ranking are considered. An approach is proposed for optimizing the ranking function using the NDCG quality metric on the basis of a modified genetic algorithm. The developed algorithms were studied (on the LETOR test collections) and their effectiveness for machine-learned ranking is shown.

Key words: information retrieval, machine-learned ranking, relevance, optimization, genetic algorithms.

1. Introduction. Modern information retrieval systems (IRS) process such large volumes of data that the key task becomes ranking the relevant documents in response to a user's search query. At the current stage of IRS development, machine-learned (ML) ranking is of the greatest interest. The basic approaches to ML ranking, based on numerical methods (in particular, gradient methods) or on analytical computations, have a number of drawbacks that significantly affect the quality of information retrieval and the time required to rank relevant documents.

At the beginning of this research, approaches to machine-learned ranking based on the gradient descent method were reviewed. In those works, ML is reduced to the optimization of search quality metrics (SQM), but only metrics represented by continuous functions are used. This limitation often leads to the situation where, as a result of optimization, the ranking function obtains low scores on many important generally accepted indicators (DCG, nDCG, Graded Mean Reciprocal Rank, etc.), which are discrete functions. One of the cited works proposes using genetic algorithms (GA) in learning to rank to minimize the Huber loss function, with expert relevance judgements as reference values. In the present work, an ML approach based on optimizing discrete information retrieval quality metrics is proposed.

2. Statement of the machine-learned ranking problem. In most modern information retrieval systems, the ranking function is built on the basis of n simple ranking functions (SRF) and can be written as:

RF(q, d) = Σ_{i=1..n} WC_i · SRF_i(q, d),

where SRF_i is the i-th simple ranking function for document d and query q, WC_i is the weight coefficient of the i-th simple ranking function, and n is the number of SRFs in the ranking system.

In the course of machine-learned ranking, a set of search documents D and queries Q from the LETOR test collection is used. For every query q ∈ Q, a pair is formed with each document d ∈ D. For every such pair, the IRS determines the relevance value used to rank the search results. In order to evaluate the ranking quality, the system requires reference relevance values E for each document-query pair (d, q). For this purpose, expert relevance judgements are used.

For the research, an IRS is used in which ranking is performed on the basis of N = 5 simple ranking functions SRF_i(WC), i = 1, ..., N, which form a vector optimality criterion:

where WC ∈ {WC} is the vector of varied parameters, and {WC}, {YB} are the spaces of parameters and of vector criteria, respectively.

Using genetic algorithms for ML ranking makes it possible to maximize discrete quality metrics such as nDCG. The nDCG metric for ranking documents in a search engine is defined as follows:

DCG@n = Σ_{p=1..n} grade(p) · 1 / log₂(2 + p),

where grade(p) is the average relevance score assigned by the experts to the document at position p in the list of results, and 1/log₂(2 + p) is a coefficient that depends on the position of the document (the first documents have greater weight).

Then the normalized version of NDCG is written as

NDCG@p = DCG@p / r,

where r is a normalization factor equal to the maximum possible value of DCG@p for the given query (i.e., the DCG of the ideal ranking).

Thus, to optimize (maximize) the NDCG metric, the objective function is taken as the NDCG@p value, averaged over the training queries, to be maximized over the vector of weight coefficients WC.
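As a hedged sketch (not the authors' code) of what such an objective function might look like for a single query — rank the documents by the weighted sum of simple ranking functions and score the resulting order with NDCG:

import numpy as np

def ndcg_for_weights(wc, srf_matrix, grades, k=10):
    """Hypothetical per-query objective: srf_matrix has one row per document and one
    column per simple ranking function; grades are the expert relevance scores."""
    scores = srf_matrix @ wc                 # RF(q, d) = sum_i WC_i * SRF_i(q, d)
    order = np.argsort(-scores)              # documents sorted by decreasing score
    top = min(k, len(order))
    discounts = 1.0 / np.log2(np.arange(2, top + 2))   # conventional 1/log2(p + 1) discount
    dcg = np.sum(np.asarray(grades)[order][:top] * discounts)
    idcg = np.sum(np.sort(grades)[::-1][:top] * discounts)
    return dcg / idcg if idcg > 0 else 0.0

The full objective would be this value averaged over all queries of the training collection.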

3. Metrics of the quality of ranking of search results. When ranking documents, quality metrics play the role of criteria. From the list of generally accepted metrics for assessing the quality of an IRS, three main ones were chosen, which evaluate the precision, relevance and completeness of information retrieval.

1. Criterion of the precision of information retrieval:

P = a / (a + b),

where a is the number of relevant documents found, and b is the number of documents mistakenly accepted as relevant.

2. The Bpref criterion, which evaluates the relevance of information retrieval, is computed over the R judged relevant documents and is calculated by the formula

Bpref = (1/R) · Σ_r (1 − NonRelBefore(r) / R). (4)

Here r denotes a judged relevant document, and NonRelBefore(r) is the number of known irrelevant documents ranked above r (only the first R judged irrelevant documents of the run are taken into account in the calculation).

3. Criterion of the completeness (recall) of the search results:

r = a / (a + c),

where a is the number of relevant documents found, and c is the number of relevant documents that were not found.

4. Test collections. The machine-learned ranking task requires a set of documents and queries with corresponding relevance judgements assigned by experts. These data are used both for machine learning of the ranking function and for assessing the quality of the ranking of search results by the system. In the ML process, the test collections are used as the training sample and therefore have a significant influence on the results. For the research, the LETOR test collections of documents and queries were used. These collections are maintained for information retrieval research by Microsoft Research. Table 1 shows the characteristics of the LETOR test collections.

5. Modified genetic algorithm. To use genetic algorithms for the machine-learned ranking task, the problem must be formulated so that the solution is encoded as a vector (genotype), where each gene can be a bit, a number or some other object. In this case, the genotype is the vector of weight coefficients of the corresponding ranking factors. The genetic algorithm searches for the optimal solution within a limited number of generations or a limited time allotted for evolution.

It should be noted that GAs are most effective at finding the region of the global extremum, but they can work slowly when a local minimum has to be located within that region. The proposed way to eliminate this shortcoming is to build a modified genetic algorithm (MGA), which switches to a local (fast) optimization algorithm after the region of the global optimum has been found by the basic GA. The MGA proposed in this work is a hybrid method based on the classical GA and the Nelder-Mead method (the simplex algorithm). The Nelder-Mead method, a frequently used nonlinear optimization algorithm, is a numerical method for finding the minimum of an objective function in a multidimensional space. In this work, the hybrid MGA algorithm switches to the Nelder-Mead method once the stopping conditions of the GA stage are met. A block diagram of the MGA algorithm is shown in Fig. 1.

In the research, a limit on the number of objective function evaluations (Nrf = 16,000) was adopted for the search of the global extremum region, together with the condition of switching to the local optimization algorithm based on the Nelder-Mead method (after the basic genetic algorithm has performed 75% of the Nrf operations).
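A very rough sketch of the hybrid scheme (not the authors' implementation): spend about 75% of the evaluation budget on a crude global, evolutionary-style search — here replaced by random sampling as a placeholder for the GA — and then refine the best point with scipy's Nelder-Mead:

import numpy as np
from scipy.optimize import minimize

def hybrid_optimize(objective, dim, n_rf=16000, ga_share=0.75, seed=42):
    """Sketch of the two-stage scheme: global search, then Nelder-Mead refinement."""
    rng = np.random.default_rng(seed)
    ga_budget = int(n_rf * ga_share)

    # stage 1: placeholder for the genetic algorithm -- plain random sampling here
    best_x, best_f = None, -np.inf
    for _ in range(ga_budget):
        x = rng.uniform(0.0, 1.0, size=dim)
        f = objective(x)
        if f > best_f:
            best_x, best_f = x, f

    # stage 2: local refinement with the Nelder-Mead simplex method (maximize -> minimize -f)
    result = minimize(lambda x: -objective(x), best_x, method="Nelder-Mead",
                      options={"maxfev": n_rf - ga_budget})
    return result.x, -result.fun

The objective passed in could be, for example, the NDCG-based function from the previous sketch averaged over the training queries.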

6. Results.

Table 1

Number of documents and queries in test collections

Test collection | Subcollection | Number of queries | Number of documents

LETOR 4.0 MQ2007 1692 69623

LETOR 4.0 MQ2008 784 15211

LETOR 3.0 OHSUMED 106 16140

LETOR 3.0 Gov03td 50 49058

LETOR 3.0 Gov03np 150 148657

LETOR 3.0 Gov03hp 150 147606

LETOR 3.0 Gov04td 75 74146

LETOR 3.0 Gov04np 75 73834

LETOR 3.0 Gov04hp 75 74409

Fig. 1. Flowchart of the hybrid ML ranking algorithm based on genetic algorithms and the Nelder-Mead method

As a result of studying the LTR-MGA machine-learned ranking algorithm, the vector of coefficients WC* of the ranking function was obtained. Next, the ranking quality was evaluated on the data of the LETOR test collection, for which the quality metrics were calculated. The discrete ranking metric NDCG@p evaluates the quality of the first p documents returned by the system. The generally accepted practice is to report NDCG@p for several fixed values of p; here, for a more detailed view of the behavior of the metric, NDCG@p was considered for all p from 1 to 10. To compare the effectiveness of the developed algorithm with existing solutions, a comparative analysis was carried out using the ranking algorithms provided with the LETOR 3.0 collections. The results of running the algorithms on the TD2003 and TD2004 test collections for the NDCG metric are shown in Fig. 2. They show that the LTR-MGA algorithm outperforms the baseline algorithms, with the highest gains

achieved for NDCG@1 (i.e. at the level of the first document). The advantage of the LTR-MGA algorithm over the ranking functions considered in the experiments is that, in the proposed approach, it is the ranking metric NDCG itself that is used as the objective function for optimizing the ranking function.

In order to evaluate the ranking quality obtained when using the coefficients found by the LTR-MGA algorithm, the values of the document ranking quality metrics in the search engine were calculated (Fig. 3). Comparison of the ranking results (Table 2) obtained with the baseline ranking function, the basic LTR-GA algorithm and the modified LTR-MGA algorithm indicates the superiority of the latter.

In addition, the study estimated the time required for machine-learned ranking. This is necessary to confirm that, in this respect, the proposed LTR-MGA method is superior to the approach based on the traditional genetic algorithm (LTR-GA).

Fig. 2. Comparison of machine-learned ranking algorithms

by the NDCG metric on the test collections: left — the Gov03td dataset, right — the Gov04td dataset

Fig. 3. Ranking quality metrics for the baseline ranking formula and the developed LTR-GA and LTR-MGA algorithms

Ranking quality metrics for different machine-learned ranking algorithms

Table 2

Ranking metric | Baseline ranking function | LTR-GA | LTR-MGA | Metric improvement, %

Precision | 0.201 | 0.251 | 0.267 | 26.81

NDCG@5 (first 5 documents) | 0.149 | 0.31 | 0.339 | 90.47

NDCG@10 (first 10 documents) | 0.265 | 0.342 | 0.362 | 29.14

Bpref | 0.303 | 0.316 | 0.446 | 51.49

Completeness (recall) | 0.524 | 0.542 | 0.732 | 39.03

* The best values for the corresponding metric are highlighted

The results of comparing the time costs of running the LTR-GA and LTR-MGA algorithms are given in Table 3.

7. Conclusion. The research has shown that when the modified algorithm is used, the values of the considered ranking quality metrics in the IRS increase (by 19.55% on average compared with the LTR-GA algorithm). This confirms that LTR-MGA works correctly and improves the ranking function significantly, i.e. the optimization is carried out successfully. Thanks to the modified algorithm,

owing to the use of the local optimization method and the imposed limit on the number of objective function evaluations, the machine learning time decreased (by 17.71% on average compared with the traditional genetic algorithm LTR-GA).

The developed machine-learned ranking algorithm LTR-MGA can be used in an IRS whose ranking model is based on a combination of simple ranking functions. However, certain limitations of the proposed approach should be taken into account.

Estimation of the machine learning time for ranking depending on the size of the training sample

Table 3

Columns: size of the test collection of documents; LTR-GA execution time; LTR-MGA execution time; change in execution time, %. The last row gives the average values. (The numeric values of this table were lost.)

* The best values for the corresponding test-collection size are highlighted

Based on the results obtained, it was found that after ML the greatest gain is shown by the ranking metric whose value was taken as the objective function. At the same time, other metrics may improve only insignificantly, and in some cases may even worsen. As one way to eliminate this shortcoming, it is proposed to treat the optimization problem as a multi-criteria one: to improve several basic ranking metrics of the search results uniformly, instead of optimizing a single one. In addition, in future research it is planned to develop a methodology for constructing the objective function based on a linear combination of the basic ranking metrics in order to improve the quality of information retrieval.

References

1. Tie-Yan Liu. Learning to Rank for Information Retrieval // Foundations and Trends in Information Retrieval. Vol. 3, Issue 3. March 2009. P. 225-331.

2. Christopher J. C. Burges, Tal Shaked, Erin Renshaw. Learning to Rank using Gradient Descent // Proceedings of ICML '05, the 22nd International Conference on Machine Learning. 2005. P. 89-96.

3. Semenikhin, S. V. Further investigation of approaches to machine ranking of documents using a search system based on genetic algorithms / S. V. Semenikhin // Russia is young: advanced technologies - in industry. – 2013. – No. 2. – P. 82 – 85.

4. Denisova, L. A. Multi-criteria optimization based on genetic algorithms in the synthesis of control systems: monograph / L. A. Denisova. - Omsk: OmSTU Publishing, 2014. - 170 p. - ISBN 978-5-8149-1822-2.

5. Denisova, L. A. Automation of parametric synthesis of control systems using a genetic algorithm / L. A. Denisova, V. A. Meshcheryakov // Automation in industry. – 2012. – No. 7. – P. 34 – 38.

6. Huber, Peter J. Robust Estimation of a Location Parameter // Annals of Mathematical Statistics. - 1964. - Vol. 35, No. 1. - P. 73-101.

7. Semenikhin, S. V. Automation of information search based on rich criteria optimization and genetic algorithms / S. V. Semenikhin, L. A. Denisova // Dynamics of systems, mechanisms and machines. – 2014. – No. 3. – P. 224 – 227.

8. Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong and Hang Li. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval // SIGIR 2007 Workshop on Learning to Rank for Information Retrieval. - 2007. - P. 3-10.

9. Ageev, M. S. Official metrics ROMIP "2004 / M. S. Ageev, I. E. Kuralenok // II Russian seminar on assessing methods of information search (ROMIP 2004), Pushchino, 2004: tr.; ed. I . S. Nekrestyanova. - St. Petersburg: NDI Chemistry of St. Petersburg State University. - P. 142-150.

10. J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (1965). 308-313.

SEMENIKHIN Svyatoslav Vitalievich, postgraduate student of the Department of Automated Information Processing and Control Systems. Address for correspondence: [email protected] DENISOVA Lyudmila Albertovna, Doctor of Technical Sciences, Associate Professor of the Department of Automated Information Processing and Control Systems. Address for correspondence: [email protected]

Quite often in the practice of a systems analyst, when writing an FRD, one encounters requirements that are vague and not formalized. Examples are requirements like:

  • The application must work fast
  • The application must consume little traffic
  • The video material must be of good quality

Such requirements, written into the FRD "as is", are a monstrous source of problems later on. Formalizing such requirements is a constant headache for the analyst. Usually the analyst solves the problem in two steps: first an "equivalent" formal requirement is proposed, then in the course of discussion (with the customer, a subject-matter expert, etc.) it is established whether this formal requirement can replace the original one. Generally speaking, what we obtain is not a functional requirement: it describes not "what" the system must do, but "how well" it must do it. And "how well" must be formulated as a specific, measurable characteristic.

This was a preamble to the thesis that a systems analyst must have a good command of the mathematical apparatus and, at the same time, be able to explain the "math" to the customer. Now let's look at an example.

About the classification problem

Suppose we are writing the FRD for a contextual advertising system similar to Amazon Omakase. One of the modules of our future system will be a context analyzer:

The analyzer receives the text of a web page as input and performs contextual analysis of it. How exactly it does this does not particularly concern us; what is important is that at the output we get a set of product categories (whose list is not fixed in advance). Further, based on these categories, we can show banners, product links (like Amazon), and so on. For us, the analyzer is so far a black box to which we can pose a question (in the form of a document text) and receive an answer.

The customer would like the analyzer to "determine the context well". We need to formulate what that requirement means. First, let's talk about the context as such, i.e. about the very set of categories that the analyzer returns. This can be defined as a classification problem, where a document (web page) is assigned one or several classes from a predefined set; in our case the classes are product categories. The classification problem is quite common in text processing (for example, spam filters).

Evaluation metrics

Let's look at the evaluation metrics used for classification. Suppose that for a certain set of documents we know the correct categories. The answers of our hypothetical analyzer can then be grouped as follows:

  • True positives — the categories that we expected to see and that the analyzer returned
  • False positives — categories that should not be in the output but that the analyzer mistakenly returned
  • False negatives — categories that we expected but that the analyzer failed to detect
  • True negatives — categories that should not be in the output and that are indeed absent from the analyzer's output; these are correct negative answers

Let us call a test sample a set of documents (pages) for which we know the correct answers in advance. Counting the number of hits per category (counting hits over document-category pairs), we get the canonical contingency table:

The left column of the table contains the "correct" combinations of documents and categories (whose presence we expect in the output), the right column the incorrect ones. The top row of the table contains the positive answers of the classifier, the bottom row the negative ones (in our case, the absence of a category in the answer). If the number of all document-category pairs is N, it is easy to see that tp + fp + fn + tn = N.

Armed with this terminology, we could now write down the customer's requirement in the form "the share of wrong answers is zero" and stop there. But in practice such systems do not exist, and the analyzer will, of course, make errors relative to the test sample. The accuracy metric will help us understand the percentage of errors:

accuracy = (tp + tn) / N

In the numerator we see the diagonal of the table — the total number of correct answers — divided by the total number of answers. For example, an analyzer that gives 9 correct answers out of 10 possible has an accuracy of 90%.

The F1 metric

Let us demonstrate the simplest flaw of the accuracy metric with a brand example. Suppose we want to count mentions of specific brands in a text. Consider a classification problem whose goal is to determine whether a given entity is one of the brands of interest (Timberland, Columbia, Ted Baker, Ralph Lauren, etc.). In other words, we split the entities in the text into two classes: A — Brand, B — Everything else.

Now consider a degenerate classifier that simply returns class B (Everything else) for any entity. For this classifier the number of true positive answers equals 0. Generally speaking, let's think about how often, when reading text on the Internet, we actually encounter these brands? It turns out, unsurprisingly, that 99.9999% of the words of a text are not these brands. Let's build the contingency table of answers for a sample of 100,000 entities:

Its accuracy equals 99990 / 100000 = 99.99%! So we have easily built a classifier that, in essence, does nothing useful, yet has an enormous share of correct answers. At the same time, it is perfectly clear that we have not solved the problem of detecting our brands. The point is that the correct entities in the text are heavily "diluted" by other words that mean nothing for the classification. Given this example, it is entirely understandable that one wants to use other metrics. For example, the tn value here is clearly "junk": it does count as a correct answer, but the growth of tn strongly "inflates" the contribution of tp (which is what matters to us) in the accuracy formula.

The measure of precision (P) is defined as:

P = tp / (tp + fp)

As is easy to see, the precision measure characterizes how many of the positive answers received from the classifier are correct. The higher the precision, the smaller the number of false hits.

The precision measure, however, says nothing about whether the classifier returned all the correct answers. For this there is the so-called recall measure (R):

R = tp / (tp + fn)

The recall measure characterizes the classifier's ability to "guess" as many of the expected positive answers as possible. Note that false-positive answers do not affect this metric at all.

Precision and Recall give a fairly exhaustive characterization of the classifier, and "from different angles". Usually, when building such systems, one has to balance between the two metrics all the time. If you try to raise Recall by making the classifier more "optimistic", Precision drops because of the growing number of false positives. If instead you tune the classifier to be more "pessimistic", for example by filtering the results more strictly, then as Precision grows, Recall immediately drops because some correct answers get rejected. It is therefore convenient to characterize the classifier by a single value, the so-called F1 metric:

F1 = 2 · P · R / (P + R)

In fact, this is simply the harmonic mean of P and R. The F1 metric reaches its maximum of 1 (100%) if P = R = 100%
(it is easy to see that for our degenerate classifier F1 = 0). The F1 value is one of the most widespread metrics for systems of this kind, and it is F1 that we will use to formulate the threshold quality of our analyzer in the FRD.

There are two main approaches to computing F1.

  • Summary F1: the results for all classes are gathered into one single contingency table, from which the F1 metric is then computed.
  • Mean F1: for each class we build its own contingency table and its own F1 value, and then take the simple arithmetic mean over all classes.

Why is the second method needed? The point is that the sample sizes for different classes can differ greatly. For some classes we may have very few examples, for others a great many. As a result, the metrics of one "big" class, merged into a single common table, "drown out" all the rest. In a situation where we want to evaluate the quality of the system more or less uniformly over all classes, the second option is better suited.
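These two approaches roughly correspond to micro- and macro-averaging in sklearn; a minimal sketch with hypothetical labels:

from sklearn.metrics import f1_score

# hypothetical multi-class labels: class 0 dominates the sample
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

print("Summary (micro) F1: %.3f" % f1_score(y_true, y_pred, average="micro"))
print("Mean (macro) F1:    %.3f" % f1_score(y_true, y_pred, average="macro"))

The two numbers differ noticeably, exactly because the small classes weigh as much as the big one in the macro average.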

Training and test samples

Until now we have considered classification on a single sample for which all the answers are known. If we return to the context analyzer that we are trying to describe, things look a little more complicated.

First of all, we must fix the product categories. A situation where we guarantee some F1 value while the set of classes can expand indefinitely is practically a dead end. Therefore it is additionally stipulated that the set of categories is fixed.

We compute the F1 value on a given, known-in-advance sample. This sample is usually called the training sample. However, we do not know how the classifier will behave on data unknown to us. For these purposes a so-called test sample is usually used, sometimes called the golden set. The difference between the training and the test sample is purely notional: given some set of examples, we can split it into a training and a test part however we like. But for self-learning systems, forming a correct training sample is very critical: incorrectly selected examples can strongly skew the quality of the system.

A typical situation: the classifier shows a good result on the training sample and a complete failure on the test one. If our classification algorithm is based on machine learning (i.e. depends on the training sample), we can evaluate its quality by a more complex "floating" scheme. To do this, we split all the examples we have into, say, 10 parts. We take the first part and use it for training the algorithm; the remaining 90% of the examples serve as the test sample, and the F1 value is computed. Then we take the second part and use it for training; we get another F1 value, and so on. As a result we obtain 10 F1 values; we then take their arithmetic mean, which becomes the final result. Let me repeat: this method (also called cross-fold validation) only makes sense for algorithms based on machine learning.
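A hedged sketch of such an evaluation with sklearn (assuming a labeled sample X, y and some classifier; note that standard k-fold trains on k−1 parts and tests on the remaining one, which is the more common variant of the scheme described above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
print("Mean F1 over 10 folds: %.3f +/- %.3f" % (scores.mean(), scores.std()))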

Returning to writing the FRD, we notice that our situation is much worse. We have a potentially unbounded set of input data (all web pages of the Internet), and there is no way to evaluate the context of a page other than human judgement. Thus our sample can only be formed by hand and depends heavily on the whims of its compilers (the decision to assign a page to a particular category is made by a person). We can estimate F1 on the known examples, but we cannot find out F1 for all Internet pages. Therefore, for potentially unbounded data sets (such as web pages, of which there are very many), the "unsupervised" scheme is sometimes used: a certain number of examples (pages) is chosen at random, and an operator (a person) builds the correct set of categories (classes) for them. Then the classifier can be tested on these selected examples. Further, assuming that the chosen examples are typical, we can approximately estimate the precision of the algorithm. In doing so we cannot estimate Recall (it is unknown how many correct answers exist outside the selected examples), and therefore we cannot compute F1 either.

Thus, if we want to know how the algorithm behaves on all possible input data, the best we can estimate in this situation is an approximate value of Precision. If everyone agrees to use a fixed, pre-agreed sample, then the mean value of F1 can be computed on that sample.

The bottom line

In the end, we will have to do something like this:

  1. Fix a training sample. The training sample is built from the customer's ideas about the "correct" context.
  2. Fix the set of categories for our analyzer. We cannot compute F1 over an unbounded set of classes, can we?
  3. Describe the requirement in the form: the analyzer must determine the context with a mean F1 value of at least 80% (for example).
  4. Explain this to the customer.

As you can see, writing an FRD for such a system is not easy (especially the last point), but it is possible. As for the threshold F1 value, in such cases one can start from the F1 values achieved in similar classification tasks.

In learning to rank, the training sample consists of a set of lists, and a partial order is defined on the elements within each list. The partial order is usually specified by assigning a grade to each element (for example, "relevant" or "not relevant"; more than two grades are possible). The goal of a ranking model is to approximate and generalize, in the best way (in a certain sense), the ranking method of the training sample to new data.

Learning to rank is still a fairly young, rapidly developing field of research that emerged in the 2000s with the interest of the information retrieval community in applying machine learning methods to ranking problems.

For use by a ranking model, each document-query pair is translated into a numeric vector of ranking features (also called ranking factors or signals) that characterize the properties of the document, the query and their relationship. Such features can be divided into three groups (depending on whether they are computed from the document alone, from the query alone, or from both).

Below are several examples of ranking features used in the widely known LETOR dataset:

  • Values of the measures TF, TF-IDF, BM25 and the language model for matching the query against different zones of the document (title, URL, body text, anchor text);
  • Lengths and IDF sums of the document zones;
  • Document ranks obtained by various variants of link-analysis ranking algorithms such as PageRank and HITS.

Ranking quality metrics

There are several metrics by which the performance of ranking algorithms on a sample with relevance judgements is evaluated and compared. Often the parameters of a ranking model are tuned so as to maximize the value of one of these metrics.

Examples of such metrics (MAP, DCG, nDCG, MRR and others) were discussed above.

Classification of algorithms

In his article "Learning to Rank for Information Retrieval" and in talks at specialized conferences, Tie-Yan Liu of Microsoft Research Asia analyzed the existing methods for learning to rank and classified them into three approaches, depending on the input representation and the loss function used:

Pointwise approach
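The description of the pointwise approach is cut off here; as a hedged illustration of the general idea (each query-document pair is scored independently, so ranking reduces to ordinary regression or classification on the feature vectors), a minimal sketch with synthetic data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical data: each row is a feature vector of one query-document pair,
# y is the relevance grade assigned by assessors
X = np.random.rand(1000, 10)
y = np.random.randint(0, 3, size=1000)

model = GradientBoostingRegressor().fit(X, y)

# at query time, the documents for a query are sorted by the predicted relevance
doc_features = np.random.rand(20, 10)
ranking = np.argsort(-model.predict(doc_features))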

Notes

1. Tie-Yan Liu (2009), Learning to Rank for Information Retrieval, Foundations and Trends in Information Retrieval: Vol. 3: No. 3, pp. 225-331, ISBN 978-1-60198-244-5, DOI 10.1561/1500000016. Slides from T. Liu's talk at the WWW 2009 conference are available.