What do you need to know about data mining? Data Mining as the extraction of knowledge from data. Data Mining methods

Data Mining tools

Currently, Data Mining technology is represented by a range of widely distributed commercial software products. A current, regularly updated list of these products can be found at www.kdnuggets.com, a website dedicated to Data Mining. Data Mining software could be classified by the same principles that underlie the classification of the technology itself, but such a classification would have little practical value: owing to intense market competition and the race for complete technological solutions, many Data Mining products now cover literally every aspect of analytical technology. It is therefore more useful to classify Data Mining products by how they are implemented and, accordingly, by what potential for integration they provide. Admittedly, this criterion is somewhat arbitrary, since it does not draw sharp boundaries between products; but it has one undeniable advantage: it makes it possible to decide quickly which ready-made solution to choose when initiating projects in data analysis, the development of decision-support systems, the creation of a data warehouse, and so on.

Therefore, Data Mining products can be loosely divided into three large categories:

    products built into database management systems as an integral part;

    libraries of Data Mining algorithms with accompanying infrastructure;

    boxed, off-the-shelf solutions (“black boxes”).

Products in the first two categories provide the greatest potential for integration and allow analytical capabilities to be embedded in practically any application. Boxed products, in turn, may offer unique advances in Data Mining or be specialized for a particular application area. However, in most cases they are difficult to integrate into broader solutions.

The inclusion of analytical capabilities in commercial database management systems is a natural trend with enormous potential. Indeed, where, if not in the places where the data is concentrated, does it make the most sense to place the means of processing it? Following this principle, Data Mining functionality is already implemented in commercial databases such as:

    Microsoft SQL Server;

Main points

    Intelligent data analysis makes it possible, automatically and on the basis of a large amount of accumulated data, to generate hypotheses that can then be verified by other analysis methods (for example, OLAP).

    Data Mining is the discovery by a machine (algorithms, artificial intelligence tools) of knowledge in raw data that was previously unknown, non-trivial, practically useful, and accessible to human interpretation.

    Data Mining methods solve three main problems: classification and regression, discovery of association rules, and clustering. By their purpose they are divided into descriptive and predictive. By the way they learn, methods are divided into supervised learning (learning with a teacher) and unsupervised learning (learning without a teacher).

    The classification and regression tasks reduce to determining the value of an object's dependent variable from its independent variables. If the dependent variable takes numeric values, we speak of a regression task; otherwise, of a classification task.

    Association rule mining consists in finding frequent relationships (associations) between objects or events. The found rules can serve both as a compact description of the nature of the analyzed data and for prediction.

    Clustering consists in finding independent groups (clusters) and their characteristics in the whole set of analyzed data. Solving this task helps one understand the data better. Moreover, grouping similar objects makes it possible to reduce their number and, therefore, to simplify analysis.

    Data Mining methods lie at the intersection of different areas of information technology: statistics, neural networks, fuzzy sets, genetic algorithms, and others.

    Intelligent analysis includes the following stages: understanding and formulating the analysis task, preparing the data for automated analysis, applying Data Mining methods and building models, verifying the built models, and human interpretation of the models.

    Before Data Mining methods are applied, the source data may need to be transformed. The kind of transformation depends on the methods that will be used.

    Data Mining methods can be effectively applied in various areas of human activity: business, medicine, science, telecommunications, etc.
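As a minimal illustration of the association-rule task summarized above, the sketch below computes the two standard rule measures, support and confidence, over a toy set of market-basket transactions. All item names and data are invented for the example.

```python
# Toy market-basket transactions (invented data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); assumes the antecedent occurs."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule {bread} -> {milk}: support = 2/4, confidence = 2/3.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as {bread} → {milk} would be reported only if both measures exceed user-chosen thresholds; real algorithms (Apriori and its descendants) add machinery for enumerating candidate itemsets efficiently.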

3. Analysis of text information – Text Mining

Analysis of structured information stored in databases requires preliminary processing: designing the database, entering information according to fixed rules, placing it in special structures (for example, relational tables), and so on. Thus, analyzing this information and extracting new knowledge from it requires additional effort. These costs are incurred before the analysis itself and do not always pay off, since the analysis covers only the structured part of the information. Moreover, not all types of data can be structured without losing valuable information. For example, it is practically impossible to convert text documents into a tabular representation without losing the semantics of the text and the relationships between entities. For this reason, such documents are stored in the database without transformation, as text fields (BLOB fields). Yet text conceals a large amount of information whose unstructured nature prevents the use of Data Mining algorithms. The analysis of unstructured text addresses this problem. In the literature, such analysis is called Text Mining.

Methods for analyzing unstructured texts lie at the intersection of several areas: Data Mining, natural language processing, information retrieval, information extraction, and knowledge management.

The definition of Text Mining: knowledge discovery in text is the non-trivial process of identifying truly new, potentially useful, and understandable patterns in unstructured text data.

As you can see, it differs from the definition of Data Mining only in the new concept of “unstructured text data.” This refers to a set of documents that constitute logically coherent text without any constraints on its structure. Examples of such documents are web pages, e-mail, regulatory documents, and so on. In general, such documents can be complex and large and include not only text but also graphic information. Documents that use XML (eXtensible Markup Language), SGML (Standard Generalized Markup Language), or other similar markup conventions are usually called semi-structured documents. They, too, can be processed by Text Mining methods.

The process of analyzing text documents can be represented as a sequence of several steps:

    Information retrieval. The first step is to identify which documents are to be analyzed and to ensure their availability. As a rule, users can select the set of documents to analyze manually, but with a large number of documents automated selection based on specified criteria must be used.

    Document preprocessing. At this step, the simplest but necessary transformations are performed on the documents to represent them in the form that Text Mining methods work with. The purpose of such transformations is to remove unnecessary words and to give the text a more rigorous form. Preprocessing methods are described in more detail below.

    Information extraction. Extracting information from the selected documents means identifying in them the key concepts that will be analyzed.

    Application of Text Mining methods. At this step, the patterns and relations hidden in the texts are extracted. This step is central to the whole process of text analysis, and it is here that the practical tasks described below are solved.

    Interpretation of results. The final step of the knowledge discovery process is the interpretation of the found results. As a rule, interpretation consists either in presenting the results in natural language or in visualizing them graphically.

Visualization can also be used as a text analysis tool. To this end, the key concepts extracted from the text are presented in graphical form. This approach helps the user to quickly identify the main topics and concepts and to determine their importance.

Preliminary processing of text

One of the main problems of text analysis is the large number of words in a document. If every one of these words is analyzed, the time needed to search for new knowledge grows sharply and rarely meets users' requirements. At the same time, it is obvious that not all words in a text carry useful information. Moreover, because of the flexibility of natural languages, formally different words (synonyms, etc.) actually denote the same concepts. Thus, removing uninformative words and reducing words that are close in meaning to a single form considerably shortens the time of text analysis. These problems are solved at the text preprocessing stage.

The following preprocessing methods are used to remove uninformative words and to make the texts more rigorous:

    Removal of stop words. Stop words are auxiliary words that carry little information about the content of a document.

    Stemming — morphological analysis. It consists in converting each word to its normal (base) form.

    N-grams, an alternative to morphological analysis and stop-word removal. They make the text representation more rigorous, though they do not solve the problem of reducing the number of uninformative words.

    Case conversion. This technique consists in converting all characters to upper or lower case.

These methods are most effective when used in combination.
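As a rough illustration, the sketch below combines three of the listed steps — case conversion, stop-word removal, and optional character n-grams. The tiny stop-word list and the sample texts are invented, and stemming is left out for brevity.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # tiny invented list

def preprocess(text, n=None):
    """Lower-case the text, keep alphabetic tokens, drop stop words;
    if n is given, return character n-grams of the cleaned text instead."""
    tokens = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    if n is None:
        return tokens
    joined = " ".join(tokens)
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]

print(preprocess("The Mining of the Text"))  # ['mining', 'text']
print(preprocess("data", n=3))               # ['dat', 'ata']
```

A production pipeline would use a full stop-word list for the target language and a real stemmer or lemmatizer on top of this skeleton.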

Text Mining tasks

Today's literature describes many applied tasks that are solved by analyzing text documents. Among them are the classic Data Mining tasks of classification and clustering, as well as tasks specific to text documents: automatic summarization, key concept extraction, and others.

Classification is a standard Data Mining task. Its purpose is to assign each document to one or more predefined categories to which the document belongs. A peculiarity of the classification task is the assumption that the set of documents being classified contains no “garbage,” i.e., that each of the documents corresponds to one of the given categories.

A special case of the classification task is determining the subject matter of a document.

The goal of document clustering is the automatic discovery of groups of semantically similar documents in a given fixed collection. Note that the groups are formed solely on the basis of pairwise similarity of document descriptions, and no characteristics of these groups are specified in advance.
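Pairwise similarity of documents is commonly measured as the cosine between their bag-of-words term-frequency vectors; a minimal sketch follows, with invented two- and three-word “documents”:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bag-of-words term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("data mining methods", "text mining methods"))  # ≈ 2/3
print(cosine_similarity("data mining", "data mining"))                  # ≈ 1.0
```

A clustering algorithm would then group documents whose pairwise cosine similarity is high; real systems usually weight the term counts (e.g., TF-IDF) before comparing.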

Automatic summarization makes it possible to shorten a text while preserving its meaning. The task is usually controlled by the user, who specifies the number of sentences to extract or the percentage of the text to be kept relative to the whole. The result includes the most significant sentences of the text.
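One common extraction heuristic, sketched below on invented sentences, scores each sentence by the corpus-wide frequency of its words and keeps the top-scoring ones; real summarizers add positional, length, and redundancy cues on top of this.

```python
from collections import Counter

def summarize(sentences, k=1):
    """Score each sentence by the total corpus frequency of its words
    and keep the k highest-scoring sentences."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(sentences, key=lambda s: -sum(freq[w] for w in s.lower().split()))
    return ranked[:k]

doc = [
    "Data mining finds hidden patterns",
    "Patterns can be useful",
    "Unrelated filler sentence",
]
print(summarize(doc))  # the first sentence scores highest
```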

The primary goal of key concept extraction is identifying facts and relations in the text. In most cases, such concepts are nouns and proper names: people's first and last names, organization names, etc. Concept extraction algorithms can use dictionaries to identify some terms and linguistic patterns to determine others.

Text-base navigation allows users to move through documents by topics and significant terms. This is done by identifying key concepts and some of the relations between them.

Trend analysis makes it possible to identify trends in sets of documents over a period of time. A trend can be used, for example, to detect a shift in a company's interests from one market segment to another.

Association search is also one of the main Data Mining tasks. Here, associations between the key concepts are identified in a given set of documents.

There are quite a few varieties of the listed tasks, as well as methods for solving them. This once again confirms the importance of text analysis. The rest of this section considers solutions to the following tasks: key concept extraction, classification, clustering, and automatic summarization.

Classification of text documents

The classification of text documents, like the classification of objects, consists in assigning a document to one of previously known classes. Classification applied to text documents is often also called categorization or rubrication. These names clearly come from the task of organizing documents into catalogs, categories, and rubrics. The catalog structure can be single-level or multi-level (hierarchical).

Formally, the task of classifying text documents is described by a set of sets.

The classification task then consists in constructing, on the basis of these data, a procedure that finds the most probable category from this set for the document under study.

Most text classification methods are based, one way or another, on the assumption that documents belonging to the same category contain the same features (words or phrases), and that the presence or absence of such features in a document indicates whether or not it belongs to a particular topic.

Such a set of features is often called a dictionary, since it consists of lexemes — the words and/or phrases that characterize a category.

Note that these feature sets are what distinguish the classification of text documents from the classification of objects in Data Mining, where objects are characterized by a set of attributes.

The decision to assign a document d to a category c is made on the basis of the intersection of the document's features with the category's dictionary.

The goal of classification methods is to select such features as well as possible and to formulate the rules on which the decision to assign a document to a category will be based.
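A deliberately simplified sketch of this dictionary-based idea: each category is given a hand-built lexicon (both lexicons below are invented), and a document is assigned to the category whose lexicon its words overlap most.

```python
# Hypothetical hand-built lexicons (dictionaries) for two categories.
CATEGORY_LEXICONS = {
    "sport": {"match", "goal", "team"},
    "finance": {"market", "stock", "bank"},
}

def classify(document):
    """Assign the category whose lexicon overlaps the document's words most,
    or None when no lexicon term occurs at all."""
    words = set(document.lower().split())
    best = max(CATEGORY_LEXICONS, key=lambda c: len(words & CATEGORY_LEXICONS[c]))
    return best if words & CATEGORY_LEXICONS[best] else None

print(classify("The team scored a late goal"))  # 'sport'
print(classify("Nothing relevant here"))        # None
```

Practical classifiers learn the feature weights from labeled examples (e.g., naive Bayes or linear models) rather than relying on fixed lexicons, but the decision is still driven by the overlap between the document's features and each category's dictionary.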

Tools for analyzing text information

    Oracle tools — Oracle Text

Starting with Oracle 7.3.3, text analysis tools have been an integral part of Oracle products. These tools evolved into Oracle Text, a software suite integrated into the DBMS that makes it possible to work effectively with queries over unstructured texts. Text processing here is combined with the facilities the user is given for working with relational databases; in particular, SQL can be used when writing text-processing applications.

The main task that Oracle Text is aimed at is searching documents by their content — by words or phrases, which can be combined with Boolean operations when necessary. Search results are ranked by significance, taking into account the frequency of the query words in the found documents.

    IBM tools — Intelligent Miner for Text

The IBM Intelligent Miner for Text product is a set of separate utilities that are launched from the command line or from scripts independently of one another. The system combines a number of utilities for solving various tasks of text information analysis.

IBM Intelligent Miner for Text unites a powerful set of tools based mainly on information retrieval mechanisms, which is the specific feature of the whole product. The system consists of a number of basic components that have independent significance outside the Text Mining technology:

    SAS Institute tools — Text Miner

The American company SAS Institute has released the SAS Text Miner system for analyzing grammatical and lexical patterns in written language. Text Miner is quite versatile, since it can work with text documents of various formats — from databases, from file systems, and even from the web.

Text Miner provides text processing within the SAS Enterprise Miner package. This allows users to enrich the data analysis process by integrating unstructured text information with existing structured data such as age, income, and consumer purchasing patterns.

Main points

    Knowledge discovery in text is the non-trivial process of identifying truly new, potentially useful, and understandable patterns in unstructured text data.

    The process of analyzing text documents can be represented as a sequence of several steps: information retrieval, document preprocessing, information extraction, application of Text Mining methods, and interpretation of results.

    The following preprocessing methods are used to remove uninformative words and make the texts more rigorous: stop-word removal, stemming, N-grams, and case conversion.

    The tasks of text information analysis are classification, clustering, automatic summarization, key concept extraction, text navigation, trend analysis, association search, and others.

    Key concept extraction from texts can be viewed both as a separate applied task and as a stage of text analysis, in which the facts extracted from the text are used to solve various analysis tasks.

    The process of key concept extraction with templates consists of two stages: in the first, individual facts are extracted from the text documents by lexical analysis; in the second, the extracted facts are integrated and/or new facts are derived.

    Most text classification methods are based, one way or another, on the assumption that documents belonging to the same category contain the same features (words or phrases), and that the presence or absence of such features in a document indicates whether or not it belongs to a particular topic.

    Most clustering algorithms require that the data be represented in the vector space model, which is widely used in information retrieval and uses the metaphor of reflecting semantic similarity as spatial proximity.

    There are two main approaches to the automatic summarization of text documents: extraction (selecting the most important fragments) and abstraction (generating a summary that paraphrases previously gathered knowledge).

Conclusion

Intelligent data analysis is one of the most relevant and sought-after areas of applied mathematics. Modern business processes and production generate huge amounts of data, and it is becoming harder and harder for people to interpret and react to large volumes of data that change dynamically over time, to say nothing of anticipating critical situations. The aim of “intelligent data analysis” is to extract as much useful knowledge as possible from large, diverse, incomplete, imprecise, contradictory, and indirect data, and to do it efficiently when the data volume is measured in gigabytes or even terabytes. Algorithms are also being built that learn to make decisions in various professional fields.

The tools of “intelligent data analysis” protect people from information overload by converting operational data into useful information so that the necessary actions can be taken at the right time.

Applied work is being carried out in the following areas: forecasting in economic systems; automation of marketing research and analysis of client environments for manufacturing, trading, telecommunications, and Internet companies; automation of credit decision-making and credit risk assessment; monitoring of financial markets; automatic trading systems.


Data Mining

Data Mining is the methodology and process of discovering, in the large volumes of data accumulated in the information systems of companies, knowledge that is previously unknown, non-trivial, practically useful, and accessible to interpretation, and that is needed for decision-making in various spheres of human activity. Data Mining is one stage of the broader methodology of Knowledge Discovery in Databases.

The knowledge discovered in the Data Mining process must be non-trivial and previously unknown. Non-triviality means that such knowledge cannot be found by simple visual analysis. It must describe relationships between the properties of business objects and predict the values of some features from the values of others. The found knowledge must also be applicable to new objects.

The practical usefulness of the knowledge is determined by the possibility of using it in the process of supporting management decisions and improving the company's performance.

The knowledge must be presented in a form that is understandable to users without special mathematical training. For example, logical constructions of the form “if ..., then ...” are the easiest for a person to grasp. Moreover, such rules can be used in various DBMSs as SQL queries. When the extracted knowledge is not transparent to the user, post-processing methods must be applied to bring it to an interpretable form.

Data Mining is not one method but a collection of a large number of different knowledge discovery methods. All tasks solved by Data Mining methods can be conventionally divided into six types:

Data Mining is multidisciplinary in nature, drawing on elements of numerical methods, mathematical statistics and probability theory, information theory and mathematical logic, artificial intelligence, and machine learning.

The tasks of business analysis are formulated in different ways, but most of them reduce to one or another Data Mining task or to a combination of such tasks. For example, risk assessment is a classification and regression task, market segmentation is clustering, and demand stimulation is association rules. In effect, the Data Mining tasks are the elements from which a solution to most real business problems can be “assembled.”

Various methods and algorithms of Data Mining are used to solve the tasks listed. Since Data Mining has developed, and continues to develop, on the basis of such disciplines as mathematical statistics, information theory, machine learning, and databases, it is entirely natural that most Data Mining algorithms and methods were built on the basis of various methods from these disciplines. For example, the k-means clustering algorithm comes from statistics.

Artificial neural networks, genetic algorithms, evolutionary programming, associative memory, and fuzzy logic are widely used as well. Data Mining methods are often contrasted with statistical methods (descriptive analysis, correlation and regression analysis, factor analysis, analysis of variance, component analysis, discriminant analysis, time series analysis). Such methods, however, presuppose a priori notions about the analyzed data, which somewhat diverges from the goals of Data Mining (the discovery of previously unknown, non-trivial, and practically useful knowledge).

One of the most important advantages of Data Mining methods is the visual presentation of the computation results, which allows Data Mining tools to be used by people without special mathematical training. At the same time, applying statistical methods to data analysis requires a good command of probability theory and mathematical statistics.

Introduction

Data Mining methods (or, which is the same thing, Knowledge Discovery In Data, KDD for short) lie at the intersection of databases, statistics, and artificial intelligence.

Historical digression

The field of Data Mining began with a workshop held by Gregory Piatetsky-Shapiro in 1989.

Earlier, while working at GTE Labs, Gregory Piatetsky-Shapiro had become interested in whether certain rules could be discovered automatically to speed up queries against large databases. Two terms were proposed at that time: Data Mining and Knowledge Discovery In Data (knowledge discovery in databases).

Statement of the problem

Initially, the task is stated as follows:

  • there is a fairly large database;
  • it is assumed that the database contains some “hidden knowledge.”

It is necessary to develop methods for discovering knowledge hidden in large volumes of “raw” data.

What does “hidden knowledge” mean? It must necessarily be knowledge that is:

  • previously unknown — that is, new knowledge (not a confirmation of information obtained earlier);
  • non-trivial — that is, knowledge that cannot simply be seen (by direct visual analysis of the data or by computing simple statistical characteristics);
  • practically useful — that is, knowledge that is of value to the researcher or the consumer;
  • accessible to interpretation — that is, knowledge that is easy to present in a visual form and easy to explain in terms of the subject area.

These requirements largely determine the essence of Data Mining methods, as well as the form in which, and the proportions in which, Data Mining technology draws on database management systems, statistical analysis methods, and artificial intelligence methods.

Data Mining and databases

It makes sense to apply Data Mining methods only to fairly large databases. Each specific application field has its own criterion of a database's “largeness.”

The development of database technology first led to the creation of a specialized language — the database query language. For relational databases this is SQL, which provides ample possibilities for creating, modifying, and retrieving stored data. Then the need arose to obtain analytical information (for example, information about a company's activity over a certain period), and here it turned out that traditional relational databases, well suited, for example, to keeping operational records in a company, are poorly suited to analysis. This, in turn, led to the creation of so-called “data warehouses,” whose very structure best serves comprehensive mathematical analysis.
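As a small illustration of an analytical (aggregate) query of the kind described here, the sketch below runs one against an in-memory SQLite table; the table and its contents are invented for the example.

```python
import sqlite3

# Invented sales table; an aggregate query of the analytical kind
# that plain SQL supports directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 70.0)],
)
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 70.0)]
```

Data warehouses take this idea further: their schemas (star and snowflake layouts) are organized precisely so that such aggregations over large histories run efficiently.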

Data Mining and statistics

Data Mining methods are based on mathematical methods of data processing, including statistical ones. In industrial solutions, such methods are often included directly in Data Mining packages. It should be kept in mind, however, that, first, for the sake of simplicity researchers often apply parametric tests instead of non-parametric ones, and, second, the results of the analysis are hard to interpret, which is completely at odds with the goals and objectives of Data Mining. Nevertheless, statistical methods are used, although their application tends to be limited to the early stages of data investigation.

Data Mining and artificial intelligence

Knowledge obtained by Data Mining methods is usually represented in the form of models. These models include:

  • association rules;
  • decision trees;
  • clusters;
  • mathematical functions.

Methods for building such models are usually attributed to the field of so-called “artificial intelligence.”

Tasks

Tasks solved by Data Mining methods are usually divided into descriptive and predictive.

In descriptive tasks, the most important thing is to give a clear description of the hidden patterns present in the data, while in predictive tasks the central question is predicting values for cases for which there are no data yet.

The descriptive tasks include:

  • search for association rules and patterns;
  • grouping of objects, cluster analysis;
  • construction of a regression model.

The predictive tasks include:

  • classification of objects (for a set of predefined classes);
  • regression analysis, time series analysis.

Learning algorithms

Classification tasks are characterized by “supervised learning” (learning with a teacher), in which the model is built on a training sample containing both input and output vectors.

For clustering and association tasks, “unsupervised learning” (learning without a teacher) is used, in which the model is built on a sample that has no output parameter. The value of the output parameter (“belongs to cluster ...,” “resembles vector ...”) is chosen automatically during learning.
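As a concrete instance of unsupervised learning, the sketch below runs a tiny one-dimensional k-means on invented data points: cluster membership and the cluster centers emerge automatically, with no output values given in advance.

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Tiny k-means on 1-D data: alternate point assignment and centroid update."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5]))  # two centers, near 1 and 10
```

On this data the algorithm converges to centers near 1 and 10 regardless of the random initialization; real implementations work on multi-dimensional vectors and add smarter seeding (e.g., k-means++).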

Description reduction tasks are characterized by the absence of a division into input and output vectors. Beginning with the classic work of K. Pearson on the method of principal components, the main attention here is paid to data approximation.

Stages of learning

A typical series of stages can be distinguished in solving problems by Data Mining methods:

  1. hypothesis formation;
  2. data collection;
  3. data preparation (filtering);
  4. model selection;
  5. selection of the model parameters and of the learning algorithm;
  6. model training (automatic search for the remaining model parameters);
  7. analysis of the quality of training; if it is unsatisfactory, return to step 5 or step 4;
  8. analysis of the discovered patterns; if it is unsatisfactory, return to step 1, 4, or 5.

Data preparation

Before using Data Mining algorithms, it is necessary to prepare a data set. So, since IAD can reveal only the patterns present in the data, the output data from one side must be sufficient for these patterns to be present, and on the other hand, be sufficiently compact for the analysis to be accepted nth hour. Most often, collections or data windows act as output data. Preparation is necessary for analyzing a wealth of data before clustering and data mining.

The cleared data is reduced to sets of characters (or vectors, since the algorithm can only work with vectors of fixed dimension), one set of caution signs. The set of signs is formed according to hypotheses about those signs of raw data that have a high predictive power due to the expansion of the necessary computational effort for processing. For example, a black and white image with a size of 100x100 pixels is 10 thousand. bit of sirikh data. The stench can be transformed into a vector sign with a path revealed in the image of the eyes and mouth. As a result, there will be a change in the data obligation from 10 thousand. bit to the list of formation codes, which means changing the obligation to analyze the data, then, and an hour of analysis.

A number of algorithms can process missing data and have predictive power (for example, the number of purchases a client makes). Let's say, with the use of the method of association rules (English) Russian It is not sign vectors that are formed, but sets of variable dimensions.

The choice of the objective function depends on the method of analysis; choosing the "right" function is fundamental to successful data mining.

The observations are divided into two categories: the training set and the test set. The training set is used to train the Data Mining algorithm, and the test set is used to verify the patterns that were found.
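A minimal train/test split might look like the sketch below. It is illustrative only: the 80/20 ratio, the stand-in data and the shuffling seed are arbitrary choices, not prescriptions.

```python
import random

# Hypothetical sketch: split observations into a training set (used to fit
# the model) and a test set (used to verify the found patterns on unseen data).

observations = list(range(100))          # stand-in for real feature vectors

rng = random.Random(42)                  # fixed seed for reproducibility
shuffled = observations[:]
rng.shuffle(shuffled)

split = int(0.8 * len(shuffled))         # 80/20 split, an arbitrary choice
train_set, test_set = shuffled[:split], shuffled[split:]

print(len(train_set), len(test_set))     # 80 20
```

Shuffling before splitting matters: if the source data is ordered (say, by date), an unshuffled split would test the model on a systematically different sample.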

See also

  • Reshetov's probabilistic neural network

Notes

Literature

  • Paklin N. B., Oreshkov V. I. Business Analytics: From Data to Knowledge (with CD). St. Petersburg: Piter, 2009. 624 p.
  • Duke V., Samoilenko A. Data Mining: A Training Course (with CD). St. Petersburg: Piter, 2001. 368 p.
  • Zhuravlev Yu. I., Ryazanov V. V., Senko O. V. Recognition. Mathematical Methods. Software System. Practical Applications. Moscow: Fazis, 2006. 176 p. ISBN 5-7036-0108-8.
  • Zinoviev A. Yu. Visualization of Multidimensional Data. Krasnoyarsk: Krasnoyarsk State Technical University Publ., 2000. 180 p.
  • Chubukova I. A. Data Mining: A Training Course. Moscow: Internet University of Information Technologies: BINOM: Knowledge Laboratory, 2006. 382 p. ISBN 5-9556-0064-7.
  • Ian H. Witten, Eibe Frank and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 3rd Edition. Morgan Kaufmann, 2011. 664 p. ISBN 9780123748560.

Links

  • Data Mining Software in the Open Directory Project (dmoz) link directory.

Wikimedia Foundation. 2010.

Welcome to the Data Mining portal, a unique portal dedicated to modern Data Mining methods.

Data Mining technologies are a powerful tool of modern business analytics and data research, used to discover hidden patterns and build predictive models. Data Mining and knowledge discovery rely not on superficial reasoning but on real data.

Fig. 1. The Data Mining workflow

Problem Definition – statement of the problem: data classification, segmentation, building predictive models, forecasting.
Data Gathering and Preparation – collection and preparation of data, cleaning, verification, removal of duplicate records.
Model Building – building models and assessing their accuracy.
Knowledge Deployment – applying the model to the problem at hand.

Data Mining is used to implement large-scale analytical projects in business, marketing, the Internet, telecommunications, industry, geology, medicine, pharmaceuticals and other fields.

Data Mining makes it possible to search for significant correlations and relationships by sifting through huge arrays of data with modern pattern-recognition methods and unique analytical technologies, including decision trees, classification, clustering, neural-network methods and others.

A researcher encountering data mining technology for the first time is struck by the number of methods and effective algorithms that make it possible to approach the hardest problems of analyzing large volumes of data.

In general, Data Mining can be characterized as a technology designed to search large volumes of data for non-obvious, objective and practically useful patterns.

Data Mining rests on effective methods and algorithms developed for analyzing unstructured data of large volume and high dimensionality.

The key point is that data of large volume and high dimensionality appear to lack any structure or relationships. The goal of data mining technology is to reveal structure and find patterns where, at first glance, chaos and arbitrariness reign.

Here is a recent example from the pharmaceutical and medical industries.

Drug interactions are a growing problem for modern health care.

Over the years, the number of drugs prescribed (over-the-counter medicines and all kinds of supplements) keeps growing, causing more and more drug-drug interactions that can produce serious side effects of which doctors and patients are unaware.

This area belongs to post-clinical research, carried out after a drug has already been released to the market and is in intensive use.

Clinical trials assess the effectiveness of a drug, but they poorly capture the interactions of that drug with other drugs already on the market.

Researchers at Stanford University in California studied the FDA database of drug side effects and found that two commonly used drugs, the antidepressant paroxetine and the cholesterol-lowering drug pravastatin, increase the risk of developing diabetes when taken together.

A similar analysis of FDA data identified 47 previously unknown adverse interactions.

Remarkably, because of reporting bias, many side effects noticed by patients remain undetected. It is precisely in such cases that computer-based search can show itself at its best.

Upcoming Data Mining courses at the StatSoft Academy of Data Analysis in 2020

We begin our introduction to Data Mining using the excellent videos of the Academy of Data Analysis.

Be sure to watch our videos, and you will understand what Data Mining is!

Video 1. What is Data Mining?


Video 2. An overview of data mining methods: decision trees, predictive models, clustering and much more



Before launching a real project, we need to organize the process of extracting data from external sources; we will now show how to do this.

The video introduces the unique STATISTICA In-place database processing technology and the connection between Data Mining and real data.

Video 3. Working with databases: a graphical interface for building SQL queries, In-place database processing technology
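The kind of SQL-based extraction such a query builder generates can be sketched with Python's built-in sqlite3 module. The table, rows and query below are invented for illustration; a graphical tool would produce an equivalent SELECT statement against a real database.

```python
import sqlite3

# Hypothetical sketch: pull analysis-ready rows out of a database with SQL,
# the same kind of query a graphical query builder would generate.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("printer", 120.0), ("scanner", 80.0), ("printer", 150.0)])

rows = conn.execute(
    "SELECT product, SUM(amount) AS total "
    "FROM sales GROUP BY product ORDER BY total DESC"
).fetchall()

print(rows)   # [('printer', 270.0), ('scanner', 80.0)]
conn.close()
```

Aggregating in the database before analysis (the GROUP BY above) is the same idea as in-place processing: the data is reduced at the source rather than copied out wholesale.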



Next we turn to interactive drill-down technologies, which are effective for exploratory data analysis. The term drilling itself links Data Mining technology with geological exploration.

Video 4. Interactive drill-down: exploratory and graphical methods for interactive data investigation



Now let us turn to the analysis of association rules, whose algorithms make it possible to find relationships present in real data. The key point is the efficiency of these algorithms on large data sets.

The output of association-rule algorithms, for example the Apriori algorithm, is a set of association rules that hold in the data with a given confidence, for example 80%.

In geology, these algorithms can be applied in prospecting for mineral resources, for example to analyze whether feature A occurs together with features B and C.

You can learn about specific applications of such rules from our examples:

In retail, the Apriori algorithm and its modifications make it possible to analyze relationships between goods, for example in the sale of perfumery products (perfume, nail polish, mascara and so on) and products of different brands.

Analysis of visits to the sections of a website can also be carried out effectively with the help of association rules.
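The support/confidence idea behind such rules can be sketched as follows. This is a toy example: the transactions and the confidence threshold are invented, and a real Apriori implementation would additionally prune candidate itemsets level by level rather than enumerate all pairs.

```python
from itertools import combinations

# Hypothetical sketch of association-rule mining: for each pair of items,
# count how often they occur together and compute the confidence of the
# rule A -> B. Real Apriori prunes candidate itemsets iteratively.

transactions = [
    {"perfume", "nail polish", "mascara"},
    {"perfume", "nail polish"},
    {"perfume", "mascara"},
    {"nail polish", "mascara"},
    {"perfume", "nail polish", "mascara"},
]

items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    both = sum(1 for t in transactions if a in t and b in t)
    for lhs, rhs in ((a, b), (b, a)):
        n_lhs = sum(1 for t in transactions if lhs in t)
        confidence = both / n_lhs
        if confidence >= 0.75:                    # threshold is arbitrary
            rules.append((lhs, rhs, round(confidence, 2)))

print(rules)
```

Each emitted rule reads "customers who bought lhs also bought rhs with the given confidence"; on large data sets the whole point of Apriori is to avoid the brute-force pair enumeration used here.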

Learn more about this in the next video.

Video 5. Association rules


Now let us apply Data Mining to specific areas.

E-commerce:

  • analysis of customer trajectories, from visiting the site to purchasing goods
  • assessment of service effectiveness, analysis of failures connected with the lack of goods
  • linking of goods for recommendations to visitors

Retail trade: analysis of information about buyers collected through credit cards, discount cards and so on.

Typical tasks in retail trade solved with Data Mining:

  • market-basket analysis;
  • building predictive models and classification models of buyers and of the goods they purchase;
  • creation of buyer profiles;
  • CRM, assessment of the loyalty of different categories of buyers, planning of loyalty programs;
  • analysis of time series and seasonal dependencies, identification of seasonal factors, assessment of the effectiveness of advertising campaigns on a wide range of real data.
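Estimating seasonal factors, one of the tasks listed above, can be sketched like this. The quarterly sales figures are invented for illustration; real implementations would also detrend the series first.

```python
# Hypothetical sketch: estimate quarterly seasonal factors as the ratio of
# each quarter's average sales to the overall average.

sales = [100, 120, 180, 160,    # year 1, quarters 1-4
         110, 130, 190, 170]    # year 2, quarters 1-4

overall = sum(sales) / len(sales)
factors = []
for q in range(4):
    quarter_avg = (sales[q] + sales[q + 4]) / 2
    factors.append(round(quarter_avg / overall, 3))

print(factors)
```

A factor above 1 marks a seasonally strong quarter; dividing the raw series by these factors yields a deseasonalized series on which trends or campaign effects are easier to see.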

The telecommunications sector offers unique opportunities for data mining methods, as well as for modern big data technologies:

  • classification of clients based on key call characteristics (frequency, duration, etc.) and SMS frequency;
  • identification of customer loyalty;
  • fraud detection, and others.

Insurance:

  • risk analysis. By identifying combinations of factors associated with paid claims, insurers can reduce their losses on liabilities. There is a known case where an insurance company discovered that the amounts paid out on the claims of married people were twice the amounts paid out on the claims of single people. The company responded by revising its discount policy for family clients.
  • fraud detection. Insurance companies can reduce the level of fraud by looking for characteristic patterns in fraudulent claims that describe the interactions between lawyers, doctors and claimants.
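A finding like the one above, where one group of clients generates much larger payouts than another, amounts to grouping claims by an attribute and comparing group averages. A minimal sketch, with invented claim amounts:

```python
from collections import defaultdict

# Hypothetical sketch: group claim payouts by a client attribute and
# compare the group averages.

claims = [("married", 2000), ("married", 2400), ("single", 1000),
          ("single", 1200), ("married", 2200), ("single", 1100)]

by_status = defaultdict(list)
for status, amount in claims:
    by_status[status].append(amount)

averages = {s: sum(a) / len(a) for s, a in by_status.items()}
ratio = averages["married"] / averages["single"]
print(averages, round(ratio, 2))
```

In practice such comparisons run over many candidate attributes at once, and statistical tests are needed to separate real risk factors from noise.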

A more practical presentation of data mining and of specific tasks is given in the following videos.

Webinar 1. "Practical Data Mining: Problems and Solutions"


Webinar 2. "Data Mining and Text Mining: Solving Real Problems"



Deeper knowledge of the methodology and technology of data mining can be obtained in StatSoft courses.

These elements of artificial intelligence are actively used in practical management. Unlike traditional artificial-intelligence systems, the technology of intelligent data search and analysis, or data mining (Data Mining, DM), does not try to model natural intelligence but strengthens it with the power of modern computing servers, search engines and data warehouses. The words "Data Mining" often stand next to the words "Knowledge Discovery in Databases".

Fig. 6.17.

Data Mining is the process of discovering in raw data previously unknown, non-trivial, practically useful and interpretable knowledge needed for decision-making in various spheres of human activity. Data Mining is of great value to managers and analysts in their daily work. Business people have realized that Data Mining methods can give them significant competitive advantages.

Modern Data Mining technology (discovery-driven data mining) is based on the concept of patterns, which represent fragments of multidimensional relationships in data. These patterns are regularities, inherent in subsamples of the data, that can be compactly expressed in human-readable form. Patterns are sought by methods that are not constrained by a priori assumptions about the structure of the sample or the distributions of the analyzed indicators. Fig. 6.17 shows a diagram of data transformation using Data Mining technology.

Fig. 6.18.

The basis of all forecasting systems is historical information stored in databases as time series. If patterns can be built that adequately reflect the dynamics of the target indicators, there is a good chance that they can be used to predict the system's behavior in the future. Fig. 6.18 shows the full cycle of applying Data Mining technology.
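The idea of extrapolating a historical time series can be sketched with a deliberately naive moving-average forecast. The series is invented, and the method is a stand-in for the richer pattern models that Data Mining actually builds.

```python
# Hypothetical sketch: forecast the next value of a time series as the
# average of the last k observations -- a naive stand-in for the pattern
# models that Data Mining builds from historical data.

history = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]

def moving_average_forecast(series, k=3):
    window = series[-k:]
    return sum(window) / len(window)

forecast = moving_average_forecast(history)
print(forecast)   # 13.0
```

Whatever the model, the cycle is the same: fit patterns on the historical window, forecast, then compare the forecast with newly arriving data and refit.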

An important feature of Data Mining is the non-triviality of the patterns sought. This means that the patterns found should reflect non-obvious, unexpected regularities in the data, constituting so-called hidden knowledge. Business people have come to understand that "raw data" contains a deep layer of knowledge, and with careful excavation real nuggets can be uncovered and used in competition.

The sphere of application of Data Mining is unrestricted: the technology can be applied wherever large amounts of "raw" data exist!


Data Mining methods were first adopted mainly by commercial enterprises running projects based on data warehouses (Data Warehousing). The experience of many such enterprises shows that the return on Data Mining can reach 1000%. There are reports of an economic effect 10 to 70 times greater than initial costs of 350 to 750 thousand dollars, and of a 20-million-dollar project that paid for itself in only 4 months. Another example is annual savings of 700 thousand dollars from introducing Data Mining in a supermarket chain in Great Britain.

Microsoft has officially announced that it is stepping up its activity in Data Mining. A dedicated research group at Microsoft, headed by Usama Fayyad, together with six invited partners (Angoss, Datasage, Epiphany, SAS, Silicon Graphics, SPSS) is preparing a joint project to develop a data-exchange standard and tools for integrating Data Mining tools with databases and data warehouses.

Data Mining is a multidisciplinary field that arose and is developing on the basis of applied statistics, pattern recognition, artificial-intelligence methods, database theory and other disciplines (Fig. 6.19). A wide variety of methods and algorithms are implemented in working Data Mining systems [Duke V.A. www.inftech.webservis.ru/it/datamining/ar2.html]. Many of these systems integrate several approaches at once. Nevertheless, as a rule, each system has some key component on which the main emphasis is placed.

There are five standard types of patterns identified by Data Mining methods: association, sequence, classification, clustering and prediction.

Fig. 6.19. Areas of application of Data Mining technology

An association occurs when several events are linked to one another. For example, a study conducted in a computer supermarket may show that 55% of those who bought a computer also took a printer or scanner, and when a discount is offered for such a bundle, a printer is bought in 80% of cases. Knowing about such an association, managers can easily assess how effective the offered discount is.

If a chain of events is linked in time, we speak of a sequence. For example, after buying a house, in 45% of cases a new kitchen stove is also bought within a month, and within two weeks 60% of new residents acquire a refrigerator.

Classification identifies the features that characterize the group to which a given object belongs. This is done by analyzing objects that have already been classified and formulating a corresponding set of rules.

Clustering differs from classification in that the groups are not given in advance. Using clustering, Data Mining tools independently single out homogeneous groups of data.
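A minimal sketch of clustering, where the algorithm discovers the groups on its own, is a one-dimensional k-means. The toy data and the choice of starting centers are invented for illustration.

```python
# Hypothetical sketch of k-means on one-dimensional data: points are
# assigned to the nearest center, then centers are recomputed, until the
# assignment stops changing. No group labels are given in advance.

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [points[0], points[-1]]          # start from the two extremes

for _ in range(10):                        # a few iterations suffice here
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    new_centers = [sum(c) / len(c) for c in clusters]
    if new_centers == centers:
        break
    centers = new_centers

print(centers, clusters)
```

The algorithm receives only the points and the desired number of groups; the two homogeneous clusters emerge from the data itself, which is exactly what distinguishes clustering from classification.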