
Coursework: OLAP Technology. Multidimensional data representation. The general scheme of data warehouse organization. Characteristics, types, and main differences between OLAP and OLTP technologies. Star and snowflake schemas. Aggregation functions in OLAP technology


The aim of this course work is to study OLAP technology, the concepts behind its implementation, and its structure.

In the modern world, computer networks and computing systems make it possible to analyze and process large amounts of data.

A large amount of information greatly complicates the search for solutions, but it also makes much more accurate calculations and analysis possible. To address this problem, a whole class of information systems that perform analysis has emerged. Such systems are called decision support systems (DSS).

To perform analysis, a DSS must accumulate information and provide means for entering and storing it. In total, a DSS solves three main tasks:

· data input;

· data storage;

· data analysis.

Data entry into the DSS is carried out either automatically, from sensors that characterize the state of the environment or process, or by a human operator.

If data input is carried out automatically from sensors, data is accumulated either on a readiness signal that fires when new information appears or by cyclic polling. If the input is carried out by a person, the system should provide users with convenient means for entering data, checking it for correctness, and performing the necessary calculations.

When several operators enter data simultaneously, the problems of concurrent access to and modification of the same data must be solved.

A DSS presents data to analysts in the form of reports, tables, and graphs for study and analysis, which is why such systems are said to provide decision support functions.

Operational data processing is implemented in data entry subsystems called OLTP (On-Line Transaction Processing) systems. Conventional database management systems (DBMS) are used to implement them.

The analysis subsystem can be built on the basis of:

· information retrieval analysis subsystems, built on relational DBMSs and static queries in SQL;

· operational analysis subsystems. Such subsystems are implemented with OLAP, the technology of on-line analytical data processing, which uses the concept of multidimensional data representation;

· intelligent analysis subsystems, which implement Data Mining methods and algorithms.

From the user's point of view, OLAP systems provide means of flexibly viewing information in various slices, automatically obtaining aggregated data, and performing analytical operations such as roll-up, drill-down, and comparison over time. Thanks to all this, OLAP systems are a solution with great advantages in preparing data for all types of business reporting that present data in different sections and at different levels of hierarchy, such as sales reports and various budget forms. OLAP systems carry the same advantages into other forms of data analysis, including forecasting.
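To make this concrete, below is a minimal sketch of such "slicing with automatic aggregation" in Python with pandas; the sales data and column names are invented for illustration.

    # Hypothetical sales data; pandas stands in for the OLAP engine.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "North", "South"],
        "product": ["A", "B", "A", "B", "A", "B"],
        "year":    [2020, 2020, 2020, 2020, 2021, 2021],
        "amount":  [100, 150, 90, 120, 110, 130],
    })

    # One slice of the cube: totals by region and year, with grand totals.
    report = sales.pivot_table(index="region", columns="year",
                               values="amount", aggfunc="sum", margins=True)
    print(report)

Changing the index and columns arguments yields another slice of the same data, which is exactly the kind of interactive re-sectioning described above.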

1.2 Definition of OLAP systems

The technology of complex multidimensional data analysis is called OLAP. OLAP is a key component of a data warehouse (DW) organization.

OLAP functionality can be implemented in various ways, from the simplest, such as data analysis in office applications, to more complex distributed analytical systems based on server products.

OLAP (On-Line Analytical Processing) is a technology for on-line analytical data processing that uses tools and methods for collecting, storing, and analyzing multidimensional data in support of decision-making processes.

The main purpose of OLAP systems is to support analytical activity and arbitrary (ad hoc) queries from analyst users. The purpose of OLAP analysis is to test emerging hypotheses.

The conditions of intense competition and the growing dynamics of the external environment impose increased requirements on enterprise management systems. The development of management theory and practice has been accompanied by the emergence of new methods, technologies, and models aimed at improving the efficiency of operations; these, in turn, have fostered the emergence of analytical systems. Demand for analytical systems in Russia is high. These systems are most interesting from the point of view of application in the financial sector: banks, the insurance business, and investment companies. The results produced by analytical systems are needed above all by the people on whose decisions the development of a company depends: managers, experts, and analysts.

Analytical systems solve problems of consolidation, reporting, optimization, and forecasting. To date, no final classification of analytical systems has taken shape, nor is there a common system of definitions for the terms used in this field. The information structure of an enterprise can be represented as a sequence of levels, each characterized by its own way of processing and managing information and having its own function in the management process. Analytical systems are thus arranged hierarchically across the different levels of this infrastructure:

· transactional systems layer;

· data warehouse layer;

· data mart layer;

· OLAP systems layer;

· analytical applications layer.

OLAP systems (OnLine Analytical Processing, analytical processing in real time) are a technology for complex multidimensional data analysis. OLAP systems are applicable wherever there is a task of analyzing multifactor data. They are an effective tool for analysis and report generation. The data warehouses, data marts, and OLAP systems discussed above are classified as business intelligence (BI) systems.

Very often, information-analytical systems created for direct use by decision-makers turn out to be extremely easy to use but severely limited in functionality. Such static systems are called Executive Information Systems (EIS) in the literature. They contain predefined sets of queries and, while sufficient for day-to-day review, are unable to answer all the questions about the available data that may arise when making decisions. As a rule, the result of such a system's work is multi-page reports, after a thorough study of which the analyst has a new series of questions. However, each new request not foreseen in the design of such a system must first be formally described, coded by a programmer, and only then executed. The waiting time in this case can be hours or days, which is not always acceptable. Thus, the external simplicity of static DSSs, for which most customers of information-analytical systems actively fight, turns into a catastrophic loss of flexibility.



Dynamic DSSs, on the other hand, are focused on processing ad hoc analyst queries against the data. The requirements for such systems were considered most deeply by E. F. Codd in the article that laid the foundation of the OLAP concept. Analysts work with these systems in an interactive sequence of forming queries and studying their results.

But dynamic DSSs can operate in more than just the realm of online analytical processing (OLAP); support for making management decisions based on accumulated data can be carried out in three basic areas.

The sphere of detailed data. This is the domain of most information retrieval systems. In most cases, relational DBMSs cope well with the tasks that arise here; the generally accepted standard language for relational data manipulation is SQL. Information retrieval systems that give the end user an interface for searching detailed information can be used as add-ons both over individual databases of transactional systems and over a common data warehouse.

The sphere of aggregates. A comprehensive look at the information collected in the data warehouse, its generalization and aggregation, hypercube representation, and multidimensional analysis are the tasks of on-line analytical processing (OLAP) systems. Here one can either rely on special multidimensional DBMSs or stay within relational technologies. In the second case, pre-aggregated data can be collected in a star-schema database, or the information can be aggregated on the fly while scanning the detailed tables of the relational database.

The sphere of regularities. Intelligent processing is carried out by Data Mining methods, whose main tasks are the search for functional and logical patterns in the accumulated information and the construction of models and rules that explain the anomalies found and/or predict the development of certain processes.

On-line analytical data processing

The OLAP concept is based on the principle of multidimensional data representation. In a 1993 article, E. F. Codd examined the shortcomings of the relational model, first of all pointing out the impossibility "to combine, view and analyze data in terms of multiple dimensions, that is, in the most understandable way for corporate analysts," and identified general requirements for OLAP systems that extend the functionality of relational DBMSs and include multidimensional analysis as one of their characteristics.

Classification of OLAP products by the way data is presented

Currently, a large number of products that provide OLAP functionality to one degree or another are on the market. About 30 of the best-known are listed on the overview Web server http://www.olapreport.com/. While all OLAP products provide a multidimensional conceptual view on the user-interface side, they differ in the type of the source database and are accordingly divided into three classes.

The earliest on-line analytical processing systems (for example, Essbase from Arbor Software and Oracle Express Server from Oracle) belonged to the MOLAP class; that is, they could work only with their own multidimensional databases. They are based on proprietary multidimensional DBMS technologies and are the most expensive. These systems provide a full cycle of OLAP processing. They either include, in addition to the server component, their own integrated client interface, or use external spreadsheet programs to communicate with the user. To maintain such systems, a dedicated staff is required to install and maintain the system and to build data views for end users.

Relational On-Line Analytical Processing (ROLAP) systems make it possible to represent data stored in a relational database in multidimensional form, transforming the information into a multidimensional model through an intermediate metadata layer. ROLAP systems are well suited for working with large warehouses. Like MOLAP systems, they require significant IT maintenance effort and are multi-user.

Finally, hybrid systems (Hybrid OLAP, HOLAP) are designed to combine the advantages and minimize the disadvantages inherent in the previous classes. This class includes Speedware's Media/MR. According to the developers, it combines the analytical flexibility and responsiveness of MOLAP with the constant access to real data inherent in ROLAP.

Multidimensional OLAP (MOLAP)

In specialized DBMSs based on multidimensional data representation, data is organized not in the form of relational tables but as ordered multidimensional arrays:

1) hypercubes (all cells stored in the database must have the same dimensionality, that is, belong to the fullest possible basis of dimensions), or

2) polycubes (each variable is stored with its own set of dimensions, and all the associated processing difficulties are shifted onto the internal mechanisms of the system).
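The difference between the two schemes can be sketched in a few lines of Python with NumPy; the dimension sizes below are invented for illustration.

    import numpy as np

    # Hypercube: every measure shares one full basis of dimensions,
    # so a single dense array indexed by (region, product, month) suffices.
    n_regions, n_products, n_months = 3, 4, 12
    revenue = np.zeros((n_regions, n_products, n_months))
    revenue[0, 2, 5] = 1250.0      # one cell, addressed by its coordinates

    # Polycube: each variable keeps only its own set of dimensions.
    headcount = np.zeros((n_regions, n_months))   # no product dimension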

The use of multidimensional databases in on-line analytical processing systems has the following advantages.

When a multidimensional DBMS is used, data search and retrieval are much faster than with a multidimensional conceptual view of a relational database, because the multidimensional database is denormalized, contains pre-aggregated indicators, and provides optimized access to the requested cells.

Multidimensional DBMSs easily cope with the task of including a variety of built-in functions in the information model, while the objectively existing limitations of SQL make these tasks quite difficult, and sometimes impossible, to perform on the basis of relational DBMSs.

On the other hand, there are significant limitations.

Multidimensional DBMSs do not allow working with large databases. In addition, because of denormalization and previously performed aggregation, the volume of data in a multidimensional database typically corresponds (according to Codd's estimate) to a 2.5-100 times smaller volume of original detailed data.

Multidimensional DBMSs use external memory very inefficiently compared with relational ones. In the overwhelming majority of cases the information hypercube is highly sparse, and since the data is stored in an ordered form, undefined values can be removed only by choosing a sort order that organizes the data into the largest possible contiguous groups. But even then the problem is only partially solved. Moreover, the sort order that is optimal for storing sparse data is likely to differ from the order most often used in queries. Therefore, in real systems one has to find a compromise between performance and the redundancy of the disk space occupied by the database.

Therefore, the use of multidimensional DBMSs is justified only under the following conditions:

The volume of initial data for analysis is not too large (no more than several gigabytes), that is, the level of data aggregation is quite high.

The set of information dimensions is stable (since any change in their structure almost always requires a complete restructuring of the hypercube).

The system response time to ad hoc requests is the most critical parameter.

Extensive use of complex built-in functions is required to perform cross-dimensional calculations on cells of a hypercube, including the ability to write custom functions.

Relational OLAP (ROLAP)

Direct use of relational databases in online analytical processing systems has the following advantages.

In most cases, corporate data warehouses are implemented using relational DBMS tools, and ROLAP tools allow you to perform analysis directly on them. At the same time, the storage size is not such a critical parameter as in the case of MOLAP.

When the dimensionality of the problem is variable and changes to the dimension structure have to be made quite often, ROLAP systems with a dynamic representation of dimensionality are the optimal solution, since such modifications do not require a physical reorganization of the database.

Relational DBMSs provide a significantly higher level of data protection and good opportunities for differentiating access rights.

The main disadvantage of ROLAP compared to multidimensional DBMS is lower performance. Relational systems require extensive database schema and index tuning to achieve performance comparable to MOLAP, which means a lot of effort on the part of DBAs. Only by using star schemas can the performance of well-tuned relational systems be close to the performance of multidimensional database systems.

On-line analytical processing, or OLAP, is an effective data processing technology that produces summary information from huge arrays of all kinds of data. It is a powerful technology that helps users access, retrieve, and view information on a PC, analyzing it from different perspectives.

OLAP is a tool that supports long-term strategic planning, considering operational data over horizons of 5, 10, or more years. The data is stored in a database with dimensions as its attributes. Users can view the same dataset with different attributes, depending on the purpose of the analysis.

OLAP history

OLAP is not a new concept and has been in use for decades. In fact, the origin of the technology can be traced back to 1962. But the term itself was coined only in 1993 by the database pioneer Ted Codd, who also set out 12 rules for the product. As with many other applications, the concept has undergone several evolutionary stages.

The history of OLAP technology itself dates back to 1970, when Express, the first OLAP server, was released. It was acquired by Oracle in 1995 and subsequently became the backbone of the multidimensional computing engine for on-line analytical processing that the well-known vendor provided in its database products. In 1992, another well-known on-line analytical processing product, Essbase, was released by Arbor Software (acquired by Oracle in 2007).

In 1998, Microsoft released its MS Analysis Services on-line analytical processing server. This contributed to the popularity of the technology and prompted the development of other products. Today several world-renowned vendors offer OLAP applications, including IBM, SAS, SAP, Microsoft, Oracle (Essbase), and icCube.

Online analytical processing

OLAP is a tool that helps make decisions about planned events. A typical OLAP calculation can be more complex than simple data aggregation. Analytical Queries per Minute (AQM) is used as a standard benchmark for comparing the performance of different tools. These systems should shield users from the syntax of complex queries as much as possible and provide consistent response times for all queries, no matter how complex.

The main characteristics of OLAP are:

  1. Multidimensional data representation.
  2. Support for complex calculations.
  3. Time intelligence.

The multidimensional view provides the foundation for analytic processing through flexible access to corporate data. It allows users to analyze data in any dimension and at any level of aggregation.

Support for complex calculations is the backbone of OLAP software.

Time intelligence is used to assess the performance of an analytical application over time: for example, this month versus last month, or this month versus the same month last year.
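A minimal sketch of such a comparison in Python with pandas (the monthly figures below are synthetic):

    import numpy as np
    import pandas as pd

    months = pd.period_range("2019-01", periods=24, freq="M")
    totals = pd.Series(np.random.default_rng(0).integers(90, 160, 24),
                       index=months, dtype=float)

    mom = totals.pct_change(1)     # this month vs. last month
    yoy = totals.pct_change(12)    # this month vs. same month last year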

Multidimensional data structure

One of the main characteristics of on-line analytical processing is the multidimensional data structure. A cube can have multiple dimensions. With this model, the whole OLAP analysis process is easy for managers and executives, since the objects represented in the cells are real-world business objects. In addition, this data model allows users to handle not only structured arrays but also unstructured and semi-structured ones. All this makes OLAP cubes especially popular for data analysis and BI applications.

The main characteristics of OLAP systems:

  1. They use multidimensional data analysis methods.
  2. They provide advanced database support.
  3. They offer easy-to-use end-user interfaces.
  4. They support a client/server architecture.

One of the main components of the OLAP concept is the server side. In addition to aggregating and preprocessing data from a relational database, it provides advanced calculation and write-back options, additional functionality, advanced query capabilities, and more.

Different data models and tools are available depending on the application the user chooses, including real-time alerting, what-if scenarios, optimization, and complex OLAP reports.

The cube structure

The concept is based on the form of a cube. The arrangement of the data in it shows how OLAP adheres to the principle of multidimensional analysis, resulting in a data structure designed for fast and efficient analysis.

An OLAP cube is also referred to as a "hypercube." It is described as consisting of numeric facts (measures) categorized by dimensions (facets). Dimensions are the attributes that define a business problem; simply put, a dimension is a label that describes a measure. For example, in a sales report the measure would be sales volume, while the dimensions would include the sales period, the salesperson, the product or service, and the sales region. In manufacturing operations reporting, the measure might be total production cost and units of output, while the dimensions would be the date or time of production, the production stage or phase, and even the workers involved in the production process.

The OLAP data cube is the cornerstone of the system. The data in the cube is organized using either a star or a snowflake schema. In the center is a fact table containing the measures; it is linked to a number of dimension tables that describe how those measures can be analyzed. If a cube contains more than three dimensions, it is often referred to as a hypercube.
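A minimal star schema can be sketched with SQLite from Python; the table and column names below are illustrative, not taken from any particular product.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (                  -- fact table in the center
            product_id INTEGER REFERENCES dim_product(product_id),
            region_id  INTEGER REFERENCES dim_region(region_id),
            amount     REAL                        -- the measure
        );
        INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
        INSERT INTO dim_region  VALUES (1, 'North'), (2, 'South');
        INSERT INTO fact_sales  VALUES (1, 1, 100), (1, 2, 90), (2, 1, 150);
    """)

    # A typical aggregate query over the star schema: measures from the
    # fact table, analyzed along two dimensions.
    for row in con.execute("""
            SELECT p.name, r.name, SUM(f.amount)
            FROM fact_sales f
            JOIN dim_product p ON p.product_id = f.product_id
            JOIN dim_region  r ON r.region_id  = f.region_id
            GROUP BY p.name, r.name"""):
        print(row)

A snowflake schema differs only in that the dimension tables are themselves normalized into further lookup tables.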

One of the main properties of a cube is its static nature: the cube cannot be changed after it has been designed. Therefore, the process of assembling the cube and setting up the data model is a critical step toward proper data processing in the OLAP architecture.

Aggregating data

The use of aggregations is the main reason queries are processed much faster in OLAP tools than in OLTP. Aggregations are summaries of data pre-calculated during processing. All the members stored in the OLAP dimension tables define the queries the cube can answer.

In a cube, aggregated information is stored in cells whose coordinates are set by specific dimensions. The number of aggregates a cube can contain depends on all possible combinations of dimension members, so a typical cube in an application can contain an extremely large number of aggregates. Only the key aggregates, distributed across the analytic cube, are pre-calculated; this drastically reduces the time needed to compute the remaining aggregations when a query is executed against the data model.
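The combinatorial growth is easy to see with a back-of-the-envelope count in Python; the dimension sizes are invented, and each dimension is assumed to be either fixed to one of its members or rolled up to "All" (hence members + 1):

    from math import prod

    dimension_sizes = {"time": 24, "product": 500, "region": 40}
    potential_cells = prod(n + 1 for n in dimension_sizes.values())
    print(potential_cells)   # 513525 combinations even for this small cube

This is why real servers materialize only the key aggregates and derive the rest on demand.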

There are also two aggregation-related options that can improve the performance of the finished cube: creating an aggregation cache, and building aggregations based on the analysis of user queries.

Principle of operation

Typically, analysis of operational information obtained from transactions can be performed with a simple spreadsheet, where data values are represented in rows and columns. This works well given the two-dimensional nature of such data. In the case of OLAP there are differences, owing to the multidimensional data set; because the data is often obtained from different sources, a spreadsheet cannot always process it efficiently.

The cube solves this problem and keeps the OLAP data warehouse running in a logical and orderly way. Businesses collect data from multiple sources in a variety of formats, such as text files, multimedia files, Excel spreadsheets, Access databases, and even OLTP databases.

All data is collected in a repository filled directly from the sources. There, the raw information obtained from OLTP and other sources is cleaned of erroneous, incomplete, and inconsistent transactions.

After cleaning and transformation, the information is stored in a relational database and then loaded into a multidimensional OLAP server (the OLAP cube) for analysis. End users responsible for business applications, data mining, and other business operations access the information they need from the OLAP cube.
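The whole path can be compressed into a schematic Python sketch; the cleaning rules and field names are illustrative only.

    import pandas as pd

    # Raw transactions as they arrive from OLTP and other sources.
    raw = pd.DataFrame({
        "region": ["North", None, "South", "South"],
        "amount": [100.0, 80.0, -5.0, 120.0],
    })

    # Cleaning: drop incomplete and erroneous transactions.
    staged = raw.dropna(subset=["region"])
    staged = staged[staged["amount"] > 0]

    # Transformation into a cube-like aggregated structure for analysis.
    cube = staged.pivot_table(index="region", values="amount", aggfunc="sum")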

Benefits of the array model

OLAP delivers fast query performance through optimized storage, multidimensional indexing, and caching, which are significant system benefits. Additional benefits include:

  1. Smaller data size on disk.
  2. Automated computation of higher-level aggregates.
  3. Natural indexing provided by the array model.
  4. Efficient data extraction achieved through pre-structuring.
  5. Compactness for low-dimensional datasets.

The disadvantages of OLAP include the fact that some operations (the processing step) can be quite lengthy, especially with large amounts of information. This is usually addressed by performing only incremental processing, that is, examining only the data that has changed.

Basic analytical operations

Roll-up (drill-up) is also known as "consolidation." Roll-up involves gathering all the data that can be retrieved and computing it along one or more dimensions; more often than not, this requires applying a mathematical formula. As an OLAP example, consider a retail network with outlets in different cities. To identify patterns and anticipate future sales trends, sales data from all points is "rolled up" to the company's main sales department for consolidation and calculation.

Drill-down. This is the opposite of roll-up. The process starts with a large dataset and then breaks it down into smaller chunks, allowing users to view the details. In the retail chain example, an analyst would examine the sales data and view the individual brands or products considered best sellers in each outlet in different cities.

Slice and dice. This is a process in which the analytical operation involves two steps: taking a specific set of data from the OLAP cube (the "slicing" aspect of the analysis) and viewing it from different points of view or angles (the "dicing"). This can happen when all point-of-sale data is received and entered into the hypercube. The analyst slices a set of sales data from the OLAP cube; it is then examined when analyzing the sales of individual units in each region. Meanwhile, other users can focus on evaluating the cost-effectiveness of sales or the effectiveness of a marketing and advertising campaign.

Pivot. This rotates the data axes to change the presentation of the information.
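All four operations can be imitated on a toy cube in Python with pandas; the dimension names and figures are invented for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "city":    ["Oslo", "Oslo", "Bergen", "Bergen"],
        "product": ["A", "B", "A", "B"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "sales":   [10, 20, 30, 40],
    })
    cube = df.set_index(["city", "product", "quarter"])["sales"]

    rollup    = cube.groupby(level="city").sum()     # roll-up / consolidation
    drilldown = cube.loc["Oslo"]                     # drill-down into one city
    sliced    = cube.xs("Q1", level="quarter")       # slice: fix one dimension
    pivoted   = df.pivot_table(index="product", columns="city",
                               values="sales")       # pivot: rotate the axes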

Varieties of databases

Basically, a typical OLAP system implements analytical processing of multidimensional data using an OLAP cube, or any data cube, so that the analytical process can add dimensions as needed. Any information loaded into a multidimensional database is stored or archived and can be recalled when required.


Relational OLAP (ROLAP)

ROLAP is an advanced DBMS combined with multidimensional data mapping, used to perform standard relational operations.

Multidimensional OLAP (MOLAP)

MOLAP implements operations directly on multidimensional data.

Hybrid Online Analytical Processing (HOLAP)

In the HOLAP approach, aggregated totals are stored in a multidimensional database and detailed information is stored in a relational database. This ensures both the efficiency of the ROLAP model and the performance of the MOLAP model.

Desktop OLAP (DOLAP)

In Desktop OLAP, the user downloads a piece of the database locally to a desktop machine and analyzes it there. DOLAP is relatively cheap to deploy, as it offers very little functionality compared with other OLAP systems.

Web OLAP (WOLAP)

Web OLAP is an OLAP system accessible through a web browser. WOLAP has a three-tier architecture consisting of three components: a client, middleware, and a database server.

Mobile OLAP

Mobile OLAP helps users retrieve and analyze OLAP data using their mobile devices.

Spatial OLAP

SOLAP is designed to facilitate the management of both spatial and non-spatial data in a geographic information system (GIS).

There are lesser-known OLAP systems and technologies, but those above are the main ones currently used by large corporations, business structures, and even governments.

OLAP tools

Online analytical processing tools are very well represented on the Internet in both paid and free versions.

The most popular ones are:

  1. Dundas BI from Dundas Data Visualization: a browser-based business intelligence and data visualization platform that includes integrated dashboards, OLAP reporting, and data analytics.
  2. Yellowfin: a business intelligence platform offered as a single, integrated solution for companies of all industries and sizes; it can be customized for businesses in accounting, advertising, and agriculture, among others.
  3. ClicData: a business intelligence (BI) solution designed primarily for small and medium-sized businesses; the tool allows end users to create reports and dashboards. Board, also listed here, integrates business intelligence and corporate performance management and is a full-featured system serving mid-sized and enterprise-level companies.
  4. Domo: a cloud-based business management suite that integrates with multiple data sources, including spreadsheets, databases, social media, and any existing cloud or on-premises software solution.
  5. InetSoft Style Intelligence: a business intelligence software platform that lets users create dashboards, visual OLAP analyses, and mashup reports.
  6. Birst from Infor: a web-based business intelligence and analytics solution that brings insights from different teams together to help make informed decisions; the tool allows decentralized users to scale up the corporate team model.
  7. Halo: a comprehensive supply chain management and business intelligence system that assists in business planning and inventory forecasting for supply chain management; the system uses data from all sources - large, small, and intermediate.
  8. Chartio: a cloud-based business intelligence solution that gives founders, business teams, data analysts, and product teams tools to organize their day-to-day work.
  9. Exago BI: a web-based solution designed to be embedded in web applications; it enables companies of all sizes to provide their clients with ad hoc, operational, and interactive reporting.

Business Impact

Users will find OLAP in most business applications across different industries. The analysis is used not only by businesses but also by other stakeholders.

Some of its most common applications include:

  1. Marketing OLAP data analysis.
  2. Financial reporting, which covers sales and expenses, budgeting and financial planning.
  3. Business process management.
  4. Sales analysis.
  5. Database marketing.

These industries continue to grow, which means users will soon see more OLAP applications. Customizable multidimensional processing provides more dynamic analysis, and it is for this reason that OLAP systems and technologies are used to evaluate what-if and alternative business scenarios.

The use of OLAP systems allows you to automate the strategic level of organization management. OLAP (Online Analytical Processing) is a powerful data processing and research technology. Systems built on the basis of OLAP technology provide almost limitless possibilities for drawing up reports, performing complex analytical calculations, building forecasts and scenarios, and developing many options for plans.

Full-fledged OLAP systems appeared in the early 1990s as a result of the development of decision support information systems. They are designed to transform diverse, often disparate, data into useful information. OLAP systems can organize data according to a set of criteria, and the criteria need not have precisely defined characteristics.

OLAP systems have found their application in many issues of strategic management of an organization: business performance management, strategic planning, budgeting, development forecasting, preparation of financial statements, work analysis, simulation of the external and internal environment of the organization, data storage and reporting.

OLAP system structure

The OLAP system is based on the processing of multidimensional data arrays. Multidimensional arrays are designed so that each element of the array has many connections with other elements. To form a multidimensional array, the OLAP system must receive source data from other systems (for example, ERP or CRM systems), or through external input. The user of the OLAP system receives the necessary data in a structured form in accordance with his request. Based on the specified order of actions, you can imagine the structure of the OLAP system.

In general, the structure of an OLAP system consists of the following elements:

  • Database. The database is the source of information for the operation of the OLAP system. The type of database depends on the type of OLAP system and the OLAP server's algorithms. As a rule, relational databases, multidimensional databases, data warehouses, and the like are used.
  • OLAP server. It manages the multidimensional data structure and mediates between the database and the users of the OLAP system.
  • User applications. This element of the OLAP system structure manages user requests and presents the results of database access (reports, graphs, tables, etc.).

Depending on the way of organizing, processing and storing data, OLAP systems can be implemented on local computers of users or using dedicated servers.

There are three main ways of storing and processing data:

  • Locally. The data is placed on users' computers, and data processing, analysis, and management are performed at local workplaces. This OLAP system structure has significant drawbacks related to the speed of data processing, data security, and the limited use of multidimensional analysis.
  • Relational databases. These databases are used when the OLAP system works together with a CRM or ERP system. The data is stored on the server of those systems in the form of relational databases or data warehouses, and the OLAP server accesses these databases to form the necessary multidimensional structures and conduct analysis.
  • Multidimensional databases. In this case, the data is organized as a special data store on a dedicated server. All data operations are carried out on this server, which transforms the original data into multidimensional structures called OLAP cubes. Relational databases and/or client files serve as the data sources for forming an OLAP cube. The data server performs preliminary preparation and processing of the data, and the OLAP server works with the OLAP cube without direct access to the data sources (relational databases, client files, etc.).

Types of OLAP systems

Depending on the method of storing and processing data, all OLAP systems can be divided into three main types.


1. ROLAP (Relational OLAP). This kind of OLAP system works with relational databases: data is accessed directly in the relational database and stored in relational tables. Users can still carry out multidimensional analysis as in traditional OLAP systems; this is achieved through SQL tools and special queries.

One of the advantages of ROLAP is the ability to process large amounts of data more efficiently. Another advantage of ROLAP is the ability to efficiently process both numeric and text data.

The disadvantages of ROLAP include lower performance than traditional OLAP systems, since the data must be processed by the OLAP server at query time. Another disadvantage is the limited functionality imposed by SQL.


2. MOLAP (Multidimensional OLAP). This type of OLAP system belongs to the traditional systems. What distinguishes a traditional OLAP system from the others is the preliminary preparation and optimization of data. These systems usually use a dedicated server on which data preprocessing is carried out, and the data is formed into multidimensional arrays - OLAP cubes.

MOLAP systems are the most efficient in data processing, because they make it easy to reorganize and structure data for various user requests. MOLAP analytical tools allow complex calculations to be performed. Another advantage of MOLAP is the ability to issue queries quickly and get results quickly, achieved by pre-building the OLAP cubes.

The disadvantages of MOLAP systems include the limit on the amount of data that can be processed, and data redundancy: to build multidimensional cubes along different aspects, the data has to be duplicated.


3. HOLAP (Hybrid OLAP). Hybrid OLAP systems are a combination of ROLAP and MOLAP, an attempt to unite the advantages of the two: the use of multidimensional databases and the manageability of relational databases. HOLAP systems allow a large amount of data to be stored in relational tables, while the processed data is placed in pre-built multidimensional OLAP cubes. The advantages of this type of system are data scalability, fast data processing, and flexible access to data sources.
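The division of labor in HOLAP can be sketched schematically in Python; molap_cache and run_sql below are hypothetical stand-ins for the multidimensional store and the relational engine.

    def holap_query(dims, measure, molap_cache, run_sql):
        """Answer from pre-built aggregates when possible, else fall back to SQL."""
        key = (tuple(sorted(dims)), measure)
        if key in molap_cache:      # aggregate already materialized (MOLAP side)
            return molap_cache[key]
        # Otherwise aggregate the detailed rows in the relational store (ROLAP side).
        cols = ", ".join(dims)
        return run_sql(f"SELECT {cols}, SUM({measure}) "
                       f"FROM fact GROUP BY {cols}")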

There are other types of OLAP systems, but they are more a marketing ploy of manufacturers than independent types of OLAP system.

These types include:

  • WOLAP (Web OLAP). A kind of OLAP system with web interface support; in these OLAP systems, databases can be accessed through a web interface.
  • DOLAP (Desktop OLAP). This kind of OLAP system lets users download a database to a local workstation and work with it locally.
  • Mobile OLAP. A feature of OLAP systems that allows working with the database remotely from mobile devices.
  • SOLAP (Spatial OLAP). This type of OLAP system is designed for processing spatial data and appeared as a result of the integration of geographic information systems with OLAP systems. These systems can process data not only in alphanumeric format but also in the form of visual objects and vectors.

OLAP system advantages

The use of an OLAP system gives an organization the ability to forecast and analyze various situations related to current activities and development prospects. These systems can be seen as complements to enterprise-level automation systems. All the advantages of OLAP systems directly depend on the accuracy, reliability and volume of the source data.

The main advantages of the OLAP system are:

  • Consistency of source information and analysis results. With an OLAP system it is always possible to trace the source of the information and determine the logical connection between the obtained results and the original data, which reduces the subjectivity of the analysis results.
  • Multivariate analysis. OLAP systems make it possible to obtain many scenarios for the development of events based on a single set of source data. Using the analysis tools, situations can be simulated on the "what if" principle.
  • Control of detail. The level of detail in the presented results may vary according to the users' needs, without complex system reconfiguration or repeated calculations. A report can contain exactly the information needed for decision-making.
  • Revealing hidden dependencies. By building multidimensional relationships, it becomes possible to identify hidden dependencies in various processes or situations that affect production activities.
  • Creation of a single platform. OLAP systems make it possible to create a single platform for all forecasting and analysis processes in the enterprise; in particular, OLAP system data serve as the basis for budget forecasts, sales forecasts, purchasing forecasts, strategic development plans, and so on.


Course work

by discipline: Databases

Theme: OLAP Technology

Completed:

Chizhikov Alexander Alexandrovich

Introduction

1. Classification of OLAP products

2. OLAP client - OLAP server: pros and cons

3. The core of the OLAP system

3.1 Design principles

Conclusion

List of sources used

Applications

Introduction

It is difficult to find a person in the computer world who, at least at an intuitive level, does not understand what databases are and why they are needed. Unlike traditional relational DBMSs, the concept of OLAP is not so widely known, although almost everyone has heard the mysterious term "OLAP cubes." So what is OnLine Analytical Processing?

OLAP is not a single software product, not a programming language, and not even a specific technology. If you try to cover OLAP in all its manifestations, it is a set of concepts, principles, and requirements underlying software products that make it easier for analysts to access data. Although hardly anyone would disagree with such a definition, it is doubtful that it brings non-specialists even one iota closer to understanding the subject. Therefore, in the quest for knowledge of OLAP, it is better to go another way. First, we need to find out why analysts need to have their access to data specially facilitated.

The point is that analysts are special consumers of corporate information. The analyst's task is to find patterns in large data sets. Therefore, the analyst will not pay attention to a single fact, he needs information about hundreds and thousands of events. By the way, one of the essential points that led to the emergence of OLAP is performance and efficiency. Imagine what happens when an analyst needs information and there is no OLAP in the enterprise. The analyst independently (which is unlikely) or with the help of a programmer makes the appropriate SQL query and receives the data of interest in the form of a report or exports it to a spreadsheet. In this case, a great many problems arise. Firstly, the analyst is forced to do something other than his own work (SQL programming) or wait for the programmers to complete the task for him - all this negatively affects labor productivity, the rate of heart attack and stroke rises, and so on. Secondly, one single report or table, as a rule, does not save the giants of thought and the fathers of Russian analysis - and the whole procedure will have to be repeated over and over again. Thirdly, as we have already found out, analysts do not ask about trifles - they need everything at once. This means (although the technique is moving forward by leaps and bounds) that the server of the corporate relational DBMS, which the analyst turns to, can think deeply and for a long time, blocking the rest of the transactions.

The concept of OLAP was born precisely to solve such problems. OLAP cubes are essentially meta-reports. By cutting meta-reports (that is, cubes) along dimensions, the analyst obtains the "usual" two-dimensional reports that interest him (these are not necessarily reports in the usual sense of the term; what matters is data structures with the same functions). The advantages of cubes are obvious: the data needs to be requested from the relational DBMS only once, when the cube is built. Since analysts, as a rule, do not work with information that is supplemented and changed on the fly, the generated cube remains relevant for quite a long time. Thanks to this, not only are interruptions in the operation of the relational DBMS server excluded (there are no queries with thousands or millions of response rows), but the speed of access to the data for the analyst himself also increases sharply. In addition, as noted, performance is improved by computing the subtotals of hierarchies and other aggregated values at the time the cube is built.

Of course, you have to pay for increasing productivity in this way. It is sometimes said that the data structure simply "explodes": an OLAP cube can take up tens or even hundreds of times more space than the original data.

Now that we have figured out a little about how OLAP works and what it is for, it is nevertheless worth formalizing our knowledge somewhat and stating the OLAP criteria without simultaneous translation into ordinary human language. These criteria (12 in total) were formulated in 1993 by E. F. Codd, the creator of the relational DBMS concept and, concurrently, of OLAP. We will not consider them directly, since they were later reworked into the so-called FASMI test, which defines the requirements for OLAP products. FASMI is an acronym of the names of the test's items:

Fast. The system should respond to a user request in about five seconds on average, with most requests handled within one second and the most complex within twenty seconds. Recent studies have shown that a user begins to doubt the success of a request if it takes more than thirty seconds.

Analysis. The system must cope with any logical and statistical analysis typical of business applications and ensure that the results are saved in a form accessible to the end user. Analysis tools can include procedures for time series analysis, cost allocation, currency conversion, modeling changes in organizational structures, and others.

Shared. The system should provide ample means for differentiating access to data and for the simultaneous work of many users.

Multidimensional. The system should provide a conceptually multidimensional representation of the data, including full support for multiple hierarchies.

Information. The power of a software product is characterized by the amount of input data it can process. Different OLAP systems have different capacities: advanced OLAP solutions can handle at least a thousand times more data than the least powerful ones. When choosing an OLAP tool, a number of factors should be considered, including data duplication, the RAM required, disk space usage, performance, integration with the data warehouse, and more.

1. Classification of OLAP products

So, the essence of OLAP is that the initial information for analysis is presented in the form of a multidimensional cube, with the ability to manipulate it arbitrarily and obtain the necessary information slices - reports. At the same time, the end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) in different slices (dimensions) and allows interactive control over the calculations and the form of the report. These operations are performed by the OLAP engine (or OLAP computing engine).

To date, many products implementing OLAP technologies have been developed around the world. To make it easier to navigate among them, classifications of OLAP products are used: by the way data is stored for analysis, and by the location of the OLAP engine. Let's take a closer look at each category of OLAP products.

I'll start with the classification by the way data is stored. Recall that multidimensional cubes are built on the basis of source and aggregate data. Both source and aggregate data for cubes can be stored in relational as well as multidimensional databases. Therefore, there are currently three data storage approaches: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP), and HOLAP (Hybrid OLAP). Accordingly, OLAP products are divided into three corresponding categories by data storage method:

1. In the case of MOLAP, the source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, the source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. The conversion of data from the relational database into multidimensional cubes occurs at the request of the OLAP tool.

3. In the case of using the HOLAP architecture, the original data remains in the relational database, and the aggregates are placed in the multidimensional one. The OLAP cube is built on demand by the OLAP tool based on relational and multidimensional data.

The next classification is based on the location of the OLAP engine. On this basis, OLAP products are divided into OLAP servers and OLAP clients:

In server OLAP tools, the calculations and storage of aggregate data are performed by a separate process - the server. The client application receives only the results of queries against multidimensional cubes that are stored on the server. Some OLAP servers support storing data only in relational databases, some only in multidimensional databases. Many modern OLAP servers support all three storage methods: MOLAP, ROLAP, and HOLAP.

The OLAP client works differently. Multidimensional cube building and OLAP calculations are performed in the memory of the client computer. OLAP clients are also divided into ROLAP and MOLAP. And some may support both types of data access.

Each of these approaches has its own pros and cons. Contrary to popular belief about the advantages of server-side tools over client-side tools, in a number of cases, using an OLAP client for users can be more efficient and profitable than using an OLAP server.

2. OLAP client - OLAP server: pros and cons

When building an information system, OLAP functionality can be implemented both by server and client OLAP tools. In practice, the choice is the result of a trade-off between performance and software cost.

The amount of data is determined by a combination of the following characteristics: the number of records, the number of dimensions, the number of dimension members, the length of the dimensions, and the number of facts. It is known that an OLAP server can handle larger amounts of data than an OLAP client of equal computing power, because the OLAP server stores on hard drives a multidimensional database containing pre-computed cubes.

When OLAP operations are executed, client programs send queries to the OLAP server in an SQL-like language, receiving not the entire cube but only its displayed fragments. An OLAP client, by contrast, must hold the entire cube in RAM at the time of operation; in the case of a ROLAP architecture, it must first load into memory the entire data array used to calculate the cube. Moreover, as the number of dimensions, facts, or dimension members grows, the number of aggregates grows exponentially. Thus, the amount of data an OLAP client can process is directly proportional to the amount of RAM on the user's PC.

Note, however, that most OLAP clients provide distributed computing. Therefore, the number of processed records that limits a client OLAP tool is not the volume of primary data in the corporate database but the size of the aggregated sample drawn from it. The OLAP client generates a query to the DBMS describing the filtering conditions and the algorithm for the preliminary grouping of the primary data. The server finds and groups the records and returns a compact selection for further OLAP calculations. The size of this sample can be tens or hundreds of times smaller than the volume of primary, non-aggregated records. Consequently, such an OLAP client's demands on PC resources are significantly reduced.
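The kind of pre-aggregating query such a client might generate can be sketched as follows; the fact table and fields are hypothetical, and real products construct the statement from the semantic layer rather than by hand.

    # A sketch of the query an OLAP client could send to the DBMS.
    query = """
        SELECT region, product, strftime('%Y-%m', sale_date) AS month,
               SUM(amount) AS amount
        FROM   fact_sales
        WHERE  sale_date >= '2021-01-01'     -- filtering conditions
        GROUP  BY region, product, month     -- preliminary grouping
    """
    # The DBMS returns this compact grouped sample, and all further
    # OLAP operations run on it in the client's RAM.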

In addition, human perception imposes limits on the number of dimensions: it is known that the average person can simultaneously operate with 3-4, at most 8, dimensions. With a larger number of dimensions in a dynamic table, perceiving the information becomes much more difficult. This factor should be taken into account when estimating the RAM an OLAP client might need.

The length of dimensions also affects the size of the OLAP tool's address space used when calculating the OLAP cube: the longer the dimensions, the more resources are required to pre-sort the multidimensional array, and vice versa. Short dimensions in the source data are another argument in favor of the OLAP client.

This characteristic is determined by the two factors discussed above: the volume of processed data and the power of the computers. As the number of dimensions, for example, increases, the performance of all OLAP tools decreases because of the significant growth in the number of aggregates, but the rate of decline differs. Let's demonstrate this dependence on a graph.

Figure 1. Dependence of the performance of client and server OLAP tools on the increase in data volume

The speed characteristics of the OLAP server are less sensitive to data growth. This is explained by the different technologies the OLAP server and the OLAP client use to process user requests. For example, during a drill-down operation, the OLAP server accesses the stored data and pulls up the data of that "branch," whereas the OLAP client calculates the entire set of aggregates at load time. Up to a certain amount of data, however, the performance of server and client tools is comparable. For OLAP clients that support distributed computing, the range of comparable performance can extend to data volumes that cover the OLAP analysis needs of a large number of users. This is confirmed by the results of internal testing of MS OLAP Server and the OLAP client "Kontur Standard." The test was performed on an IBM PC (Pentium Celeron 400 MHz, 256 MB) on a sample of 1 million unique (i.e., aggregated) records with 7 dimensions containing from 10 to 70 members. The cube load time in both cases did not exceed 1 second, and various OLAP operations (drill up, drill down, move, filter, etc.) were performed in hundredths of a second.

When the sample size exceeds the amount of RAM, swapping to disk begins and the OLAP client's performance drops sharply. Only from this point can one speak of an advantage of the OLAP server.

It should be remembered that this "inflection point" marks the boundary of a sharp rise in the cost of an OLAP solution. For each specific user's tasks, this point is easily determined by OLAP client performance tests, which can be obtained from the developer company.

In addition, the cost of an OLAP server solution rises as the number of users increases. The point is that the OLAP server performs calculations for all users on one computer; accordingly, the more users, the more RAM and processing power are needed. Thus, if the volumes of processed data lie in the area of comparable performance of server and client systems, then, other things being equal, using an OLAP client will be more profitable.

The use of an OLAP server in the "classical" ideology involves unloading data from relational DBMSs into a multidimensional database. The upload is performed over a certain period, so the OLAP server's data does not reflect the current state. Only OLAP servers that support the ROLAP mode of operation are free of this drawback.

Likewise, a variety of OLAP clients enable ROLAP and Desktop architectures with direct database access. This provides on-line analysis of the raw data.

The OLAP server has minimal requirements for the power of client terminals. Objectively, the OLAP client's requirements are higher, since it performs calculations in the RAM of the user's PC. The state of a particular organization's hardware fleet is the most important indicator to consider when choosing an OLAP tool. But here, too, there are pluses and minuses. An OLAP server does not use the enormous processing power of modern personal computers; if an organization already has a fleet of modern PCs, it is inefficient to use them only as display terminals while incurring additional costs for a central server.

If the power of users' computers is low, the OLAP client will run slowly or not work at all. Buying one powerful server can be cheaper than upgrading all the PCs.

It is useful here to take into account trends in hardware development. Since the amount of data for analysis is almost constant, the stable growth of PC power will lead to the expansion of the capabilities of OLAP clients and their displacement of OLAP servers into the segment of very large databases.

When using the OLAP server, only the data for display is transmitted to the client PC over the network, while the OLAP client receives the entire amount of the primary sample data.

Therefore, where the OLAP client is used, the network traffic will be higher.

But, when using an OLAP server, user operations, for example, drill down, generate new queries to the multidimensional database, and, therefore, a new data transfer. The execution of OLAP operations by an OLAP client is performed in RAM and, accordingly, does not cause new data streams in the network.

It should also be noted that modern networking hardware provides a high level of bandwidth.

Therefore, in the overwhelming majority of cases, the analysis of a "medium" database using the OLAP client will not slow down the user's work.

The cost of an OLAP server is quite high. To this must be added the cost of a dedicated computer and the ongoing costs of administering the multidimensional database. In addition, the implementation and maintenance of an OLAP server require highly qualified personnel.

The cost of an OLAP client is an order of magnitude lower than the cost of an OLAP server. Administration and additional technical equipment for the server is not required. When implementing an OLAP client, there are no high requirements for the qualifications of personnel. An OLAP client can be deployed much faster than an OLAP server.

Development of analytical applications with OLAP client tools is fast and requires no special training. A user who knows the physical implementation of the database can develop an analytical application on his own, without involving an IT specialist. With an OLAP server, one has to learn two different systems, sometimes from different vendors: one for creating cubes on the server and one for developing the client application. The OLAP client provides a single visual interface for describing cubes and customizing their user interfaces.

Let's look at the process of creating an OLAP application using a client tool.

Figure 2. Creating an OLAP application using the ROLAP client tool

ROLAP clients operate on the basis of a pre-defined semantic layer that hides the physical structure of the source data. The data sources can be local tables, RDBMSs, etc.; the list of supported sources is product-specific. After that, the user can independently manipulate objects he understands in terms of the subject area to create cubes and analytical interfaces.

The client of an OLAP server works differently. When creating cubes on an OLAP server, the user manipulates physical descriptions of the database and then defines user-level descriptions in the cube itself; the OLAP server client is configured only for the cube.

Let us explain the principle of the ROLAP client using the example of creating a dynamic sales report (see Figure 2). Let the initial data for analysis be stored in two tables: a table of goods and a table of deals.

When the semantic layer is created, the data sources - the goods and deals tables - are described in terms understandable to the end user and become "Products" and "Deals". The "ID" field of the "Products" table is renamed to "Code", "Name" to "Product", and so on.

Then the "Sales" business object is created. A business object is a flat table on the basis of which a multidimensional cube is formed. When the business object is created, the "Products" and "Deals" tables are joined by the product "Code" field. Since not all table fields need to be displayed in the report, the business object uses only the "Product", "Date" and "Amount" fields.
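To make the example concrete, the flat table behind the "Sales" business object could be produced by a query of roughly the following form - a minimal sketch in which the physical table and field names (Products, Deals, ProductID, DealDate) are assumed for illustration; a real ROLAP client generates such SQL itself from the semantic-layer description:

const
  { Hypothetical query behind the "Sales" business object. }
  SalesBusinessObjectSQL =
    'SELECT p.Name AS Product, d.DealDate, d.Amount '
    + 'FROM Products p '
    + 'JOIN Deals d ON d.ProductID = p.ID';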

Next, an OLAP report is created on the basis of the business object. The user selects the business object and drags its attributes into the column or row area of the report table. In our example, a report on sales of goods by month has been created from the "Sales" business object.

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point, the ROLAP client accesses the data in its cache, whereas the client of an OLAP server generates a new query to the multidimensional database. For example, by applying a filter by product in the sales report, we can obtain a report on sales of the products of interest.

All OLAP application settings can be stored in a dedicated metadata repository, in an application, or in a multidimensional database system repository. The implementation depends on the specific software product.

So, in what cases can the use of an OLAP client for users be more efficient and profitable than using an OLAP server?

The economic feasibility of using an OLAP server arises when the amount of data is very large and beyond the capacity of an OLAP client; otherwise the use of the latter is more justified. In that case, the OLAP client combines high performance with low cost.

Powerful analyst PCs are another reason for using OLAP clients. When using an OLAP server, these capacities are not used. Among the advantages of OLAP clients are the following:

The costs of implementing and maintaining an OLAP client are significantly lower than the costs of an OLAP server.

When using an OLAP client with a built-in OLAP engine, the data is transferred over the network once. No new data streams are generated when OLAP operations are performed.

Configuring ROLAP clients is simplified by eliminating an intermediate link - creating a multidimensional database.

3. The core of the OLAP system

3.1 Design principles


From what has been said, it is clear that the OLAP engine is one of the most popular data analysis methods today. There are two main approaches to implementing it. The first is called Multidimensional OLAP (MOLAP) - implementing the engine with a multidimensional database on the server side; the second is called Relational OLAP (ROLAP) - building cubes "on the fly" from SQL queries to a relational DBMS. Each approach has its pros and cons; their comparative analysis is beyond the scope of this work. Here, only the implementation of the core of a desktop ROLAP module will be described.

This problem arose after using a ROLAP system built on the Decision Cube components included in Borland Delphi. Unfortunately, this set of components showed poor performance on large amounts of data. The problem can be mitigated by cutting off as much data as possible before feeding it into the cubes, but this is not always enough.

On the Internet and in the press, you can find a lot of information about OLAP systems, but almost nowhere is it said about how it works internally.

Scheme of work:

The general scheme of the desktop OLAP system can be represented as follows:

Diagram 3. Desktop OLAP system operation

The work algorithm is as follows:

1. Fetching the data as a flat table or as the result of executing an SQL query.

2. Caching the data and transforming it into a multidimensional cube.

3. Displaying the constructed cube in a crosstab, a chart, etc. In general, an arbitrary number of views can be connected to one cube.

Let's consider how such a system can be arranged internally. We will start from the side that can be seen and touched, that is, from the views. The views used in OLAP systems most often come in two types: crosstabs and charts. Consider the crosstab, the primary and most common way to display a cube.

In the figure below, rows and columns containing aggregated results are shown in yellow, cells containing facts are shown in light gray, and cells containing dimension data are highlighted in dark gray.

Thus, the table can be divided into the following kinds of elements, with which we will work below.

When filling the matrix with facts, we proceed as follows:

1. Based on the dimension values, determine the coordinates of the element being added in the matrix.

2. Determine the coordinates of the totals columns and rows affected by the added element.

3. Add the element to the matrix and to the corresponding totals columns and rows.

It should be noted that the resulting matrix will be very sparse, which is why organizing it as a two-dimensional array (the option lying on the surface) is not only irrational but most likely impossible: the matrix is so large that no amount of RAM will suffice to store it. For example, if our cube contains sales information for one year and has only 3 dimensions - Customers (250), Products (500) and Date (365) - then we get a fact matrix of 250 x 500 x 365 = 45,625,000 elements, even though only a few thousand of them may actually be filled. Moreover, the larger the number of dimensions, the sparser the matrix will be.

Therefore, to work with this matrix, special mechanisms for handling sparse matrices are needed. Various options for organizing a sparse matrix are possible; they are fairly well described in the programming literature, for example, in the first volume of Donald Knuth's classic The Art of Computer Programming.
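As an illustration, here is a minimal sketch in Delphi (the language of the code later in this work) of one such organization: a dictionary keyed by the packed cell coordinates, so that memory is spent only on cells that actually contain facts. The key format and aggregation by summation are assumptions made for illustration:

program SparseCubeDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Generics.Collections;

var
  Cells: TDictionary<string, Double>;  // only non-empty cells are stored

procedure AddFact(Customer, Product, Day: Integer; Value: Double);
var
  Key: string;
  Sum: Double;
begin
  Key := Format('%d|%d|%d', [Customer, Product, Day]);
  if not Cells.TryGetValue(Key, Sum) then
    Sum := 0.0;                        // the cell did not exist yet
  Cells.AddOrSetValue(Key, Sum + Value);
end;

begin
  Cells := TDictionary<string, Double>.Create;
  try
    AddFact(17, 433, 211, 1500.0);
    AddFact(17, 433, 211, 250.0);      // aggregated into the same cell
    WriteLn('Stored cells: ', Cells.Count);  // 1, not 45,625,000
  finally
    Cells.Free;
  end;
end.

For the example above, such a structure stores a few thousand entries instead of 45,625,000 array elements.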

Let us now consider how to determine the coordinates of a fact, knowing the corresponding dimension values. To do this, let's take a closer look at the header structure.

Here one can easily find a way to determine the numbers of the corresponding cell and of the totals into which it falls. Several approaches can be proposed. One is to use a tree to find the matching cells; such a tree can be built by iterating over the selection. Alternatively, a simple recurrence formula can be defined for computing the required coordinate.
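For instance, if dimension i has Ni unique values and the coordinate of the element along it is Ci (numbered from zero), the cell number is given by the recurrence Index := Index * Ni + Ci applied over all dimensions. A sketch (the handling of totals rows and columns, which need numbers of their own, is omitted):

// Coords[i] is the coordinate along dimension i (0-based),
// Sizes[i] is the number of unique values of dimension i.
function LinearIndex(const Coords, Sizes: array of Integer): Int64;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Coords) do
    Result := Result * Sizes[i] + Coords[i];
end;

For the example above, LinearIndex([C1, C2, C3], [250, 500, 365]) ranges from 0 to 45,624,999 - exactly the dense matrix size computed earlier - which is why such a number is useful as a key but not as an index into a real array.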

The data stored in the table must be transformed before it can be used. To improve performance when building the hypercube, it is desirable to find the unique elements stored in the columns that serve as cube dimensions. In addition, facts can be pre-aggregated for records that have the same dimension values. As mentioned above, it is the unique values in the dimension fields that matter to us. The following structure can then be proposed for storing them:

Scheme 4. Structure of storage of unique values

Using such a structure significantly reduces the memory requirement, which is quite relevant because, to increase the speed of work, it is desirable to keep the data in RAM. Moreover, only the array of elements need be kept there, while their values can be dumped to disk, since they are required only when the crosstab is displayed.
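In Delphi, the simplest sketch of such a dictionary of unique dimension values is a sorted string list: duplicates are discarded on insertion, and the position of a value in the sorted list can then serve as its coordinate. This is an illustration, not the structure of any particular product:

program UniqueValuesDemo;

uses
  Classes, SysUtils;

var
  Dict: TStringList;
  Coord: Integer;
begin
  Dict := TStringList.Create;
  try
    Dict.Sorted := True;            // keep the dictionary ordered
    Dict.Duplicates := dupIgnore;   // each value is stored only once
    Dict.Add('Pear');
    Dict.Add('Apple');
    Dict.Add('Apple');              // duplicate, silently ignored
    // After sorting, the position of a value is its coordinate:
    Coord := Dict.IndexOf('Apple'); // 0
    WriteLn(Format('"%s" -> coordinate %d', [Dict[Coord], Coord]));
  finally
    Dict.Free;
  end;
end.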

The ideas described above were the basis for creating the CubeBase component library.

Scheme 5. The structure of the CubeBase component library

TCubeSource performs caching and transformation of data into the internal format, as well as preliminary data aggregation. The TCubeEngine component calculates the hypercube and performs operations with it; in fact, it is the OLAP engine that transforms a flat table into a multidimensional dataset. The TCubeGrid component handles the display of the crosstab and controls the display of the hypercube. TCubeChart allows you to see the hypercube in the form of charts, and the TCubePivote component controls the work of the cube core.

So, we have examined the architecture and interaction of components that can be used to build an OLAP engine. Now let's take a closer look at the internal structure of the components.

The first stage of the system's operation is loading the data and converting it to the internal format. A logical question is why this is necessary, since one could simply use the data from the flat table, scanning it when building a cube slice. To answer it, let us consider the structure of the table from the point of view of the OLAP engine. For OLAP systems, table columns are either facts or dimensions, and the logic of working with these columns differs. In a hypercube, the dimensions are in fact axes, and the dimension values are coordinates on those axes. The cube is filled very unevenly: there are combinations of coordinates that correspond to no records, and combinations that correspond to several records in the original table, with the former situation being more common; the cube resembles a universe - empty space with clusters of points (facts) here and there. Thus, if during the initial load we pre-aggregate the data, that is, combine the records that have the same dimension values while computing preliminary aggregated fact values, then later we will have to work with fewer records, which increases the speed of work and reduces the RAM requirements.

To construct slices of the hypercube, we need the following capabilities: determining the coordinates (in fact, the dimension values) for table records, and determining the records that have specific coordinates (dimension values). Let's consider how these capabilities can be implemented. The easiest way to store the hypercube is to use a database of its own internal format.

The transformations can be schematically represented as follows:

Scheme 6. Converting an internal format database to a normalized database

That is, instead of one table, we got a normalized database. "Normalization slows the system down," database specialists may say, and they would be absolutely right - in the case when we need to obtain the values of dictionary elements (in our case, dimension values). But the point is that at the slice-construction stage we do not need these values at all. As mentioned above, we are only interested in the coordinates in our hypercube, so we will define coordinates for the dimension values. The simplest approach is to number the values; to make the numbering unambiguous within one dimension, we first sort the lists of dimension values (dictionaries, in database terms) alphabetically. In addition, we renumber the facts, which are pre-aggregated. We get the following scheme:

Scheme 7. Renumbering of the normalized database to determine the coordinates of measurement values

Now it remains to link the elements of the different tables to each other. In relational database theory, this is done using special intermediate tables. For us it is enough to associate with each record in the dimension tables a list whose elements are the numbers of the facts formed with the participation of this dimension value (that is, to determine all the facts that have the same value of the coordinate described by this dimension). For facts, accordingly, we associate with each record the values of the coordinates at which it is located in the hypercube. From here on, by the coordinates of a record in the hypercube we will mean the numbers of the corresponding records in the tables of dimension values. For our hypothetical example, we then get the following set, which defines the internal representation of the hypercube:

Diagram 8. Internal representation of a hypercube

This will be our internal representation of the hypercube. Since we are building it not for a relational database, we simply use variable-length fields as the link fields for dimension values (we could not do this in an RDB, where the number of table columns is predefined).
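A minimal sketch of this internal representation in Delphi: each fact stores its coordinates, and each dimension value stores a variable-length list of the numbers of the facts in which it participates; dynamic arrays play the role of the variable-length link fields mentioned above (all type names here are illustrative):

type
  TFactRecord = record
    Coords: array of Integer;  // one coordinate per dimension
    Value: Double;             // pre-aggregated fact value
  end;

  TDimValueRecord = record
    Value: string;             // the dimension value itself
    FactNos: array of Integer; // numbers of the facts with this coordinate
  end;

  THypercube = record
    Facts: array of TFactRecord;
    // one dictionary (array of dimension values) per dimension
    Dimensions: array of array of TDimValueRecord;
  end;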

We could try to use a set of temporary tables to implement the hypercube, but this method provides too low a performance (the set of Decision Cube components is an example), so we will use our own data storage structures.

To implement a hypercube, we need to use data structures that will provide maximum performance and minimum memory consumption. Obviously, we will have the main structures for storing dictionaries and a fact table. Consider the tasks that the dictionary should perform at maximum speed:

checking if an element is in the dictionary;

adding an item to the dictionary;

search for record numbers with a specific coordinate value;

search for the coordinate by dimension value;

search for the dimension value by its coordinate.

Various types and data structures can be used to fulfill these requirements - for example, arrays of structures. In a real case, these arrays need additional indexing mechanisms, which will increase the speed of loading data and retrieving information.

To optimize the work of the hypercube, it is necessary to determine which tasks must be solved first and by what criteria the quality of work should be improved. The main thing for us is to increase the speed of the program while keeping the required amount of RAM reasonably small. Performance can be improved by introducing additional data access mechanisms, such as indexing; unfortunately, this increases the RAM overhead. Therefore, we will determine which operations must be performed with the greatest speed. To do this, consider the individual components that implement the hypercube. These components are of two main types: the dimension and the fact table. Typical tasks for a dimension are:

adding a new value;

determination of the coordinate by the dimension value;

determination of the dimension value by its coordinate.

When adding a new element value, we must check whether such a value already exists; if it does, we do not add a new one but use the existing coordinate, otherwise we add a new element and determine its coordinate. This requires a way to quickly check for the presence of the required element (the same task arises when determining the coordinate by element value). The best way to do this is hashing, and the optimal structure is a hash tree holding references to the elements, where the elements are the rows of the dimension dictionary. The structure of a dimension value can then be represented as follows:

type
  PFactLink = ^TFactLink;
  TFactLink = record
    FactNo: integer;     // index of the fact in the fact table
    Next: PFactLink;     // next list element (implied by the list mentioned below)
  end;

  TDimensionRecord = record
    Value: string;       // dimension value
    Index: integer;      // coordinate value
    FactLink: PFactLink; // pointer to the head of the list of fact table elements
  end;

References to the unique elements are stored in the hash tree. In addition, we need to solve the inverse problem: determining the dimension value from its coordinate. For maximum performance, direct addressing should be used: one could keep another array in which the index is the dimension coordinate and the value is a reference to the corresponding dictionary entry. However, it is simpler (and saves memory) to arrange the array of elements so that an element's index is its coordinate.

The organization of the array that implements the list of facts presents no special problems, owing to its simple structure. The only remark is that it is desirable to compute all the aggregation methods that may be needed and that can be calculated incrementally (for example, the sum).
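For instance, the sum, count, minimum and maximum can all be maintained incrementally as facts arrive, so no second pass over the data is required; the average is then obtained on demand as the quotient of two stored aggregates. A sketch:

type
  TAggregates = record
    Count: Integer;
    Sum, Min, Max: Double;
  end;

// Updates all incrementally computable aggregates with one new fact value.
procedure AddValue(var A: TAggregates; V: Double);
begin
  if A.Count = 0 then
  begin
    A.Sum := V;
    A.Min := V;
    A.Max := V;
  end
  else
  begin
    A.Sum := A.Sum + V;
    if V < A.Min then A.Min := V;
    if V > A.Max then A.Max := V;
  end;
  Inc(A.Count);
  // the average need not be stored: it is Sum / Count on demand
end;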

So, we have described a way of storing data in the form of a hypercube. It allows us to form a set of points in multidimensional space based on the information in the data warehouse. For a person to be able to work with these data, they must be presented in a form convenient for processing. The main forms of presentation are the pivot table and charts, and both are in fact projections of the hypercube. To ensure maximum efficiency when constructing these views, let us start from what the projections represent. We begin with the pivot table as the most important for data analysis.

Let's find ways to implement such a structure. A pivot table consists of three parts: row headers, column headers, and the table of aggregated fact values. The simplest way to represent the fact table is as a two-dimensional array whose dimensions can be determined by building the headers. Unfortunately, the simplest method is also the most inefficient: since the table will be very sparse, memory will be used extremely wastefully, and only very small cubes could be built before memory runs out. Thus, we need a data structure that provides the maximum speed of searching for / adding a new element together with the minimum consumption of RAM. This is the so-called sparse matrix, about which more can be read in Knuth. Various ways of organizing the matrix are possible; to choose the one that suits us, let us first consider the structure of the table headers.

Headers have a clear hierarchical structure, so it is natural to store them in a tree. The structure of a tree node can be schematically depicted as follows:

Appendix C

Here it is logical to store, as the dimension value, a reference to the corresponding element of the dimension table of the multidimensional cube. This reduces the memory needed to store the slice and speeds up the work. The parent and child nodes are also accessed through references.

To add an element to the tree, we must have information about its location in the hypercube. This information is its coordinate, which is stored in the dictionary of dimension values. Let's consider the scheme of adding an element to the header tree of a pivot table. The values of the dimension coordinates serve as the initial information. The order in which the dimensions are listed is determined by the desired aggregation method and matches the hierarchy levels of the header tree. As a result, we need to obtain the list of columns or rows of the pivot table to which the element must be added.

Appendix D
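A sketch of one possible header tree layout: each node stores a coordinate and links to its first child and next sibling, and adding an element descends one level per dimension, reusing existing nodes and creating the missing ones. The first-child/next-sibling representation is an assumption made for illustration:

type
  PHeaderNode = ^THeaderNode;
  THeaderNode = record
    Coord: Integer;        // coordinate of a dimension value
    Child: PHeaderNode;    // first node of the next hierarchy level
    Sibling: PHeaderNode;  // next node on the same level
  end;

// Finds or creates the child of Parent with the given coordinate.
function FindOrAddChild(Parent: PHeaderNode; Coord: Integer): PHeaderNode;
begin
  Result := Parent^.Child;
  while (Result <> nil) and (Result^.Coord <> Coord) do
    Result := Result^.Sibling;
  if Result = nil then
  begin
    New(Result);
    Result^.Coord := Coord;
    Result^.Child := nil;
    Result^.Sibling := Parent^.Child;  // prepend to the children list
    Parent^.Child := Result;
  end;
end;

// Descends from the root, one level per dimension coordinate.
function AddHeaderPath(Root: PHeaderNode;
  const Coords: array of Integer): PHeaderNode;
var
  i: Integer;
begin
  Result := Root;
  for i := 0 to High(Coords) do
    Result := FindOrAddChild(Result, Coords[i]);
end;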

As the initial data for determining this structure, we use the dimension coordinates. For definiteness, we will assume that we are determining the column of interest in the matrix (how a row is determined will be considered a little later, since other data structures are more convenient there; the reason for this choice is also discussed below). As coordinates we take integers: the numbers of the dimension values, which can be determined as described above.

So, after performing this procedure, we obtain an array of references to the columns of the sparse matrix. Now we need to perform the necessary actions with the rows: inside each column, find the required element and add the corresponding value there. For each dimension used, we need to know the number of its unique values and the actual set of these values.

Now let's consider how to represent the values within the columns, that is, how to determine the required row. Several approaches can be used here. The simplest would be to represent each column as a vector, but since it will be very sparse, memory would be used extremely inefficiently. To avoid this, we will use data structures that represent sparse one-dimensional arrays (vectors) more efficiently. The simplest of them is an ordinary list, singly or doubly linked, but it is uneconomical in terms of element access. We will therefore use a tree, which provides faster access to the elements.

For example, one could use exactly the same tree as for the columns, but then a separate tree would have to be created for each column, which entails significant memory and processing overhead. Let's be a little trickier: we will keep one tree storing all the combinations of dimensions used in rows. This tree is identical to the one described above, but its elements are not pointers to rows (which do not exist as such) but their indices, and the index values themselves are of no interest to us and are used only as unique keys. These keys are then used to find the required element within a column. The columns themselves are most easily represented as ordinary binary trees. The resulting structure can be represented as follows:

Figure 9. Image of a pivot table in the form of a binary tree

The row numbers can be determined by the same procedure as the one used above to determine the columns of the pivot table. The row numbers are unique within one pivot table and identify the elements in the vectors that form the pivot table's columns. The simplest way to generate these numbers is to maintain a counter and increment it by one whenever a new element is added to the row header tree. The column vectors themselves are most easily stored as binary trees in which the row number serves as the key; hash tables can also be used. Since the procedures for working with such trees are discussed in detail in other sources, we will not dwell on them and will consider the general scheme for adding an element to a column.
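A sketch of one column stored as a binary tree: the key is the row number issued by the counter just described, and each node carries the aggregated value of the corresponding cell (the aggregate is a simple sum here, and balancing of the tree is omitted for brevity):

type
  PCellNode = ^TCellNode;
  TCellNode = record
    RowNo: Integer;          // key: unique row number
    Total: Double;           // aggregated cell value (a sum here)
    Left, Right: PCellNode;
  end;

// Finds the cell for RowNo in the column tree, creating it if absent,
// and adds the fact value to its aggregate.
procedure AddToColumn(var Root: PCellNode; RowNo: Integer; Value: Double);
begin
  if Root = nil then
  begin
    New(Root);
    Root^.RowNo := RowNo;
    Root^.Total := Value;
    Root^.Left := nil;
    Root^.Right := nil;
  end
  else if RowNo < Root^.RowNo then
    AddToColumn(Root^.Left, RowNo, Value)
  else if RowNo > Root^.RowNo then
    AddToColumn(Root^.Right, RowNo, Value)
  else
    Root^.Total := Root^.Total + Value;  // the cell exists: aggregate
end;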

In a generalized form, the sequence of actions for adding an element to the matrix can be described as follows:

1. Determine the row numbers to which elements are added.

2. Determine the set of columns to which elements are added.

3. For all these columns, find the elements with the required row numbers and add the current element to them (adding includes attaching the required number of fact values and calculating the aggregated values that can be computed incrementally).

After executing this algorithm, we obtain the matrix that is the pivot table we set out to build.
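Putting the previous sketches together, the three steps might look as follows. AddHeaderPath and AddToColumn are taken from the sketches above, while RowHeaderRoot, ColumnHeaderRoot, RowNumberOf (the counter-based numbering) and ColumnsFor (the lookup of the data column and its totals columns) are hypothetical helpers named only for illustration:

type
  TColumnRef = ^PCellNode;  // reference to a column's root pointer

procedure AddFactToMatrix(const RowCoords, ColCoords: array of Integer;
  Value: Double);
var
  RowNo, i: Integer;
  Columns: array of TColumnRef;
begin
  // 1. Row coordinates -> row header tree -> unique row number
  //    (RowNumberOf is a hypothetical counter-based numbering helper).
  RowNo := RowNumberOf(AddHeaderPath(RowHeaderRoot, RowCoords));
  // 2. Column coordinates -> the data column plus every totals column
  //    it rolls up into (ColumnsFor is a hypothetical lookup helper).
  Columns := ColumnsFor(ColumnHeaderRoot, ColCoords);
  // 3. Add the value to the cell with number RowNo in each column,
  //    updating the incremental aggregates as in the sketches above.
  for i := 0 to High(Columns) do
    AddToColumn(Columns[i]^, RowNo, Value);
end;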

Now a few words about filtering when building a slice. It is easiest to apply at the stage of constructing the matrix, since at this stage we have access to all the required fields and the values are being aggregated. When a record is retrieved from the cache, it is checked against the filtering conditions, and if it does not satisfy them, the record is discarded.
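A sketch of such a check at matrix-construction time; TSourceRecord and the filter by a set of allowed product coordinates are illustrative assumptions:

type
  TSourceRecord = record
    ProductCoord: Integer; // coordinates already assigned by the dictionaries
    DateCoord: Integer;
    Amount: Double;
  end;

// True if the record satisfies the current slice filter; records that
// fail the check are discarded before they reach the matrix.
function PassesFilter(const R: TSourceRecord;
  const AllowedProducts: array of Integer): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 0 to High(AllowedProducts) do
    if R.ProductCoord = AllowedProducts[i] then
    begin
      Result := True;
      Break;
    end;
end;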

Since the structure described above fully describes the pivot table, its visualization is a trivial task that can be handled by the standard table components available in almost all programming tools for Windows.

The first product to execute OLAP queries was Express (from IRI). However, the term OLAP itself was coined by Edgar Codd, "the father of relational databases." Codd's work was funded by Arbor, the company that a year earlier had released its own OLAP product, Essbase (later acquired by Hyperion, which in turn was acquired by Oracle in 2007). Other well-known OLAP products include Microsoft Analysis Services (formerly OLAP Services, part of SQL Server), Oracle OLAP Option, IBM's DB2 OLAP Server (actually EssBase with additions from IBM), SAP BW, and products from Brio, BusinessObjects, Cognos, MicroStrategy and other manufacturers.

From a technical point of view, the products on the market are divided into "physical OLAP" and "virtual". In the first case, there is a program that performs a preliminary calculation of aggregates, which are then stored in a special multidimensional database that provides fast extraction. Examples of such products are Microsoft Analysis Services, Oracle OLAP Option, Oracle / Hyperion EssBase, Cognos PowerPlay. In the second case, the data is stored in relational DBMS, and the aggregates may not exist at all or be created on the first request in the DBMS or the analytical software cache. Examples of such products are SAP BW, BusinessObjects, Microstrategy. Systems based on "physical OLAP" provide consistently better response times than "virtual OLAP" systems. Virtual OLAP vendors claim to be more scalable to support very large amounts of data.

In this work, I would like to take a closer look at the product of BaseGroup Labs - Deductor.

Deductor is an analytical platform, i.e. the basis for creating complete application solutions. The technologies implemented in Deductor allow, on the basis of a single architecture, to go through all the stages of building an analytical system: from creating a data warehouse to automatically selecting models and visualizing the results obtained.

System composition:

Deductor Studio is the analytical core of the Deductor platform. It includes a full set of mechanisms that make it possible to obtain information from an arbitrary data source, carry out the entire processing cycle (cleaning, transforming data, building models), display the results in the most convenient way (OLAP, tables, charts, decision trees...) and export them.

Deductor Viewer is the end-user workstation. The program minimizes the requirements for personnel, since all required operations are performed automatically by previously prepared processing scripts; there is no need to think about how the data is obtained or by what mechanisms it is processed. The Deductor Viewer user only needs to select the report of interest.

Deductor Warehouse is a multidimensional cross-platform data warehouse that accumulates all the information necessary for analyzing the subject area. The use of a single repository allows for convenient access, high processing speed, information consistency, centralized storage and automatic support of the entire data analysis process.

4. Client-Server

Deductor Server is designed for remote analytical processing. It makes it possible both to automatically "run" data through existing scripts on the server and to retrain existing models. Using Deductor Server allows a full three-tier architecture to be implemented, in which it acts as the application server. Access to the server is provided by the Deductor Client.

Work principles:

1. Data import

Analysis of any information in Deductor begins with data import. As a result of the import, the data is brought to a form suitable for subsequent analysis using all the mechanisms available in the program. The nature of the data, its format, the DBMS, etc. do not matter, since the mechanisms for working with all of them are unified.

2. Data export

The presence of export mechanisms allows you to send the obtained results to third party applications, for example, transfer a sales forecast to the system for generating a purchase order or post a prepared report on a corporate website.

3. Data processing

Processing in Deductor means any action associated with some kind of data transformation, for example, filtering, building a model, cleaning, and so on. Actually, in this block, the most important actions from the point of view of analysis are performed. The most significant feature of the processing mechanisms implemented in Deductor is that the data obtained as a result of processing can again be processed by any of the methods available to the system. Thus, you can build arbitrarily complex processing scenarios.

4. Visualization

Data can be visualized in Deductor Studio (Viewer) at any stage of processing. The system itself determines how this can be done; for example, if a trained neural network is involved, then in addition to tables and diagrams you can view the graph of the neural network. The user only needs to select the desired option from the list and configure a few parameters.

5. Mechanisms of integration

Deductor does not provide data entry tools - the platform is focused exclusively on analytical processing. To use information stored in heterogeneous systems, flexible import-export mechanisms are provided. Interaction can be organized using batch execution, working in OLE server mode, and calling Deductor Server.

6. Knowledge replication

Deductor implements one of the most important functions of any analytical system - support for the knowledge replication process, i.e. giving employees who are not versed in analysis methods or in the way a particular result is obtained the ability to receive an answer based on models prepared by an expert.

Conclusion

In the present work, such an area of modern information technology as data analysis systems has been considered. The main tool for analytical information processing, OLAP technology, has been analyzed. The essence of the OLAP concept and the significance of OLAP systems in the modern business process have been disclosed in detail. The structure and operation of a ROLAP system have been described in detail. The Deductor analytical platform has been presented as an example of an OLAP implementation. The submitted documentation has been developed and meets the requirements.

OLAP technology is a powerful tool for real-time data processing. An OLAP server makes it possible to organize and present data in a variety of analytical slices and turns data into valuable information that helps companies make better decisions.

The use of OLAP systems provides consistently high levels of performance and scalability, supporting data volumes of several gigabytes, which can be accessed by thousands of users. With the help of OLAP technologies, information is accessed in real time, i.e. processing requests no longer slows down the analysis process, ensuring its efficiency and effectiveness. Visual administration tools allow you to develop and implement even the most complex analytical applications, making the process simple and fast.
