SQL Server Data Mining Architecture Overview

Microsoft SQL Server 2005 Data Mining offers a rich, well-tuned, integrated, and easy-to-use data mining environment. For those of you who worked with data mining in SQL Server 2000, the 2005 release is a great leap forward. In this section we give an overview of the data mining environment, using the high-level architecture drawing presented in Figure 10.1 as a guide.

Figure 10.1: The SQL Server data mining architecture

From a system point of view, integrating data mining into the overall product allows the data mining service to take advantage of the functionality offered by the rest of the system. For example, point A in Figure 10.1 shows how data mining models are built using the dimensional engine, leveraging its ability to load data and quickly perform base statistical calculations like sums, averages, and counts. The data mining server can also easily pull case data from both relational and Analysis Services databases, as seen at point B in Figure 10.1.

Point C in Figure 10.1 shows how the typical developer will first experience data mining: by creating a BI Studio Analysis Services project and then using the Data Mining Wizard to create a new data mining structure and an initial data mining model. The mining structure is a new construct that provides a metadata layer, allowing several mining models to work with the same input data. Each mining model in a mining structure can have its own algorithm and parameters. The wizard provides model-building guidance, with automatic selection and adjustment of variables based on the algorithm selected. The wizard also helps you create case sets, including complex, nested queries.
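To make the mining structure concept concrete, here is a minimal DMX sketch of roughly what the wizard defines on your behalf. The structure, column, and model names are hypothetical; Microsoft_Decision_Trees is one of the standard algorithms.

    -- Define the shared input metadata, including a nested table
    -- that holds the many child rows belonging to each case.
    CREATE MINING STRUCTURE [Customer Structure] (
        [Customer Key]   LONG KEY,
        [Age]            LONG CONTINUOUS,
        [Yearly Income]  DOUBLE CONTINUOUS,
        [Bike Buyer]     LONG DISCRETE,
        [Purchases] TABLE (
            [Product Name] TEXT KEY
        )
    )

    -- Several models can share the structure, each with its own
    -- algorithm and parameters.
    ALTER MINING STRUCTURE [Customer Structure]
    ADD MINING MODEL [Bike Buyer Trees] (
        [Customer Key],
        [Age],
        [Yearly Income],
        [Bike Buyer] PREDICT
    ) USING Microsoft_Decision_Trees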

The Data Mining Design Environment

When the wizard is finished building the mining structure and the initial data mining model, it drops the developer into the data mining design environment. At this point, the mining model has not been built; the project contains only the metadata that defines the model. The Data Mining Designer is also significantly improved, with new model viewers to examine the models created by the different algorithms, validation checks, and direct access to column properties and model parameters. The Designer is broken up into five tabs to support the data mining process. (Refer to Figure 10.7 to see an example of the Designer.) Several of these tabs work with the completed mining model as it exists on Analysis Services, so they are not available until the model has been built and deployed. The five tabs are:

  • Mining Structure: shows the mining structure with its underlying data source view.

  • Mining Models: shows the source mining structure and all the data mining models that have been defined based on this structure.

  • Mining Model Viewer: lets you select a model and a viewer type, and then provides several sub-tabs, each with a different graphic or tabular representation of the contents of the model. This tab is the primary tool the data miner uses to explore the various models.

  • Mining Model Accuracy Chart: provides two ways to compare the relative accuracy of certain kinds of predictive models: the Lift Chart and the Classification Matrix.

  • Mining Model Prediction: allows the data miner to specify a prediction query using a rudimentary query builder interface.

Figure 10.7: Cluster diagram for the city economic data

Build, Deploy, and Process

Most of the functions in the Data Mining Designer work with the actual model as it exists in Analysis Services. This means once the wizard is complete, the developer must build and deploy the project (which includes processing the model cubes) before any more progress can be made. Building the project writes the metadata out to project files in the development environment. The actual model does not come into being until the project is deployed to an Analysis Services instance. At that point, BI Studio creates a database for the project in Analysis Services. It writes out the model structure metadata and the definition of each model. Finally, it creates a cube for each model and processes the models, inserting the training data so the algorithm can calculate the rules, correlations, and other relationships. Until the project is deployed and a model is processed, it cannot be viewed in the viewers.
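In DMX terms, processing corresponds to the INSERT INTO statement. The following sketch trains the hypothetical structure shown earlier; the [Source DW] data source and the table names are assumptions, and the SHAPE clause binds the nested Purchases rows to their parent cases.

    -- Train every model in the structure from a relational source.
    -- Both queries must be sorted on the case key so SHAPE can
    -- match child rows to their cases.
    INSERT INTO MINING STRUCTURE [Customer Structure]
        ([Customer Key], [Age], [Yearly Income], [Bike Buyer],
         [Purchases] (SKIP, [Product Name]))
    SHAPE {
        OPENQUERY([Source DW],
            'SELECT CustomerKey, Age, YearlyIncome, BikeBuyer
             FROM dbo.TrainingCases ORDER BY CustomerKey')
    }
    APPEND ({
        OPENQUERY([Source DW],
            'SELECT CustomerKey, ProductName
             FROM dbo.TrainingPurchases ORDER BY CustomerKey')
    } RELATE [CustomerKey] TO [CustomerKey]) AS [Purchases]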

Tip 

It is possible to process a single model rather than all models in a mining structure by selecting the model in the Mining Models tab, and then selecting Process Model from the Mining Model menu. This can save a lot of time if you are working with large case sets and complex models.

Accessing the Mining Models

As you see at point D in Figure 10.1, Data Mining eXtensions (DMX) to the SQL language is at the core of all the Microsoft data mining APIs. As the name suggests, DMX is an extension to SQL designed to create, train, modify, and query data mining models. DMX was introduced with SQL Server 2000 as part of the OLE DB for Data Mining API. It has been enhanced in SQL Server 2005 with additional options for data sources and more flexible SELECT functionality. An easy way to begin learning about DMX is to use the Mining Model Prediction tab in the Data Mining Designer and examine the syntax it generates for DMX queries. The code can be copied to a DMX query window in SQL Studio for further exploration. Although DMX is an extension to SQL, queries are submitted to the Analysis Services server; that's where the data mining services are.
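For example, here is a sketch of a basic DMX prediction query against the hypothetical decision tree model from earlier; the [Source DW] data source and the NewCustomers table are assumptions.

    -- Score a batch of new cases against the trained model.
    SELECT
        t.[CustomerKey],
        Predict([Bike Buyer])            AS [Predicted Buyer],
        PredictProbability([Bike Buyer]) AS [Probability]
    FROM [Bike Buyer Trees]
    PREDICTION JOIN
        OPENQUERY([Source DW],
            'SELECT CustomerKey, Age, YearlyIncome
             FROM dbo.NewCustomers') AS t
    ON  [Bike Buyer Trees].[Age] = t.[Age]
    AND [Bike Buyer Trees].[Yearly Income] = t.[YearlyIncome]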

When the development environment submits its DMX commands and queries to Analysis Services, it uses the XML for Analysis (XMLA) API. In fact, because Analysis Services is a native XMLA server, the data mining models are available to any application as a web service by using SOAP protocols. It is still possible to access the server with the older OLE DB APIs, or with ADO and ADO.NET.

Integration Services and Data Mining

Integration Services can play a major role in the data mining process, as shown in Figure 10.1. Many of the standard transforms used for data cleansing and data integration are particularly valuable for building the training and test case sets. Besides the obvious transforms, like Data Conversion and Derived Column, transforms like Percentage Sampling, Row Sampling, Conditional Split, Lookup, and Merge Join are powerful components the data miner can use to build a set of packages to prepare the case sets for the data mining process. This is shown at point E in Figure 10.1. In addition to the standard transforms, there are two Integration Services components that directly interact with the data mining models, shown at point F in Figure 10.1. The Data Mining Model Training destination is the target for a training set used to train (or re-train) an existing mining model. This capability is perfect for the ongoing re-training required to keep certain mining models current; recommendation models, for example. The Data Mining Query task is specifically designed to do prediction joins against a model in the IS pipeline, once the model has been trained and tested. This could be used to add scores to the Customer table or identify significant data anomalies during the nightly ETL process. It could also be used in a real-time mode to flag transactions that are potentially fraudulent.
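As an illustration, a query like the following sketch is the kind of statement the Data Mining Query task might execute for the real-time fraud scenario; the fraud model and its columns are purely hypothetical. A singleton query like this one scores a single transaction on the fly.

    -- Score one transaction in real time; NATURAL PREDICTION JOIN
    -- matches input columns to model columns by name.
    SELECT
        Predict([Fraud Flag])                 AS [Is Fraud],
        PredictProbability([Fraud Flag], 'Y') AS [Fraud Probability]
    FROM [Fraud Model]
    NATURAL PREDICTION JOIN
        (SELECT 842.50   AS [Transaction Amount],
                'Online' AS [Channel],
                3        AS [Transactions Today]) AS t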

Additional Features

There are several additional features that will be important for certain applications. Most of these can be found at point G in Figure 10.1 and are listed briefly here:

  • Extensibility: Microsoft has provided a set of COM APIs that allow developers to integrate additional data mining algorithms into the data mining engine. They can integrate custom viewers into the Data Mining Designer as well. Someone could even create a new viewer for an existing Microsoft algorithm.

  • Analysis Management Objects (AMO): AMO is a new API for managing the creation and maintenance of data mining objects; it supports creating, processing, backing up, restoring, and securing them.

  • Stored procedures and user-defined functions: A developer can create what are essentially stored procedures or user-defined functions and load them as managed assemblies into Analysis Services. This allows clients to work with large mining models through the intermediate layer of the server-based managed assembly. (See the sketch after this list.)

  • Text mining: It is possible to do some interesting data mining on unstructured text data, like the text in HTML files in a set of web directories, or even text fields in a database. For example, use the Integration Services Term Extraction transformation to build a dictionary of terms found in the text files. Then use data mining to create classification rules to categorize the documents according to the terms they contain. This is essentially a data mining application, not a new data mining algorithm, but it has value in certain areas.
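Once a managed assembly is deployed to Analysis Services, its procedures are invoked from DMX with the CALL statement. The assembly and procedure names below are purely hypothetical.

    -- Invoke a hypothetical server-side procedure that returns the
    -- ten strongest rules from a model, without pulling the whole
    -- model content down to the client.
    CALL MyMiningAssembly.GetTopRules('Bike Buyer Trees', 10)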

Reference 

To find more information about the text mining technique, see the Books Online topic Term Extraction Transformation, or visit www.sqlserverdatamining.com for a text mining tutorial.

Architecture Summary

Our goal in this section is to show how data mining fits into the overall SQL Server 2005 architecture, and how well data mining has been integrated into the SQL Server environment. It should be clear by now that data mining is a serious element of the BI toolset. Data mining and application developers will have a lot more to learn before they are proficient with all the data mining components. Fortunately, Microsoft's documentation is heavily weighted toward the development community. Additional help should be easy to find. A good place to start is a web site called www.sqlserverdatamining.com, maintained by the SQL Server data mining development team.


