
[速搜 Q&A] What is a data mining system?

Q&A | admin | 9 months ago (04-13)


A data mining system is a system that extracts interesting knowledge from large volumes of data stored in databases, data warehouses, or other information repositories. Data mining, also known as knowledge discovery in databases (KDD), is the process of extracting that knowledge. In recent years, to promote the practical application of data mining, researchers have done extensive work on the architecture of data mining systems.

Characteristics

A well-structured data mining system should have the following characteristics:

(1) A complete set of system functions and auxiliary tools;

(2) Extensibility;

(3) Support for multiple data sources;

(4) The ability to process large volumes of data;

(5) A good user interface and clear presentation of results.

Current data mining systems fall mainly into two groups: centralized and distributed. Within each group, the overall structure and the individual components can be built with many different technologies and in many different ways.

Centralized data mining systems

Data mining systems built on a single database or data warehouse are currently the more mature kind of data mining application system, and many commercial data mining ...

Architecture of a centralized data mining system

User interface and knowledge representation layer

This layer provides a friendly user interface and uses data visualization to present mining results, which greatly improves the usability of the system. Visualization in data mining means using visualization techniques to discover hidden, useful knowledge in large data sets. It covers three things: visualization of the data, of the mining process, and of the mining models. Current techniques include traditional geometric methods (line charts, histograms, scatter plots, pie charts, and so on), SOM (self-organizing map) network visualization, parallel coordinates, and pixel-oriented visualization. SOM-based and parallel-coordinate visualization are the two most widely used; both work by mapping high-dimensional data to two dimensions so that it can be drawn on a plane. For example, Wang Jiacai et al. designed VISMiner, a visual mining system based on SOM networks, and Liu Kan et al. studied the application of parallel coordinates in data mining systems.
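As a rough illustration of how parallel coordinates put high-dimensional records on a plane, the sketch below turns each n-dimensional row into a 2-D polyline. The function name `parallel_coordinates`, the min-max scaling of each axis, and the sample rows are assumptions made for this example; they are not taken from any of the systems mentioned above.

```python
from typing import List, Sequence, Tuple


def parallel_coordinates(rows: Sequence[Sequence[float]]) -> List[List[Tuple[int, float]]]:
    """Map each n-dimensional row to a 2-D polyline.

    Axis i is a vertical line at x = i. A value is placed on its axis at a
    height in [0, 1] obtained by min-max scaling that dimension (assumed
    scaling rule; real tools may scale differently).
    """
    if not rows:
        return []
    dims = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(dims)]
    highs = [max(r[i] for r in rows) for i in range(dims)]
    polylines = []
    for r in rows:
        line = []
        for i in range(dims):
            span = highs[i] - lows[i]
            # A constant dimension has no spread, so place it mid-axis.
            y = 0.5 if span == 0 else (r[i] - lows[i]) / span
            line.append((i, y))
        polylines.append(line)
    return polylines


# Three hypothetical 4-dimensional records.
sample = [
    [5.0, 200.0, 1.0, 30.0],
    [7.0, 100.0, 3.0, 30.0],
    [6.0, 150.0, 2.0, 30.0],
]
lines = parallel_coordinates(sample)
```

Each polyline can then be handed to any 2-D plotting tool; records with similar values trace similar paths across the axes, which is what makes clusters and outliers visible.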

Control layer

The control layer governs the execution flow of the system and coordinates the relationships and execution order of the functional components. Its main job is to parse each data mining task and, from the parsed result, determine which data the task involves and which data mining algorithm should be applied.

Data mining tasks are usually defined and interpreted through a data mining language. Many researchers have proposed their own languages, such as DMQL; structurally these are all SQL-like, but no standard has emerged from them. In March 2000 Microsoft released a new data mining language specification, OLE DB for Data Mining (OLE DB for DM), a significant step toward standardization. Amir Netz et al. described in detail how to apply the OLE DB for DM specification in a data mining system.

Data source layer

To improve data consistency and integrity, data scattered across multiple sources should first be integrated into a single database or data warehouse through preprocessing steps such as data cleaning and data integration. To improve extensibility and hide the specific database product behind each data source, the database interface should use a technology such as ODBC, JDBC, or OLE DB, so that the data source can be changed easily. Zhao Zhihong et al. and Qian Weining et al. proposed data mining system frameworks, with applications, based on data warehouses and on large-scale databases respectively.

A database can be integrated into a data mining system in four ways: no coupling, loose coupling, semi-tight coupling, and tight coupling. Tight coupling is the ideal: data mining queries are optimized into an iterative process of mining and retrieval, so that the mining system and the database work as one. This makes full use of the database's own data processing functions, such as querying and aggregation, reduces the development burden of the mining system, and improves efficiency. Rosa Meo proposed a data mining system framework that achieves tight coupling with the database through the data mining language MINE RULE.
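The idea of letting the database do part of the mining work can be sketched with Python's built-in `sqlite3` module, used here as a stand-in for an ODBC, JDBC, or OLE DB connection. The `purchases` table, its columns, and the `frequent_items` function are hypothetical; the point is only that support counting is pushed into a SQL aggregate instead of being computed in application code.

```python
import sqlite3

# In-memory database with a hypothetical transaction table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [
        (1, "bread"), (1, "milk"),
        (2, "bread"),
        (3, "bread"), (3, "milk"), (3, "eggs"),
    ],
)


def frequent_items(connection, min_support_count):
    """Ask the database to count item support and return only frequent items.

    The GROUP BY / HAVING clause does the counting and filtering inside the
    database engine, so only the qualifying rows are transferred.
    """
    cursor = connection.execute(
        "SELECT item, COUNT(DISTINCT txn_id) AS support "
        "FROM purchases "
        "GROUP BY item "
        "HAVING COUNT(DISTINCT txn_id) >= ? "
        "ORDER BY support DESC, item",
        (min_support_count,),
    )
    return cursor.fetchall()


frequent = frequent_items(conn, 2)
```

Because the code only uses the standard DB-API calls `connect`, `execute`, and `fetchall`, swapping the data source mostly means changing the connection line, which is the portability argument made for ODBC, JDBC, and OLE DB above.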

Mining data set layer

This layer supplies the mining layer with a data set that meets the requirements of the data mining algorithm. The data set is built from the task-relevant data in the data source layer through preprocessing operations such as data transformation and data reduction.
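A minimal sketch of this preparation step follows, under two assumptions that are not in the text: reduction is done by keeping only the task-relevant fields, and transformation is done by min-max normalization to [0, 1]. The record fields and function names are hypothetical.

```python
def min_max_normalize(values):
    """Data transformation: rescale a list of numbers linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: there is no spread to preserve.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def prepare_dataset(records, relevant_fields):
    """Data reduction plus transformation.

    Keeps only the fields relevant to the mining task, then normalizes each
    kept field so that no attribute dominates because of its unit.
    """
    columns = {
        field: min_max_normalize([rec[field] for rec in records])
        for field in relevant_fields
    }
    return [
        {field: columns[field][i] for field in relevant_fields}
        for i in range(len(records))
    ]


# Hypothetical source records; customer_id and visits are not task-relevant.
records = [
    {"customer_id": 101, "age": 20, "income": 3000, "visits": 4},
    {"customer_id": 102, "age": 40, "income": 5000, "visits": 9},
    {"customer_id": 103, "age": 60, "income": 8000, "visits": 1},
]
prepared = prepare_dataset(records, ["age", "income"])
```

Other reduction and transformation methods (sampling, discretization, attribute construction) would slot into the same place; which ones apply depends on the algorithm the mining layer is about to run.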

Besides mining directly on the data in a database or data warehouse, data mining can also run on top of online analytical processing (OLAP); this is called online analytical mining (OLAM). Because OLAM combines the two and draws on the strengths of both, it gives data mining high efficiency and good interactivity. Professor Jiawei Han et al. proposed an architectural framework for an OLAM system that integrates OLAP and data mining, and developed the data mining system DBMiner on that basis. Sanjay Goil et al. studied a scalable architecture, based on parallel processing, for integrating OLAP and data mining.

Mining layer

This layer is the core of a data mining system, and its implementation directly determines the functionality and extensibility of the whole system. Data mining covers several types of pattern: concept/class description, association rule analysis, classification and prediction, cluster analysis, outlier analysis, and evolution analysis. Many different algorithms have been proposed for each type. Which types of pattern mining algorithm a particular system should include depends on the purpose of the system and the application domain it targets.

To improve extensibility, many systems use component technology to implement and manage their data mining algorithms. The more mature component technologies today are COM/DCOM, EJB/Java RMI, and CORBA/IIOP. A component is a clearly identifiable building block of an application that provides a defined function. A typical component has two parts, an interface and an implementation, and the two are kept separate: as long as the application maintains a uniform interface standard, components can easily be added or replaced. For example, the algorithm modules of the smart Miner system designed by Liu Junqiang et al. are built with the Component Object Model (COM), and an algorithm description library provides a registration mechanism for components, so any algorithm module that conforms to the COM standard can easily be added to the system. In the MSMiner system developed by Shi Zhongzhi et al., the core data mining algorithms are implemented as dynamic link libraries (DLLs) that can be loaded while the system is running. MSMiner also provides a dedicated algorithm management module that manages the algorithms through a mining algorithm library and registers them by means of metadata.

Knowledge evaluation and knowledge representation layer

Evaluating the mined knowledge before it is presented to the user effectively removes redundant and useless results, which matters a great deal for the usability of the system. The main evaluation criteria are validity, novelty, potential usefulness, and ultimate understandability. Nie Yanxia et al. described in detail four ways of combining knowledge evaluation with the data mining process.
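One possible form of such an evaluation step, for association rules, is sketched below. The text does not prescribe specific measures, so the choices here are assumptions: support and confidence thresholds stand in for validity and usefulness, and a rule is treated as redundant when a simpler rule with the same consequent is at least as confident.

```python
def evaluate_rules(rules, min_support, min_confidence):
    """Filter mined rules before they are shown to the user.

    Each rule is a dict with keys 'antecedent' (frozenset), 'consequent',
    'support', and 'confidence'. These keys are assumptions of this sketch.
    """
    # Drop weak rules first.
    strong = [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]
    kept = []
    for r in strong:
        # A rule is redundant if another strong rule reaches the same
        # consequent from a proper subset of its antecedent with confidence
        # that is at least as high.
        redundant = any(
            other is not r
            and other["consequent"] == r["consequent"]
            and other["antecedent"] < r["antecedent"]
            and other["confidence"] >= r["confidence"]
            for other in strong
        )
        if not redundant:
            kept.append(r)
    return kept


# Hypothetical mining output.
rules = [
    {"antecedent": frozenset({"bread"}), "consequent": "milk",
     "support": 0.40, "confidence": 0.80},
    {"antecedent": frozenset({"bread", "eggs"}), "consequent": "milk",
     "support": 0.20, "confidence": 0.75},
    {"antecedent": frozenset({"tea"}), "consequent": "sugar",
     "support": 0.01, "confidence": 0.90},
    {"antecedent": frozenset({"milk"}), "consequent": "bread",
     "support": 0.40, "confidence": 0.50},
]
useful = evaluate_rules(rules, min_support=0.05, min_confidence=0.6)
```

Novelty and understandability are harder to quantify and usually need domain knowledge or user feedback, so they are left out of this sketch.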

After evaluation, the knowledge patterns found by a data mining system can be stored in a knowledge base for reuse. To make it easier for different data mining systems to share such patterns, the Data Mining Group (DMG) proposed the Predictive Model Markup Language (PMML). PMML is an XML-based language that provides a uniform standard for defining and describing the predictive models produced by data mining. Systems from different vendors that follow the standard can share predictive models easily, which improves both the reusability of models and the extensibility of systems. Wettschereck et al. described the use of PMML for model exchange.
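To give a feel for XML-based exchange, the sketch below writes and reads back a small PMML-style fragment with Python's standard `xml.etree.ElementTree` module. It is a simplified illustration only: it emits just a header and a data dictionary, contains no model element, and has not been validated against the PMML schema. The helper names and the sample fields are hypothetical.

```python
import xml.etree.ElementTree as ET


def export_data_dictionary(fields):
    """Producer side: serialize field definitions as a PMML-style document.

    `fields` is a list of (name, optype) pairs. Only a small subset of the
    PMML vocabulary is used here, as an illustration.
    """
    root = ET.Element("PMML", {"version": "2.1"})
    ET.SubElement(root, "Header", {"copyright": "example"})
    dictionary = ET.SubElement(
        root, "DataDictionary", {"numberOfFields": str(len(fields))}
    )
    for name, optype in fields:
        ET.SubElement(dictionary, "DataField", {"name": name, "optype": optype})
    return ET.tostring(root, encoding="unicode")


def import_data_dictionary(document):
    """Consumer side: a different system recovers the field definitions."""
    root = ET.fromstring(document)
    return [(f.get("name"), f.get("optype")) for f in root.iter("DataField")]


fields = [("age", "continuous"), ("occupation", "categorical")]
document = export_data_dictionary(fields)
recovered = import_data_dictionary(document)
```

The producer and the consumer share nothing except the document format, which is the property that lets tools from different vendors exchange models.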

The sections above describe the technologies used to implement each part of a centralized data mining system. Many commercial data mining products based on the centralized architecture have appeared and are coming into wide use. The most influential include Enterprise Miner from SAS, Intelligent Miner from IBM, and Clementine from SPSS. Enterprise Miner is integrated with the SAS data warehouse and OLAP, and supports end-to-end knowledge discovery, from extracting and capturing data through to obtaining answers. Intelligent Miner for Data can mine many kinds of data source, such as flat files, databases, data warehouses, and data marts. Clementine adopts the CRISP-DM process model, which lets users carry out and manage the whole data mining process easily and effectively. All three products support PMML 2.1, so mining models can be shared among them.

Distributed data mining systems

As network and distributed database technologies have developed and matured, distributed databases have come into ever wider use, and centralized storage and management of data has gradually given way to distributed storage and management. This change in how data is stored inevitably drives changes in data mining techniques and system architecture. In real applications, data security, ownership, and confidentiality, together with limits on network bandwidth, make it infeasible to first gather the scattered data into one database and then mine it. Distributed data mining has therefore become the most practical way to mine data held in distributed databases.

Steps

Distributed data mining consists of the following steps:

(1) Partition the data to be mined into P subsets, where P is the number of available processors, and send one subset to each processor;

(2) Each processor runs a data mining algorithm on its local subset; different processors may run different algorithms;

(3) Combine the local knowledge found by the individual algorithms into global, consistent knowledge.
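The three steps above can be sketched as follows. This is a minimal single-machine illustration under several assumptions: threads from `concurrent.futures` stand in for the P processors, every worker runs the same toy algorithm (counting how many transactions contain each item), and combining local knowledge is just summing the local counts. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(transactions, p):
    """Step 1: split the data into p subsets (round-robin assignment)."""
    return [transactions[i::p] for i in range(p)]


def mine_locally(subset):
    """Step 2: a worker counts, within its own subset only, the number of
    transactions that contain each item."""
    return Counter(item for txn in subset for item in set(txn))


def combine(local_results):
    """Step 3: merge the local counts into one global, consistent result."""
    total = Counter()
    for local in local_results:
        total.update(local)
    return total


def distributed_item_counts(transactions, p):
    subsets = partition(transactions, p)
    # Threads stand in for separate processors in this sketch.
    with ThreadPoolExecutor(max_workers=p) as pool:
        local_results = list(pool.map(mine_locally, subsets))
    return combine(local_results)


# Hypothetical transaction data.
transactions = [
    ["bread", "milk"],
    ["bread"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
    ["eggs"],
]
global_counts = distributed_item_counts(transactions, 3)
```

Summing works here because item counts are additive across partitions. For models that are not additive, such as decision trees or clusters, the combination step needs its own merging or voting scheme, which is where most of the difficulty of distributed mining lies.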

Research topics

Distributed data mining rests on four key technologies: data concentration, parallel data mining, knowledge assimilation, and distributed software engines.

Research on distributed data mining has two main strands: distributed data mining algorithms and distributed data mining architectures. Quite a few distributed and parallel algorithms already exist, such as the parallel association rule mining algorithms CD (Count Distribution), DD (Data Distribution), and PDM. On the architecture side, many designs based on different technologies have also appeared. For example, Zhang Xueming et al. studied a distributed parallel architecture based on CORBA that uses a multithreaded parallel mining mechanism. Chen Gang studied a distributed data mining architecture based on mobile agent technology. Hou Jingjun et al. proposed a distributed architecture based on Web Services ...

