• 欢迎访问速搜资源吧,如果在网站上找不到你需要的资源,可以在留言板上留言,管理员会尽量满足你!

【速搜问答】数据清洗是什么

问答 admin 3年前 (2020-08-13) 586次浏览 已收录 0个评论

汉英对照:
Chinese-English Translation:

数据清洗是指发现并纠正数据文件中可识别的错误的最后一道程序,包括检查数据一致性,处理无效值和缺失值等。与问卷审核不同,录入后的数据清理一般是由计算机而不是人工完成。其目的在于删除重复信息、纠正存在的错误,并提供数据一致性。

Data cleaning is the last procedure to find and correct the identified errors in data files, including checking data consistency, handling invalid and missing values, etc. Different from the questionnaire audit, the data cleaning after input is generally completed by computer rather than manual. Its purpose is to remove duplicate information, correct existing errors, and provide data consistency.

数据清洗是指发现并纠正数据文件中可识别的错误的最后一道程序,包括检查数据一致性,处理无效值和缺失值等。与问卷审核不同,录入后的数据清理一般是由计算机而不是人工完成。

Data cleaning is the last procedure to find and correct the identified errors in data files, including checking data consistency, handling invalid and missing values, etc. Different from the questionnaire audit, the data cleaning after input is generally completed by computer rather than manual.

基本概念

Basic concepts

数据清洗(Data cleaning)– 对数据进行重新审查和校验的过程,目的在于删除重复信息、纠正存在的错误,并提供数据一致性。

Data cleaning – the process of reviewing and verifying data to remove duplicate information, correct existing errors, and provide data consistency.

数据清洗从名字上也看的出就是把“脏”的“洗掉”,指发现并纠正数据文件中可识别的错误的最后一道程序,包括检查数据一致性,处理无效值和缺失值等。因为数据仓库中的数据是面向某一主题的数据的集合,这些数据从多个业务系统中抽取而来而且包含历史数据,这样就避免不了有的数据是错误数据、有的数据相互之间有冲突,这些错误的或有冲突的数据显然是我们不想要的,称为“脏数据”。我们要按照一定的规则把“脏数据”“洗掉”,这就是数据清洗。而数据清洗的任务是过滤那些不符合要求的数据,将过滤的结果交给业务主管部门,确认是否过滤掉还是由业务单位修正之后再进行抽取。不符合要求的数据主要是有不完整的数据、错误的数据、重复的数据三大类。数据清洗是与问卷审核不同,录入后的数据清理一般是由计算机而不是人工完成。

From the name of data cleaning, it can be seen that “dirty” is “washed out”. It refers to the last procedure to find and correct the identifiable errors in the data file, including checking data consistency, processing invalid and missing values, etc. Because the data in the data warehouse is a collection of data oriented to a certain topic. These data are extracted from multiple business systems and contain historical data. In this way, it is inevitable that some data are wrong data and some data conflict with each other. These wrong or conflicting data are obviously not what we want, which is called “dirty data”. We should “wash out” the “dirty data” according to certain rules, which is data cleaning. The task of data cleaning is to filter those data that do not meet the requirements, and submit the filtered results to the competent business department to confirm whether to filter out or to correct by the business unit before extracting. The data that does not meet the requirements mainly include incomplete data, wrong data and repeated data. Data cleaning is different from questionnaire audit. The data cleaning after input is generally completed by computer rather than manual.

一致性检查

Consistency check

一致性检查(consistency check)是根据每个变量的合理取值范围和相互关系,检查数据是否合乎要求,发现超出正常范围、逻辑上不合理或者相互矛盾的数据。例如,用 1-7 级量表测量的变量出现了 0 值,体重出现了负数,都应视为超出正常值域范围。SPSS、SAS、和 Excel 等计算机软件都能够根据定义的取值范围,自动识别每个超出范围的变量值。具有逻辑上不一致性的答案可能以多种形式出现:例如,许多调查对象说自己开车上班,又报告没有汽车;或者调查对象报告自己是某品牌的重度购买者和使用者,但同时又在熟悉程度量表上给了很低的分值。发现不一致时,要列出问卷序号、记录序号、变量名称、错误类别等,便于进一步核对和纠正。

Consistency check is to check whether the data meet the requirements according to the reasonable value range and relationship of each variable, and find out the data beyond the normal range, logically unreasonable or contradictory. For example, if the variables measured by 1-7 scale have zero value and the weight has negative number, they should be regarded as beyond the normal range. SPSS, SAS, Excel and other computer software can automatically identify each out of range variable value according to the defined value range. The logically inconsistent answers may come in many forms: for example, many respondents reported that they drove to work and reported that they did not have a car; or respondents reported that they were heavy buyers and users of a certain brand, but at the same time gave a low score on the familiarity scale. When inconsistency is found, the questionnaire serial number, record number, variable name, error category, etc. shall be listed for further verification and correction.

无效值和缺失值的处理

Handling of invalid and missing values

由于调查、编码和录入误差,数据中可能存在一些无效值和缺失值,需要给予适当的处理。常用的处理方法有:估算,整例删除,变量删除和成对删除。

Due to the survey, coding and input errors, there may be some invalid and missing values in the data, which need to be handled properly. The common processing methods are: estimation, whole case deletion, variable deletion and pair deletion.

估算(estimation)。最简单的办法就是用某个变量的样本均值、中位数或众数代替无效值和缺失值。这种办法简单,但没有充分考虑数据中已有的信息,误差可能较大。另一种办法就是根据调查对象对其他问题的答案,通过变量之间的相关分析或逻辑推论进行估计。例如,某一产品的拥有情况可能与家庭收入有关,可以根据调查对象的家庭收入推算拥有这一产品的可能性。

Estimation. The simplest way is to replace the invalid and missing values with the sample mean, median or mode of a variable. This method is simple, but does not fully consider the existing information in the data, and the error may be large. The other way is to estimate by correlation analysis or logical inference among variables according to the answers of other questions. For example, the ownership of a product may be related to household income, and the possibility of owning the product can be calculated based on the household income of the respondents.

整例删除(casewise deletion)是剔除含有缺失值的样本。由于很多问卷都可能存在缺失值,这种做法的结果可能导致有效样本量大大减少,无法充分利用已经收集到的数据。因此,只适合关键变量缺失,或者含有无效值或缺失值的样本比重很小的情况。

Case wise deletion is to remove samples with missing values. As many questionnaires may have missing values, the result of this practice may lead to a significant reduction in the effective sample size, and the collected data can not be fully utilized. Therefore, it is only suitable for the case that the key variables are missing, or the proportion of samples with invalid or missing values is very small.

变量删除(variable deletion)。如果某一变量的无效值和缺失值很多,而且该变量对于所研究的问题不是特别重要,则可以考虑将该变量删除。这种做法减少了供分析用的变量数目,但没有改变样本量。

Variable deletion. If there are many invalid and missing values of a variable, and the variable is not particularly important for the problem studied, then the variable can be deleted. This approach reduces the number of variables for analysis, but does not change the sample size.

成对删除(pairwise deletion)是用一个特殊码(通常是 9、99、999 等)代表无效值和缺失值,同时保留数据集中的全部变量和样本。但是,在具体计算时只采用有完整答案的样本,因而不同的分析因涉及的变量不同,其有效样本量也会有所不同。这是一种保守的处理方法,最大限度地保留了数据集中的可用信息。

Pairwise deletion uses a special code (usually 9, 99, 999, etc.) to represent invalid and missing values, while retaining all variables and samples in the dataset. However, only samples with complete answers are used in the specific calculation, so the effective sample size of different analysis will be different due to different variables involved. This is a conservative approach that preserves as much information as possible in the dataset.

采用不同的处理方法可能对分析结果产生影响,尤其是当缺失值的出现并非随机且变量之间明显相关时。因此,在调查中应当尽量避免出现无效值和缺失值,保证数据的完整性。

Different treatment methods may affect the analysis results, especially when the missing values are not random and the variables are significantly correlated. Therefore, in the investigation, we should try to avoid invalid values and missing values to ensure the integrity of the data.

数据清洗原理

Principle of data cleaning

数据清洗原理:利用有关技术如数理统计、数据挖掘或预定义的清理规则将脏数据转化为满足数据质量要求的数据。

Data cleaning principle: the dirty data can be transformed into data that meets the data quality requirements by using relevant technologies such as mathematical statistics, data mining or predefined cleaning rules.

主要类型

Main types

残缺数据

incomplete data

这一类数据主要是一些应该有的信息缺失,如供应商的名称、分公司的名称、客户的区域信息缺失、业务系统中主表与明细表不能匹配等。对于这一类数据过滤出来,按缺失的内容分别写入不同 Excel 文件向客户提交,要求在规定的时间内补全。补全后才写入数据仓库。

This kind of data is mainly due to the lack of information, such as the name of the supplier, the name of the branch company, the lack of regional information of the customer, and the mismatch between the main table and the detailed list in the business system. For this kind of data filtered out, according to the missing content, they are written into different Excel files and submitted to the customer, and they are required to complete within the specified time. Write to the data warehouse after completion.

错误数据

Wrong data

这一类错误产生的原因是业务系统不够健全,在接收输入后没有进行判断直接写入后台数据库造成的,比如数值数据输成全角数字字符、字符串数据后面有一个回车操作、日期格式不正确、日期越界等。这一类数据也要分类,对于类似于全角字符、数据前后有不可见字符的问题,只能通过写 SQL 语句的方式找出来,然后要求客户在业务系统修正之后抽取。日期格式不正确的或者是日期越界的这一类错误会导致 ETL 运行失败,这一类错误需要去业务系统数据库用 SQL 的方式挑出来,交给业务主管部门要求限期修正,修正之后再抽取。

The reason for this kind of error is that the business system is not sound enough. After receiving the input, no judgment is made and the data is directly written into the background database. For example, the numerical data is input into full width digital characters, there is a carriage return operation after the string data, the date format is incorrect, and the date is out of line. This kind of data also needs to be classified. For problems like full width characters and invisible characters before and after the data, you can only find them by writing SQL statements, and then ask customers to extract them after the business system is corrected. If the date format is incorrect or the date is out of line, ETL will fail to run. This kind of error needs to be picked out by SQL in the business system database and submitted to the competent business department for correction within a specified period of time, and then extracted after correction.

重复数据

Duplicate data

对于这一类数据——特别是维表中会出现这种情况——将重复数据记录的所有字段导出来,让客户确认并整理。

For this type of data – especially in dimension tables – all fields of duplicate data records are exported for customer confirmation and collation.

数据清洗是一个反复的过程,不可能在几天内完成,只有不断的发现问题,解决问题。对于是否过滤,是否修正一般要求客户确认,对于过滤掉的数据,写入 Excel 文件或者将过滤数据写入数据表,在 ETL 开发的初期可以每天向业务单位发送过滤数据的邮件,促使他们尽快地修正错误,同时也可以做为将来验证数据的依据。数据清洗需要注意的是不要将有用的数据过滤掉,对于每个过滤规则认真进行验证,并要用户确认。

Data cleaning is a repeated process, can not be completed in a few days, only constantly find problems, solve problems. In addition, they can also write the data in the initial stage of the business to verify whether they can filter the data in the early stage of the business as soon as possible. Data cleaning needs to pay attention to not filtering out the useful data, for each filtering rule carefully verified, and to the user to confirm.


速搜资源网 , 版权所有丨如未注明 , 均为原创丨转载请注明原文链接:【速搜问答】数据清洗是什么
喜欢 (0)
[361009623@qq.com]
分享 (0)
发表我的评论
取消评论
表情 贴图 加粗 删除线 居中 斜体 签到

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址