Title: INT303 Big Data Analytics, review notes for personal use Author: 王柳 Time: yesterday 08:08
These notes roughly cover the course's PPT content; my understanding of some parts is not necessarily accurate.
1. Introduction
What a data analyst does: DATA ANALYST = Statisticians + Computer Engineering
“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”
→ Data must be broken down and analyzed before it has value.
Data is COMPLEX and INTERCONNECTED: there are many types of data, it has a spatial and temporal aspect, and different types of data can be linked to each other.
Value is obtained by mining the data. Examples: smart cars, personalized medicine, and "All major soccer and basketball teams use data mining to make decisions."
“Data Mining is the study of collecting, processing, analyzing, and gaining useful insights from data” – Charu Aggarwal
Data mining means collecting, processing, and analyzing data to obtain useful information. In other words, we need:
- To be able to understand it
- To process it
- To extract value from it
- To visualize it
- To communicate
Data mining pipeline: at its most basic, collection → preprocessing → mining → postprocessing
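When collecting data, the first step is to decide what to measure (Deciding what to measure is important).
Analyzing the data means asking "What does the data tell us?"; we can approach this through who, when, where, why, and how.
During analysis, how do we judge the similarity of data? → clustering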
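Triadic closure principle: links often form triangles. For example, if A and B both know C, then A and B are likely to get to know each other at some point. Recommendation systems work on a similar idea and can be used to fill in gaps in the data.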
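To make the triangle idea concrete, here is a minimal sketch (my own illustration, not from the slides) that suggests new links by counting common neighbors. The function name `recommend_by_triadic_closure` and the toy edge list are assumptions for illustration only; real recommendation systems use far richer signals.

```python
from collections import defaultdict
from itertools import combinations


def recommend_by_triadic_closure(edges):
    """Rank unconnected pairs by how many neighbors they share.

    Assumption: an undirected friendship graph given as (a, b) pairs.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to recommend
        common = neighbors[u] & neighbors[v]
        if common:
            scores[(u, v)] = len(common)

    # Most shared neighbors first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy graph: A and B both know C and E.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "E"), ("B", "E")]
print(recommend_by_triadic_closure(edges))
# → [(('A', 'B'), 2), (('C', 'E'), 2), (('A', 'D'), 1), (('B', 'D'), 1)]
```

A and B share two neighbors (C and E), so they come out as the strongest candidate for a new link, which matches the "A and B both know C" example.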
Prediction on data: predicting a real value → Regression; a yes/no question → Binary classification; a question with several possible answers → Classification
2. Data
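Data is a collection of data objects and their attributes; each object (each record) has multiple attributes.
In a database, we assume the data is dense (only a few attributes are empty).
Relational data: a set of records that share the same fixed attributes. When it contains both numeric and categorical attributes, it is called Mixed Relational Data.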
Attributes are generally divided into:
Numeric: Examples: dates, temperature, time, length, value, count. Can be treated as points or vectors and visualized that way.
Categorical: Examples: eye color, zip codes, strings, rankings (e.g., good, fair, bad), height in {tall, medium, short}
Set: Example: breakfast {egg, bread, milk...}; a single attribute holds a set of values.
Dependent: data that is related to other data, divided into:
- Ordered sequences: Each object is an ordered sequence of values
- Time series data: Sequence of ordered (over “time”) numeric values
- Spatial data: objects are fixed on specific geographic locations
- Graph data: A collection of pairwise relationships
Attribute transformation:
Deal with missing or inconsistent information: fill in missing values (with a random value, the mean, a cluster mean, etc.), and delete or correct outliers.
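As a small illustration of the "fill with the mean" option, here is a sketch using only the Python standard library. It assumes missing values are represented as `None`; the function name `impute_with_mean` and the sample temperatures are made up for this example.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical temperature readings with two gaps.
temperatures = [20.0, None, 22.0, 24.0, None]
print(impute_with_mean(temperatures))
# → [20.0, 22.0, 22.0, 24.0, 22.0]
```

Mean imputation is the simplest choice; filling with a cluster mean follows the same pattern but computes `fill` only from the objects in the same cluster.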
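Feature extraction and selection:
Create a useful representation of the data by extracting useful features, i.e., extract and select the information that matters.
Why: most of the time the data we get is not neatly separated into fields. Sometimes it is a whole passage of text, and we need to extract the data we want from that passage.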
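Example: to extract features from text data, we can first turn the raw characters into lowercase words, then filter the words by how often they appear to get the data we need. (Some information is lost, but most of the time that is acceptable.)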
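The lowercase-then-count step could look roughly like the sketch below. The tokenizer (runs of a-z letters only) and the `min_count` threshold are my own simplifying assumptions; the notes do not say which words to keep, and in practice very common words are often filtered out as well.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (runs of a-z letters)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def frequent_words(text, min_count=2):
    """Keep only words that appear at least min_count times (assumed rule)."""
    counts = word_counts(text)
    return {word: n for word, n in counts.items() if n >= min_count}


text = "Data is the new oil. Data has to be refined, and refined data has value."
print(frequent_words(text))
# → {'data': 3, 'has': 2, 'refined': 2}
```

Word order and punctuation are thrown away here, which is the information loss mentioned above.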
Inverse Document Frequency can be used to compute how distinctive a word is across different documents: IDF = log(total number of documents / number of documents that contain the word)
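A minimal sketch of the IDF formula above, assuming the natural logarithm (the notes do not state the base) and the same simple a-z tokenizer; the function name and the three toy documents are made up for illustration.

```python
import math
import re


def inverse_document_frequency(documents):
    """Return {word: log(N / number of documents containing the word)}."""
    # Each document becomes a set of its distinct lowercase words.
    tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n_docs / sum(word in doc for doc in tokenized))
        for word in sorted(vocabulary)
    }


docs = ["the cat sat", "the dog ran", "the bird flew"]
idf = inverse_document_frequency(docs)
print(idf["the"])  # appears in every document, so log(3/3) = 0.0
print(idf["cat"])  # appears in one document, so log(3/1)
```

A word that appears in every document gets an IDF of 0 (not distinctive at all), while a word found in only one document gets the highest score.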