Text Content Statistics with Hadoop
0. Create a directory on HDFS: hdfs dfs -mkdir -p /web/nginx/log
hdfs dfs -put access.log /web/nginx/log/
1. First, switch to the Aliyun yum mirror
curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all
yum makecache
yum -y update
2. Install Python 3
yum install python3
3. Locate the Hadoop Streaming jar
cd /export/server/hadoop-3.2.2/share/hadoop/tools/lib
ls
4. Count the text content
Create a working directory and write the mapper in it:
mkdir /WordCountTask
cd /WordCountTask
vi mapper.py
The file contents:
#!/usr/bin/python3
import sys

for fs in sys.stdin:
    fss = fs.split()
    # Only process complete log lines (15 whitespace-separated fields here)
    if len(fss) == 15:
        # Field 8 is the HTTP status code in the common nginx combined layout;
        # the original field indices were lost in extraction, so all indices
        # below are assumptions -- adjust them to your log format
        if fss[8] == "200":
            print(fss[0], fss[3], fss[5], fss[6], fss[8], fss[9], fss[10])
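Before submitting the job, the mapper's parsing logic can be checked locally. The log line below is fabricated, and the field positions (status code at index 8) are assumptions about the combined nginx format, as noted in the script:

```python
# Local check of the mapper's parsing logic (sample line is fabricated)
sample = ('192.168.1.10 - - [10/Oct/2023:13:55:36 +0800] '
          '"GET /index.html HTTP/1.1" 200 612 "-" '
          '"Mozilla/5.0 (X11; Linux) Chrome/96"')

fss = sample.split()
print(len(fss))   # number of whitespace-separated fields -> 15
print(fss[8])     # assumed position of the HTTP status code -> 200
```

If your own access.log lines split into a different number of fields, change the `len(fss) == 15` guard and the index of the status code accordingly.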
5. Run the script for Question 3
Here $HADOOP_HOME points to /export/server/hadoop-3.2.2/
Make the script executable first, then submit the map-only job (the shipped file is referenced by its base name):
chmod +x /WordCountTask/mapper.py
hadoop jar /export/server/hadoop-3.2.2/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
-input /web/nginx/log/ \
-output /web/nginx/result1 \
-file /WordCountTask/mapper.py \
-mapper mapper.py
6. Write the scripts for Question 4
Go into WordCountTask:
cd /WordCountTask
vi m1.py
The file contents:
#!/usr/bin/python3
import sys

for syes in sys.stdin:
    i_01 = syes.split()
    if len(i_01) == 15:
        # Field 8 is assumed to be the HTTP status code (see mapper.py)
        if i_01[8] == "200":
            # Emit one key per line for the reducer to count;
            # using field 6 (the request path) is an assumption
            print(i_01[6])
Then write the reducer:
vi m2.py
The file contents:
#!/usr/bin/python3
import sys

i_02 = {}
for syees in sys.stdin:
    if syees in i_02:
        i_02[syees] += 1
    else:
        i_02[syees] = 1
for y in i_02:
    # Strip the trailing newline/tabs before printing the key and its count
    a = y.replace('\n', '').replace('\t', '')
    print(a, i_02[y])
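The reducer's counting logic can be verified off-cluster by feeding it a few sample keys. This sketch inlines the same dictionary logic instead of reading sys.stdin (the keys are made up):

```python
# Same counting logic as m2.py, applied to a fabricated key stream
lines = ["/index.html\n", "/about\n", "/index.html\n"]

counts = {}
for line in lines:
    if line in counts:
        counts[line] += 1
    else:
        counts[line] = 1

for key in counts:
    cleaned = key.replace('\n', '').replace('\t', '')
    print(cleaned, counts[key])  # prints each key with its count
```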
7. Run Question 4
Note: the output directory from step 5 (/web/nginx/result1) already exists, and Hadoop refuses to overwrite an existing output path, so use a fresh one here. The mapper for this job is m1.py, and m2.py must also be shipped with -file:
chmod +x /WordCountTask/m1.py /WordCountTask/m2.py
hadoop jar /export/server/hadoop-3.2.2/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
-input /web/nginx/log/ \
-output /web/nginx/result2 \
-file /WordCountTask/m1.py \
-mapper m1.py \
-file /WordCountTask/m2.py \
-reducer m2.py
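The whole job can be simulated before submitting: Hadoop Streaming is essentially `map | sort | reduce` over stdin/stdout. This sketch reproduces that pipeline in plain Python on two fabricated 15-field log lines, with the same assumed field indices as m1.py (path at 6, status at 8):

```python
from collections import Counter

# Two fabricated log lines; the combined-format field layout is an assumption
log = [
    '1.2.3.4 - - [10/Oct/2023:13:55:36 +0800] "GET /a HTTP/1.1" 200 10 "-" '
    '"Mozilla/5.0 (X11; Linux) Chrome/96"',
    '1.2.3.5 - - [10/Oct/2023:13:55:37 +0800] "GET /a HTTP/1.1" 404 10 "-" '
    '"Mozilla/5.0 (X11; Linux) Chrome/96"',
]

# Map phase: emit the request path (field 6) for 200 responses only
mapped = [f[6] for f in (line.split() for line in log)
          if len(f) == 15 and f[8] == "200"]

# Shuffle + reduce phase: sort the keys and count duplicates, as m2.py does
counts = Counter(sorted(mapped))
print(dict(counts))  # -> {'/a': 1}  (the 404 line is filtered out)
```

Only the 200-status line survives the map phase, so the reduce phase counts a single occurrence of "/a"; the job on the cluster applies the same two stages, just distributed over the input splits.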