容器云云计算技能实验七 MapReduce编程基础

水军大提督 发表于 2024-6-9 15:08:44

云计算技能实验七 MapReduce编程基础

参考资料为：
教材代码-林子雨编著《大数据基础编程、实验和案例教程（第2版）》教材所有章节代码_厦大数据库实验室博客
1.实验学时
4学时
2.实验目标

[*]认识MapReduce编程框架。
[*]相识Map部分和Reduce部分的工作原理。
[*]实现简单的MapReduce编程。
3.实验内容
（一）实现词频统计的根本的MapReduce编程。
首先创建两个txt文件。
让后向内里输入想要统计的句子。
https://img-blog.csdnimg.cn/6db425031bf44ed8a34b5a1ed067a2bc.png
然后启动ecplise完成步调编写：
https://img-blog.csdnimg.cn/5e10eef0b41640a0a279f26e333e72e5.png
首先编写map处理逻辑：(这里选择在windows上先编写，然后在linux上再复现一次)
下面为java代码：此为map处理逻辑
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
   private static final IntWritable one = new IntWritable(1);
   private Text word = new Text();
   public TokenizerMapper() {
   }
   public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while(itr.hasMoreTokens()) {
            this.word.set(itr.nextToken());
            context.write(this.word, one);
         }
   }
} https://img-blog.csdnimg.cn/1ca5451f103d4063bc6e256b66bb9fc4.png
在map阶段，文件wordfile1.txt和文件wordfile2.txt中的数据被读入，然后以键值对的情势被提交给map函数处理。键值对交给map函数之后，就可以运行自界说的map处理逻辑。
之后编写reduce处理逻辑。
Map阶段处理得到的中间结果，经过shuffle阶段，会分发给对应的reduce任务处理。
下面为java代码，此为reduce任务处理
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   private IntWritable result = new IntWritable();
   public IntSumReducer() {
   }
   public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
         int sum = 0;
         IntWritable val;
         for(Iterator i$ = values.iterator(); i$.hasNext(); sum += val.get()) {
            val = (IntWritable)i$.next();
         }
         this.result.set(sum);
         context.write(key, this.result);
   }
} https://img-blog.csdnimg.cn/14ec27eda8894ee980f2934d4ef6ed1c.png
之后编写main函数
为了使TokenizerMapper类和IntSumReduce类可以或许正常协同工作，须要在主函数中通过job类设置hadoop步调的运行环境。
下面为java代码，此为main函数
public static void main(String[] args) throws Exception {
   Configuration conf = new Configuration();
   String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
   if(otherArgs.length < 2) {
         System.err.println("Usage: wordcount <in> [<in>...] <out>");
         System.exit(2);
   }
   Job job = Job.getInstance(conf, "word count");    //设置环境参数
   job.setJarByClass(WordCount.class);             //设置整个程序的类名
   job.setMapperClass(WordCount.TokenizerMapper.class); //添加Mapper类
   job.setReducerClass(WordCount.IntSumReducer.class);//添加Reducer类
   job.setOutputKeyClass(Text.class);                //设置输出类型
   job.setOutputValueClass(IntWritable.class);          //设置输出类型
   for(int i = 0; i < otherArgs.length - 1; ++i) {
         FileInputFormat.addInputPath(job, new Path(otherArgs));//设置输入文件
   }
   FileOutputFormat.setOutputPath(job, new Path(otherArgs));//设置输出文件
   System.exit(job.waitForCompletion(true)?0:1);
} https://img-blog.csdnimg.cn/932edecbe70c4fa0ac8e8146a16caeed.png
然后是打包步调。
先打开hadoop对应文件夹：
https://img-blog.csdnimg.cn/eba30ca9ac164c539d1bb96788506cc7.png
将代码传入文件夹之后，使用之前下载的java jar进行编译
https://img-blog.csdnimg.cn/b135df480df345b9a5501cf919a71f7f.png
编译完成之后查看文件夹，多出三个.class文件，然后进行文件的打包。
https://img-blog.csdnimg.cn/ad12ab2ddb9c4e6d866007c09ce0cbac.png
然后启动hadoop
https://img-blog.csdnimg.cn/235d4c2232b74f7cab42bf4f2618dda6.png
然后输入命令查看结果：
https://img-blog.csdnimg.cn/d8ee940b6e824277b067235445dc2729.png
结果：
https://img-blog.csdnimg.cn/194f08598a404759b26bd7d4b085dfac.png

（二）设置eclipse环境，跑词频统计的步调。
先启动eclipse
https://img-blog.csdnimg.cn/0061f25b36e54de78fc3a4429a5f9a0d.png

创建新的java工程进行编写步调
https://img-blog.csdnimg.cn/6e4231e431724fb6bb9cd929a0d1e614.png
然后导入jar包
https://img-blog.csdnimg.cn/41525bf76e5048b5802ff11da9586b91.png
https://img-blog.csdnimg.cn/0f5f2a8a163f4c4ebf5294f753efc6b3.png
https://img-blog.csdnimg.cn/aa9696f2c20a46cc8439937b791458c4.png
https://img-blog.csdnimg.cn/f9d2ba4cc5564daa8714f54fd1294443.png
然后开始编写java步调
先创建新的java类开始编写
https://img-blog.csdnimg.cn/2a1a734391cb4f33890f89db2e603d73.png
然后将之前编写号的代码输入到java文件之中，然后进行运行查看结果。
完整步调：
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public WordCount() {
}
public static void main(String[] args) throws Exception {
   Configuration conf = new Configuration();
   String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
   if(otherArgs.length < 2) {
         System.err.println("Usage: wordcount <in> [<in>...] <out>");
         System.exit(2);
   }
   Job job = Job.getInstance(conf, "word count");
   job.setJarByClass(WordCount.class);
   job.setMapperClass(WordCount.TokenizerMapper.class);
   job.setCombinerClass(WordCount.IntSumReducer.class);
   job.setReducerClass(WordCount.IntSumReducer.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(IntWritable.class);
   for(int i = 0; i < otherArgs.length - 1; ++i) {
         FileInputFormat.addInputPath(job, new Path(otherArgs));
   }
   FileOutputFormat.setOutputPath(job, new Path(otherArgs));
   System.exit(job.waitForCompletion(true)?0:1);
}
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
   private static final IntWritable one = new IntWritable(1);
   private Text word = new Text();
   public TokenizerMapper() {
   }
   public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while(itr.hasMoreTokens()) {
            this.word.set(itr.nextToken());
            context.write(this.word, one);
         }
   }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   private IntWritable result = new IntWritable();
   public IntSumReducer() {
   }
   public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
         int sum = 0;
         IntWritable val;
         for(Iterator i$ = values.iterator(); i$.hasNext(); sum += val.get()) {
            val = (IntWritable)i$.next();
         }
         this.result.set(sum);
         context.write(key, this.result);
   }
}
} https://img-blog.csdnimg.cn/01b57f37846b4afa8ee49e211f11423a.png
查看结果：
https://img-blog.csdnimg.cn/9e961e4f82674b52bb8625cc417f08da.png

结果已经出现出来，然后把java步调打包生成jar包。放到hadoop平台之上运行。
https://img-blog.csdnimg.cn/55c7f31900b5486ab5d2c2938bde9b5d.pnghttps://img-blog.csdnimg.cn/2dbbf04718f245fda97cfe745a21eab5.png
查看对应文件夹：
https://img-blog.csdnimg.cn/e99df12f5e0c409f8b6f0b6da420670d.png

然后为了运行，先启动hadoop
https://img-blog.csdnimg.cn/d91a4aac7bf64f60837b15d5c296bf03.png
然后删除input和output防止堕落
https://img-blog.csdnimg.cn/faa8f34efedd4ee69ba17a3a5557d750.png
然后新建文件
https://img-blog.csdnimg.cn/0ded193eaa03452dba4e1fea3daa097d.png
由于先前已经将对应文件传入linux，然后可以考虑现将文件上传到hdfs中的/user/Hadoop/input中
然后使用jar命令查看;
https://img-blog.csdnimg.cn/12572030cfe84cc7a0de6ce93e8c6849.png
然后查看结果：
https://img-blog.csdnimg.cn/a559bcc51fa643c1b9881bcc12877673.png
（三）编写MapReduce步调，实现计算平均结果的步调。
首先编写步调。
步调的主要点是输入三个txt文件，然后计算三个txt文件对应的科目之和的平均值。这里的导入方法与前面相似。
然后先创建新的项目，导入jar包
https://img-blog.csdnimg.cn/42095a4ed70b44309c60d6fa1eab0705.png

然后将对应的代码输入java文件中。
注意导入的包的个数题目。

然后导入对应的包：
https://img-blog.csdnimg.cn/96220cd0a2394b848b9ed1a5034131b8.png
https://img-blog.csdnimg.cn/e431ff7a91a748e8b278de1807f9c641.png
https://img-blog.csdnimg.cn/4a034a6fe9ad488f8bbc3a246bac6de4.png
https://img-blog.csdnimg.cn/67e0a7e7adc7453e861c736153ab7db5.png
然后导出jar文件。
https://img-blog.csdnimg.cn/09dad48105f84982ac20b6cffbd97a6c.png
编写对应的txt文件，设置
https://img-blog.csdnimg.cn/7638e6938cb543c58f08f5db54f7e973.png
https://img-blog.csdnimg.cn/1edbe25a9a4d48168709c2a40446ceee.png
然后输入./bin/hdfs dfs -cat output*/
注意，这里的名字不能使用单个字符，不然会报错！！！，以是背面改成了多个字符。
https://img-blog.csdnimg.cn/72e170cba80044539388a0f8ca9f9506.png
4.思考题
（一）MapReduce的工作原理是什么？
MapRedece分为两部分，一个是Map函数，一个是Reduce函数。Map函数接受一个键值对（key-value pair），产生一组中间键值对。MapReduce框架会将map函数产生的中间键值对里键相同的值传递给一个reduce函数。 Reduce函数接受一个键，以及相关的一组值，将这组值进行归并产生一组规模更小的值（通常只有一个或零个值）。
下面是一个图介绍MapReduce的工作流程：
https://img-blog.csdnimg.cn/eacc23aa8a404744bb8a7d116e63c8d2.png
MapReduce库先把user program的输入文件分别为M份（M为用户界说），每一份通常有16MB到64MB，如图左方所示分成了split0~4；然后使用fork将用户进程拷贝到集群内别的呆板上。
user program的副本中有一个称为master，其余称为worker，master是负责调度的，为空闲worker分配作业（Map作业或者Reduce作业），worker的数量也是可以由用户指定的。
被分配了Map作业的worker，开始读取对应分片的输入数据，Map作业数量是由M决定的，和split逐一对应；Map作业从输入数据中抽取出键值对，每一个键值对都作为参数传递给map函数，map函数产生的中间键值对被缓存在内存中。
缓存的中间键值对会被定期写入本地磁盘，而且被分为R个区，R的大小是由用户界说的，未来每个区会对应一个Reduce作业；这些中间键值对的位置会被通报给master，master负责将信息转发给Reduce worker。
master通知分配了Reduce作业的worker它负责的分区在什么位置（肯定不止一个地方，每个Map作业产生的中间键值对都可能映射到所有R个不同分区），当Reduce worker把所有它负责的中间键值对都读过来后，先对它们进行排序，使得相同键的键值对聚集在一起。因为不同的键可能会映射到同一个分区也就是同一个Reduce作业（谁让分区少呢），以是排序是必须的。
reduce worker遍历排序后的中间键值对，对于每个唯一的键，都将键与关联的值传递给reduce函数，reduce函数产生的输出会添加到这个分区的输出文件中。
　　当所有的Map和Reduce作业都完成了，master唤醒正版的user program，MapReduce函数调用返回user program的代码。

（二）Hadoop是如何运行MapReduce步调的？
有两个方法，这两个方法的前提是须要启动hadoop才可以运行。
方法一：
将本身的编译软件与hadoop相连（我用的是MyEclipse去链接hadoop），直接运行步调。运行完成之后在输出文件夹就可以查看输出的文件。
方法二：
方法二的话更加复杂，须要将mapreduce步调打包成jar文件。须要在linux上的eclipse编写好步调之后，将步调导出打包，之后实行这个jar文件，在输出文件中查看结果即可。
5.实验结论或体会
1.实验开始编写步调之前，须要将hadoop启动方才可以继续编写步调。
2.步调导出的时间，须要将jar文件导出到相应的hadoop步调的文件夹下，这样方便步调的运行。
3.编写步调的时间，须要将导入的包逐一对应，确保所有的包都导入到步调之中。
4.TXT文件须要提前写好，方便运行步调。

免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！更多信息从访问主页：qidao123.com:ToB企服之家，中国第一个企服评测及商务社交产业平台。

页: [1]

ToB企服应用市场:ToB评测及商务社交产业平台's Archiver

云计算技能 实验七 MapReduce编程基础

云计算技能实验七 MapReduce编程基础