I have often seen people asking on StackOverflow and other forums: how can we set the number of mappers and reducers in a Hadoop MapReduce job? Or how can we determine or calculate those numbers? I will try to answer these questions in this post.
How to determine the number of mappers?
Compared to the number of reducers, the number of mappers is relatively easy to determine but harder to control.
The number of mappers can be determined as follows:
First, determine whether the input files are splittable. Gzipped files and some other compressed formats are inherently not splittable by Hadoop. Plain text files, JSON documents, etc. are splittable.
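As a rough illustration, you could flag likely non-splittable inputs by their file extension. Note this heuristic is my own sketch for illustration, not a Hadoop API; real splittability depends on the compression codec, not just the name:

```python
# Illustrative heuristic only (not a Hadoop API): guess splittability
# from the file extension. Gzip files cannot be split; plain text and
# JSON files can.
NON_SPLITTABLE_EXTENSIONS = {".gz", ".gzip"}

def is_probably_splittable(filename: str) -> bool:
    """Return False for extensions commonly non-splittable in Hadoop."""
    lowered = filename.lower()
    return not any(lowered.endswith(ext) for ext in NON_SPLITTABLE_EXTENSIONS)
```

In a real job, Hadoop decides this per codec; this sketch only mirrors the rule of thumb stated above.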
If the files are splittable:
1. Calculate the total size of input files.
2. The number of mappers = total size calculated above / input split size defined in the Hadoop configuration (rounded up if the sizes do not divide evenly).
For example, if the total input size is 1 GB and the input split size is set to 128 MB, then:
number of mappers = (1 × 1024 MB) / 128 MB = 8 mappers.
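The calculation above can be sketched as a small helper. This is just the arithmetic from the example, with ceiling division so a leftover partial split still gets its own mapper; the actual count Hadoop produces can differ slightly because splits are computed per input file:

```python
import math

def estimate_num_mappers(total_input_bytes: int, split_size_bytes: int) -> int:
    """Estimate mapper count as ceil(total input size / split size)."""
    return math.ceil(total_input_bytes / split_size_bytes)

# 1 GB of input with a 128 MB split size -> 8 mappers
print(estimate_num_mappers(1 * 1024**3, 128 * 1024**2))
```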