
Hadoop HDFS(3.3)+Spark(3.1.1) + JupyterNotebook: Just Follow Along #3

mini_world 2021. 4. 27. 22:13

This post continues from the previous one. 😘

Hadoop HDFS(3.3)+Spark(3.1.1)! Just Follow Along #2

 


1mini2.tistory.com

 


In the previous posts #1 ~ #2, all of the infrastructure was set up.
HDFS, YARN, and Spark clusters are now running on 4 instances. 🎉🎉🎉🎉

In this step we will install Jupyter Notebook and run it. 😘
But before that! Let's check that every service is healthy!!

Now that the infrastructure is in place,
all commands from here on are run on the Master node only.

 


1. Checking the HDFS service

The status of the HDFS service can be checked through the CLI or the web console.

[ HDFS CLI ]

[root@master ~]# /usr/local/hadoop-3.3.0/bin/hdfs dfsadmin -report

 

[ HDFS WEB UI ]
- http://<master node IP>:9870

 


2. Checking the YARN service

The status of the YARN service can also be checked through the CLI or the web console.

[ YARN CLI ]

[root@master ~]# /usr/local/hadoop-3.3.0/bin/yarn node -list

 

[ YARN WEB UI ]
- http://<master node IP>:8088

 


3. Checking the Spark service

The Spark service is checked in its web UI!!

[ Spark WEB UI ]
- http://<master node IP>:8080
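If you would rather not open three browser tabs, the same reachability check can be roughed out in Python. This is only a minimal sketch: it assumes the three web UI ports listed in this post (9870, 8088, 8080), and the function names `is_port_open` and `check_web_uis` are made up for illustration. It only tells you that something is listening on the port, not that the service is healthy.

```python
import socket

# Web UI ports named in this post: HDFS, YARN, and Spark.
WEB_UI_PORTS = {"HDFS": 9870, "YARN": 8088, "Spark": 8080}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_web_uis(host, ports=WEB_UI_PORTS):
    """Map each service name to whether its web UI port accepts connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

For example, `check_web_uis("<master node IP>")` returns a dictionary such as `{"HDFS": True, "YARN": True, "Spark": False}` when only the Spark UI is unreachable.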

 


4. Installing Python 3 & Conda (Jupyter Notebook) & findspark

1) Installing Python 3

I am using the Amazon Linux 2 OS.
Python 3 comes preinstalled on it.
If it is not installed, install it with the "yum install python3 -y" command.

Check the installed python3 version.

[root@master ~]# python3 --version

 

2) Installing Jupyter Notebook

It can also be installed with miniconda (installation link), but I will install jupyter with pip, the simplest method.

[root@master ~]# pip3 install jupyter

And the installation is already done.
Run the command below; if the process starts, move on to the next step!!!

[root@master ~]# jupyter notebook --allow-root

 

3) Installing findspark

What is findspark? The findspark package makes it easy to locate the Spark installation, so that pyspark can be imported and a SparkContext (the Spark cluster endpoint) can be created from the notebook. Without this library, you would have to set up and use a separate Spark profile in Jupyter.

[root@master ec2-user]# pip3 install findspark
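To make the idea less magical, here is a rough sketch of what "finding Spark" amounts to: putting the `python/` directory of a Spark installation, plus the py4j zip that ships under `python/lib`, on `sys.path`. This is a simplified illustration and not findspark's actual code; the function name `add_pyspark_to_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Make the pyspark bundled under spark_home importable.

    Simplified illustration of the idea only, not findspark's real code.
    Returns the list of paths that were newly added to sys.path.
    """
    spark_python = os.path.join(spark_home, "python")
    if not os.path.isdir(spark_python):
        raise FileNotFoundError(f"no python/ directory under {spark_home}")

    # Spark ships py4j as a zip under python/lib; pyspark depends on it.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))

    added = []
    for path in [spark_python] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)

    os.environ["SPARK_HOME"] = spark_home
    return added
```

In practice, just use the installed package: `findspark.init(...)` with the Spark install path does this job for you.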

 


5. Running Jupyter Notebook

Before running Jupyter Notebook, edit its configuration file!
First, generate the configuration file.

[root@master ~]# jupyter notebook --generate-config

Then generate a password.

[root@master ~]# ipython

Python 3.7.9 (default, Feb 18 2021, 03:10:35)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.22.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from notebook.auth import passwd   
In [2]: passwd()
Enter password:  # enter the password you want to use
Verify password:
Out[2]: 'argon2:$argon2id$v=19$m=10240,t=1 ... '  # copy this!
In [3]: exit()

Then create a home directory for Jupyter Notebook.

[root@master ~]# mkdir /root/jupyter_dir

Now edit the configuration file.

[root@master ~]# vim /root/.jupyter/jupyter_notebook_config.py
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.password = 'argon2:$argon2id$v=19$m=10240,t=10,p=8$t+Ki3Y…'
c.NotebookApp.notebook_dir = '/root/jupyter_dir/'
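If you prefer not to type these lines by hand in vim, they can also be generated with a few lines of Python. This is a minimal sketch under a few assumptions: the helper names `render_notebook_config` and `append_config` are hypothetical, and the hash you pass in is the one you copied from `passwd()` earlier.

```python
def render_notebook_config(password_hash, notebook_dir, ip="0.0.0.0"):
    """Return the three NotebookApp settings from this post as config lines."""
    settings = {
        "c.NotebookApp.ip": ip,
        "c.NotebookApp.password": password_hash,
        "c.NotebookApp.notebook_dir": notebook_dir,
    }
    # repr() supplies matching quotes around each value.
    return "".join(f"{key} = {value!r}\n" for key, value in settings.items())


def append_config(config_path, text):
    """Append the rendered lines to the end of an existing config file."""
    with open(config_path, "a", encoding="utf-8") as f:
        f.write(text)
```

Usage would look like `append_config("/root/.jupyter/jupyter_notebook_config.py", render_notebook_config("<your hash>", "/root/jupyter_dir/"))`, where `<your hash>` stands for your own copied value.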

OK! Now let's run it!!!!

[root@master ~]# jupyter notebook --allow-root

Open http://<master node IP>:8888 (Jupyter's default port) in a web browser and you will see the login page!!
Type in the password and log in.

Nothing there yet, right!!!
Let's give findspark a try!!

We will only go as far as creating an application!!

import findspark
findspark.init('/usr/local/spark-3.1.1-bin-hadoop3.2/')
import pyspark
sc = pyspark.SparkContext(master='spark://master:7077', appName='myFirstApp')
sc

Running this code starts a Spark application.
You can also confirm in the web UI that the application is running.

 

 

🖐🖐🖐 Hold on a moment! 🖐🖐🖐

In this post, the application connects to Spark's own master (spark://master:7077) rather than to YARN, and that setup is called standalone mode 🥲
It has been a year since I published this, and I have only now realized it...

If you followed along with this post, please keep this in mind 🥲
I will study more and come back with a better post next time.

Thank you so much, cattt. 💙

 


Source: spark.apache.org/docs/latest/running-on-yarn.html
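For anyone who wants to try YARN mode instead of standalone mode, the visible difference in the notebook is the master URL passed to SparkContext. The sketch below is an illustration under assumptions, not something tested on the cluster from this series: the helper names are hypothetical, and the environment-variable check reflects the Spark documentation's requirement that HADOOP_CONF_DIR or YARN_CONF_DIR point at the Hadoop client configuration.

```python
import os


def spark_master_url(mode, host="master", port=7077):
    """Return the master URL for the given deployment mode."""
    if mode == "standalone":
        # Spark's own master, as used in this post.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # The ResourceManager address is read from the Hadoop config files.
        return "yarn"
    raise ValueError(f"unknown mode: {mode!r}")


def yarn_conf_dir_is_set(env=None):
    """Return True if HADOOP_CONF_DIR or YARN_CONF_DIR is set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("HADOOP_CONF_DIR") or env.get("YARN_CONF_DIR"))
```

In the notebook, that would mean checking `yarn_conf_dir_is_set()` first and then passing `master=spark_master_url("yarn")` to `pyspark.SparkContext` in place of the spark:// URL.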