{"id":210,"date":"2016-10-25T12:10:19","date_gmt":"2016-10-25T10:10:19","guid":{"rendered":"http:\/\/whizzkit.nl\/?p=210"},"modified":"2016-10-25T12:10:19","modified_gmt":"2016-10-25T10:10:19","slug":"running-your-r-analysis-on-a-spark-cluster","status":"publish","type":"post","link":"https:\/\/whizzkit.nl\/?p=210","title":{"rendered":"Running your R analysis on a Spark cluster"},"content":{"rendered":"<h2>Challenge<\/h2>\n<p>For one of our clients we are in the process of designing a versatile data platform that can be used, among others, to run R analysis on. In this post I&#8217;ll summarise the actions\u00a0I did\u00a0to get R running using RStudio 0.99.473, R 3.2.2 and Spark 1.4.0 hereby leveraging the potential of a Spark cluster.<\/p>\n<h2>Prerequisites<\/h2>\n<p>For the\u00a0R analysis I used my very\u00a0recent (2015) MacBook Pro with 16GB of memory and Yosemite installed. Although not tested, to the best of my knowledge you should be able to get it up and running on a recent Linux distribution and Windows as well. You will need a copy of the Spark binaries to be present on your machine. I have it installed (unzipped) in \/Users\/rutger\/Development\/spark-1.4.0-bin-hadoop2.6. 
You can download a copy (use a pre-built version) from<\/p>\n<p><a href=\"http:\/\/spark.apache.org\/downloads.html\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/spark.apache.org\/downloads.html<\/a><\/p>\n<p>I&#8217;m aware that at the time of writing Spark 1.4.1 has been released, but I used the 1.4.0 release to match our current dev cluster&#8217;s version.<\/p>\n<h3>Install R<\/h3>\n<p>Download and install the R distribution for Mac OS X from <a href=\"https:\/\/cran.r-project.org\/bin\/macosx\/\" target=\"_blank\" rel=\"nofollow noopener\">https:\/\/cran.r-project.org\/bin\/macosx\/<\/a> and install it using the regular method.<\/p>\n<h3>Install RStudio<\/h3>\n<p>Download and install RStudio for your OS from <a href=\"https:\/\/www.rstudio.com\/products\/rstudio\/download\/\" target=\"_blank\" rel=\"nofollow noopener\">https:\/\/www.rstudio.com\/products\/rstudio\/download\/<\/a><\/p>\n<h3>Working Spark cluster<\/h3>\n<p>Obviously, to actually run SparkR jobs, you will need a Spark cluster. In my case, I used an Intel NUC-based 3-node &#8216;take it with you&#8217; cluster provisioned with Apache Hadoop 2.7.0 and Spark 1.4.0. I have one Spark master called nucluster3 and three slaves called nucluster[1-3] (the master is also a slave).<\/p>\n<p>I will not go into detail on installing a development Spark cluster in this post. Getting it up and running should not give you any headaches as long as you only edit SPARK_HOME\/conf\/spark-env.sh (add the JAVA_HOME variable) and SPARK_HOME\/conf\/slaves (list all slave hostnames), and start the cluster <em>on the master node<\/em> (nucluster3 in our case) by using start-master.sh and start-slaves.sh. 
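<\/p>\n<p>For reference, here is roughly what those two files contain on my cluster. This is an untested sketch; the JAVA_HOME path in particular is an assumption, so point it at the JDK location on your own nodes:<\/p>

```
# SPARK_HOME\/conf\/spark-env.sh
export JAVA_HOME=\/usr\/lib\/jvm\/java-7-openjdk-amd64   # assumed path; adjust to your JDK

# SPARK_HOME\/conf\/slaves -- one slave hostname per line (the master is also a slave)
nucluster1
nucluster2
nucluster3
```

<p>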
You should then be greeted on <a href=\"http:\/\/spark-master:8080\/\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/nucluster3:8080<\/a> by:<\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAANdAAAAJGRmYTdlYzQzLWI0NTUtNGUxZS04NTRhLTZjZTcxYTMxYjIwMA.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-222\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAANdAAAAJGRmYTdlYzQzLWI0NTUtNGUxZS04NTRhLTZjZTcxYTMxYjIwMA-300x133.png\" alt=\"aaeaaqaaaaaaaandaaaajgrmytdlyzqzlwi0ntutnguxzs04ntrhltzjztcxytmxyjiwma\" width=\"300\" height=\"133\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAANdAAAAJGRmYTdlYzQzLWI0NTUtNGUxZS04NTRhLTZjZTcxYTMxYjIwMA-300x133.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAANdAAAAJGRmYTdlYzQzLWI0NTUtNGUxZS04NTRhLTZjZTcxYTMxYjIwMA-768x340.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAANdAAAAJGRmYTdlYzQzLWI0NTUtNGUxZS04NTRhLTZjZTcxYTMxYjIwMA.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<h2>Analyze!<\/h2>\n<p>That&#8217;s everything from the software perspective, so let&#8217;s head over to the setup and the actual running of R statements.<\/p>\n<h3>Setup<\/h3>\n<p>We need a file to work on. For this post, I used a file with zipcodes called postcode.csv. 
You may upload it to your HDFS cluster with Hue, but I did it the hardcore way: first copy it to one of the nodes and then, on that node, use the <strong>hadoop fs<\/strong> command to put it on HDFS:<\/p>\n<ul>\n<li><strong>scp postcode.csv root@nucluster1:<\/strong><\/li>\n<li><strong>hadoop fs -copyFromLocal postcode.csv \/user\/root\/<\/strong><\/li>\n<li><strong>head -2 postcode.csv<\/strong><\/li>\n<\/ul>\n<p><em>&quot;id&quot;,&quot;postcode&quot;,&quot;postcode_id&quot;,&quot;pnum&quot;,&quot;pchar&quot;,&quot;minnumber&quot;,&quot;maxnumber&quot;,&quot;numbertype&quot;,&quot;street&quot;,&quot;city&quot;,&quot;city_id&quot;,&quot;municipality&quot;,&quot;municipality_id&quot;,&quot;province&quot;,&quot;province_code&quot;,&quot;lat&quot;,&quot;lon&quot;,&quot;rd_x&quot;,&quot;rd_y&quot;,&quot;location_detail&quot;,&quot;changed_date&quot;<\/em><br \/>\n<em>395614,&quot;7940XX&quot;,79408888,7940,&quot;XX&quot;,4,12,&quot;mixed&quot;,&quot;Troelstraplein&quot;,&quot;Meppel&quot;,1082,&quot;Meppel&quot;,119,&quot;Drenthe&quot;,&quot;DR&quot;,&quot;52.7047653217626&quot;,&quot;6.1977201775604&quot;,&quot;209781.52077777777777777778&quot;,&quot;524458.25733333333333333333&quot;,&quot;postcode&quot;,&quot;2014-04-10 13:20:28&quot;<\/em><\/p>\n<p>I did not bother to strip or use the header in this case; it&#8217;s just there. So, you&#8217;re now ready to start your analysis! 
Fire up RStudio; you will be greeted by the familiar R prompt, accompanied by some warnings you may safely ignore for now.<\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAMLAAAAJGQyM2M3NDBjLWU1MWEtNDBkNi1iOTI4LWU3NmIxMDIwYzdiZQ.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-221\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAMLAAAAJGQyM2M3NDBjLWU1MWEtNDBkNi1iOTI4LWU3NmIxMDIwYzdiZQ-300x134.png\" alt=\"aaeaaqaaaaaaaamlaaaajgqym2m3ndbjlwu1mwetndbkni1ioti4lwu3nmixmdiwyzdizq\" width=\"300\" height=\"134\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAMLAAAAJGQyM2M3NDBjLWU1MWEtNDBkNi1iOTI4LWU3NmIxMDIwYzdiZQ-300x134.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAMLAAAAJGQyM2M3NDBjLWU1MWEtNDBkNi1iOTI4LWU3NmIxMDIwYzdiZQ-768x343.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAMLAAAAJGQyM2M3NDBjLWU1MWEtNDBkNi1iOTI4LWU3NmIxMDIwYzdiZQ.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Now a few things have to be set up. R needs to know where it can find the SparkR library. In my case I use the following command:<\/p>\n<p><strong>.libPaths(c(.libPaths(), &#039;\/Users\/rutger\/Development\/spark-1.4.0-bin-hadoop2.6\/R\/lib&#039;))<\/strong><\/p>\n<p>No response will be given. Your R analysis will actually be an application running on the Spark cluster, submitted by the sparkr-shell command, so R(Studio) also needs to know where it can find that command. 
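<\/p>\n<p>For convenience, here is the whole RStudio preamble from this section collected into one copy-pasteable block. The individual commands are explained in the surrounding text; the paths match my machine, so adjust them to wherever your Spark copy lives:<\/p>

```r
# Make the SparkR package that ships with Spark visible to R
.libPaths(c(.libPaths(), '\/Users\/rutger\/Development\/spark-1.4.0-bin-hadoop2.6\/R\/lib'))

# Put sparkr-shell on the PATH so RStudio can submit the Spark application
Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '\/Users\/rutger\/Development\/spark-1.4.0-bin-hadoop2.6\/bin', sep=':'))

# Have spark-submit fetch the spark-csv package before starting sparkr-shell
Sys.setenv('SPARKR_SUBMIT_ARGS'='\"--packages\" \"com.databricks:spark-csv_2.11:1.1.0\" \"sparkr-shell\"')

library(SparkR)
```

<p>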
Because a CSV file is to be read, the spark-csv package is also loaded:<\/p>\n<p><strong>Sys.setenv(PATH = paste(Sys.getenv(c(&#039;PATH&#039;)), &#039;\/Users\/rutger\/Development\/spark-1.4.0-bin-hadoop2.6\/bin&#039;, sep=&#039;:&#039;))<\/strong><\/p>\n<p><strong>Sys.setenv(&#039;SPARKR_SUBMIT_ARGS&#039;=&#039;&quot;--packages&quot; &quot;com.databricks:spark-csv_2.11:1.1.0&quot; &quot;sparkr-shell&quot;&#039;)<\/strong><\/p>\n<p>Again, these commands should not give you any response unless you make a mistake. You are now ready to load the SparkR library:<\/p>\n<p><strong>library(SparkR)<\/strong><\/p>\n<p>This will give you some feedback, something like<\/p>\n<p><em>Attaching package: &#8216;SparkR&#8217;<\/em><\/p>\n<p>followed by some remarks about masked packages; they are harmless in our case:<\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-220\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA-300x187.png\" alt=\"aaeaaqaaaaaaaauraaaajdi5mwrlzguyltvlotktngnhoc04zdixltqyy2jhyjk3mji3oa\" width=\"300\" height=\"187\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA-300x187.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA-768x478.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA-436x272.png 436w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAUrAAAAJDI5MWRlZGUyLTVlOTktNGNhOC04ZDIxLTQyY2JhYjk3MjI3OA.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>The first thing we&#8217;ll have to do is create 
two contexts; the Spark context and the SparkRSQL context:<\/p>\n<p><strong>sc &lt;- sparkR.init(master = &quot;spark:\/\/nucluster3:7077&quot;, appName=&quot;SparkR&quot;)<\/strong><\/p>\n<p><strong>sqlContext &lt;- sparkRSQL.init(sc)<\/strong><\/p>\n<p>The first command produces a fair amount of output but should complete without warnings; the second one should finish silently. Remember the postcodes I put on HDFS? We are now ready to load them into a dataframe. nucluster1 is the cluster&#8217;s namenode, by the way.<\/p>\n<p><strong>postcodes &lt;- read.df(sqlContext, &quot;hdfs:\/\/nucluster1:9000\/user\/root\/postcode.csv&quot;, source = &quot;com.databricks.spark.csv&quot;)<\/strong><\/p>\n<p>This command should again finish successfully, with a fair amount of output. You are now ready to do your R magic, but in the examples below I will stick to <em>very<\/em> simple ones.<\/p>\n<h3>Plain R commands<\/h3>\n<p><strong>head(postcodes)<\/strong><\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAKoAAAAJDVlY2Y4ZjgyLTIwOTUtNGFmNy05NDExLTk0YzFkOWIwYWRmZQ.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-219\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAKoAAAAJDVlY2Y4ZjgyLTIwOTUtNGFmNy05NDExLTk0YzFkOWIwYWRmZQ-300x180.png\" alt=\"aaeaaqaaaaaaaakoaaaajdvly2y4zjgyltiwotutngfmny05ndexltk0yzfkowiwywrmzq\" width=\"300\" height=\"180\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAKoAAAAJDVlY2Y4ZjgyLTIwOTUtNGFmNy05NDExLTk0YzFkOWIwYWRmZQ-300x180.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAKoAAAAJDVlY2Y4ZjgyLTIwOTUtNGFmNy05NDExLTk0YzFkOWIwYWRmZQ-768x460.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAKoAAAAJDVlY2Y4ZjgyLTIwOTUtNGFmNy05NDExLTk0YzFkOWIwYWRmZQ.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" 
\/><\/a><\/p>\n<p><strong>count(postcodes)<\/strong><\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAASkAAAAJDIwYTA3NDEyLTNkOTAtNGUxZC1hZjAyLWY4MjI1MzM0MDY4ZA.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-218\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAASkAAAAJDIwYTA3NDEyLTNkOTAtNGUxZC1hZjAyLWY4MjI1MzM0MDY4ZA-300x71.png\" alt=\"aaeaaqaaaaaaaaskaaaajdiwyta3ndeyltnkotatnguxzc1hzjaylwy4mji1mzm0mdy4za\" width=\"300\" height=\"71\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAASkAAAAJDIwYTA3NDEyLTNkOTAtNGUxZC1hZjAyLWY4MjI1MzM0MDY4ZA-300x71.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAASkAAAAJDIwYTA3NDEyLTNkOTAtNGUxZC1hZjAyLWY4MjI1MzM0MDY4ZA-768x182.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAASkAAAAJDIwYTA3NDEyLTNkOTAtNGUxZC1hZjAyLWY4MjI1MzM0MDY4ZA.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>A slightly more difficult example is to count by one of the fields. 
In this case only the first few lines are returned:<\/p>\n<p><strong>head(summarize(groupBy(postcodes, postcodes$C3), count = n(postcodes$C3)))<\/strong><\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-217\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-300x128.png\" alt=\"aaeaaqaaaaaaaaxdaaaajgm2ndcxzdq5lthkntktngvims1injq0lwm3zmfhothlzmu1ng\" width=\"300\" height=\"128\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-300x128.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-768x329.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng.png 799w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Please note that without the <strong>head<\/strong> command nothing will actually be executed, because Spark evaluates lazily; only the return type will be displayed:<\/p>\n<p><em>DataFrame[C3:string, count:bigint]<\/em><\/p>\n<h3>SQL commands<\/h3>\n<p>One of the powerful features of this setup is the ability to use SQL in your commands. 
For this to work you&#8217;ll first have to create a temporary table definition.<\/p>\n<p><strong>registerTempTable(postcodes, &quot;postcodes&quot;)<\/strong><\/p>\n<p><strong>amsterdam &lt;- sql(sqlContext, &quot;SELECT C1,C8 FROM postcodes WHERE C9 = &#039;Amsterdam&#039;&quot;)<\/strong><br \/>\n<strong>head(amsterdam)<\/strong><\/p>\n<p>It may sound boring, but the first command will stay silent and the second one will return the query result (all Amsterdam zipcodes):<\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-217\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-300x128.png\" alt=\"aaeaaqaaaaaaaaxdaaaajgm2ndcxzdq5lthkntktngvims1injq0lwm3zmfhothlzmu1ng\" width=\"300\" height=\"128\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-300x128.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng-768x329.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAXDAAAAJGM2NDcxZDQ5LThkNTktNGViMS1iNjQ0LWM3ZmFhOThlZmU1Ng.png 799w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Of course, you are able to write much more CPU- and disk-intensive queries. This totally bogus one puts the cluster to work for a while:<\/p>\n<p><strong>count(sql(sqlContext, &quot;SELECT p1.C1,p1.C8 FROM postcodes p1, postcodes p2 WHERE p1.C1&lt;p2.C1&quot;))<\/strong><\/p>\n<p>Note that all activity is done on the <em>cluster<\/em>, not your own machine! Your machine is the job driver and <em>only<\/em> serves as such. 
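<\/p>\n<p>As one more (untested) sketch: given the CSV header shown earlier and spark-csv&#8217;s default C0-based column numbering (id = C0, postcode = C1, &#8230;), C13 should correspond to the province column, so counting zipcodes per province would look something like this:<\/p>

```r
# Hypothetical query: zipcodes per province.
# C13 = province is an assumption derived from the CSV header above.
provinces <- sql(sqlContext, \"SELECT C13, COUNT(*) AS cnt FROM postcodes GROUP BY C13 ORDER BY cnt DESC\")
head(provinces)
```

<p>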
But you can view a lot of information about your job: take a look at <a href=\"http:\/\/localhost:4040\/jobs\/\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/localhost:4040\/jobs\/<\/a>\u00a0and find your own running one.<\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAQFAAAAJDBkODhlNjdiLWMyNzAtNGVkMi1hYWQwLTNjMDUxNTlhMzU1OQ.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-215\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAQFAAAAJDBkODhlNjdiLWMyNzAtNGVkMi1hYWQwLTNjMDUxNTlhMzU1OQ-300x117.png\" alt=\"aaeaaqaaaaaaaaqfaaaajdbkodhlnjdilwmynzatngvkmi1hywqwltnjmduxntlhmzu1oq\" width=\"300\" height=\"117\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAQFAAAAJDBkODhlNjdiLWMyNzAtNGVkMi1hYWQwLTNjMDUxNTlhMzU1OQ-300x117.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAQFAAAAJDBkODhlNjdiLWMyNzAtNGVkMi1hYWQwLTNjMDUxNTlhMzU1OQ-768x300.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAQFAAAAJDBkODhlNjdiLWMyNzAtNGVkMi1hYWQwLTNjMDUxNTlhMzU1OQ.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>In the screenshot above you will see the &#8216;sort-of-cartesian-product&#8217; I created hogging the system with a huge input set. Feel free to look around jobs, stages (note the beautiful DAG Visualization\u00a0Spark provides) and storage information. 
You may also check out the Spark Master and note your running application named SparkR at <a href=\"http:\/\/nucluster3:8080\/\" target=\"_blank\" rel=\"nofollow noopener\">http:\/\/nucluster3:8080\/<\/a><\/p>\n<p><a href=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAaHAAAAJGY2Nzk3MzYxLTMwNTYtNGE4Zi05MTA2LTNkMWFjYTZkNWViOQ.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-214\" src=\"http:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAaHAAAAJGY2Nzk3MzYxLTMwNTYtNGE4Zi05MTA2LTNkMWFjYTZkNWViOQ-300x24.png\" alt=\"aaeaaqaaaaaaaaahaaaajgy2nzk3mzyxltmwntytnge4zi05mta2ltnkmwfjytzknwvioq\" width=\"300\" height=\"24\" srcset=\"https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAaHAAAAJGY2Nzk3MzYxLTMwNTYtNGE4Zi05MTA2LTNkMWFjYTZkNWViOQ-300x24.png 300w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAaHAAAAJGY2Nzk3MzYxLTMwNTYtNGE4Zi05MTA2LTNkMWFjYTZkNWViOQ-768x62.png 768w, https:\/\/whizzkit.nl\/wp-content\/uploads\/2016\/10\/AAEAAQAAAAAAAAaHAAAAJGY2Nzk3MzYxLTMwNTYtNGE4Zi05MTA2LTNkMWFjYTZkNWViOQ.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>I&#8217;m done, so I quit RStudio using the well-known q() function. You may choose whether to save your workspace or not!<\/p>\n<p><strong>q()<\/strong><\/p>\n<h2>Conclusions and final remarks<\/h2>\n<p>Analysing your data using familiar R while at the same time leveraging the power of a Spark cluster is a breeze once you get the hang of it. I mainly focussed on &#8216;<em>getting it up and running<\/em>&#8217; and not on &#8216;<em>beautiful R apps<\/em>&#8217;. Go and knock yourself out with libraries like Shiny (http:\/\/shiny.rstudio.com\/)!<\/p>\n<h3>Remarks<\/h3>\n<p>Of course you always find out after writing a blog post that you have missed something trivial. 
In the case of loading the CSV, spark-csv <em>will<\/em> give you proper column names for free if you load it using:<\/p>\n<p><strong>postcodes &lt;- read.df(sqlContext, &quot;hdfs:\/\/nucluster1:9000\/user\/root\/postcode.csv&quot;, source = &quot;com.databricks.spark.csv&quot;, header = &quot;true&quot;)<\/strong><\/p>\n<p>Apart from that, some minor annoyances popped up:<\/p>\n<ul>\n<li>I noticed that the cluster nodes used only one core instead of the four they have. I must have overlooked something during configuration and will update this post when I find the solution.<\/li>\n<li>Apart from that, I would really like Spark to be less chatty (get rid of the INFO messages) and will try to create a log4j.properties file for that.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Challenge For one of our clients we are in the process of designing a versatile data platform that can be used, among other things, to run R analyses on. In this post I&#8217;ll summarise the steps I took to get R running using RStudio 0.99.473, R 3.2.2 and Spark 1.4.0, thereby leveraging the potential of a Spark cluster. 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":213,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/posts\/210"}],"collection":[{"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=210"}],"version-history":[{"count":3,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/posts\/210\/revisions"}],"predecessor-version":[{"id":224,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/posts\/210\/revisions\/224"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=\/wp\/v2\/media\/213"}],"wp:attachment":[{"href":"https:\/\/whizzkit.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/whizzkit.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}