First off, I’m sorry it’s taken so long to get to part two of this. The machine I was running this VM on had only 8GB of RAM, and the VM seems to run best with 8GB all to itself. Clearly that wasn’t going to work, so a memory upgrade was warranted; the host now has 16GB of RAM and everything is running nicely.
I was able to move into the first tutorial, and I have to say, from the standpoint of someone who has been in the MS Windows world, it was a bit of an awakening to be back on the command line.
When you launch the VM using VMware Player, you’ll need to go into VMware Player’s configuration section and set the memory value to 8GB; then the image will run nicely. I’ve skipped a screenshot of that, assuming everyone will figure it out.
When the image is launched, Firefox will open and the tutorial will begin. Read through the first couple of pages and the first exercise starts. The first exercise is an introduction to using Sqoop (pronounced “scoop”). Sqoop is a tool that imports a relational database into Hadoop, both schema and data.
The tutorial starts out by asking you to create a secure connection to the master node using SSH. Open the Terminal window and type the connection command from the tutorial.
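For a sense of its shape, an SSH connection to the master node generally looks like this (the address here is a placeholder, not the tutorial’s actual value):

ssh cloudera@&lt;master-node-address&gt;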
The Terminal window will prompt for the password, which is cloudera.
Once you’re in, running the first command is a snap. Copy and paste it from the tutorial, and it will import the schema and tables from the local MySQL database. I’m not going to copy the command or its output here; you’ll have to try it and see.
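To give a rough idea without spoiling the tutorial, a generic Sqoop whole-database import follows this shape (the host, database, and credentials here are placeholders, not the tutorial’s actual values):

sqoop import-all-tables \
  --connect jdbc:mysql://&lt;host&gt;/&lt;database&gt; \
  --username &lt;user&gt; --password &lt;password&gt; \
  --warehouse-dir /user/hive/warehouse

The --warehouse-dir option tells Sqoop where in HDFS to put the imported table directories, which is why the listing commands below look there.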
The import will take a bit of time because it’s creating the file structure and bringing in the data from MySQL. Once the execution halts, continue with the tutorial to determine whether the files are indeed populated.
hadoop fs -ls /user/hive/warehouse
This command will show you the file structure at the table level that’s been imported. You should see a directory for each table in the source database.
hadoop fs -ls /user/hive/warehouse/categories/
Run that to see a representation of the row data that was contained in the MySQL database. Note that you don’t see the source data itself, and I don’t yet know what the folder structure represents; I’ll assume that’s covered in a later tutorial.
Here is something super-new to me. It’s a concept! Think about it this way (the tutorial sort of walks you through it too): what’s happening during all of this importing is that the data comes in and gets somewhat ripped apart. The relationships we know to be true and defined in an RDBMS aren’t there in Hadoop; it’s actually just data in files. This is a revolutionary concept to me because I’ve spent years defining structures to hold data, and worked tirelessly to ensure its consistency, speed of delivery, etc. You’ll have to bear with me while my head spins a bit.
OK, head spinning aside, let’s get back to it. Now that the data has been imported, it isn’t available to do anything with yet. You run some more Hadoop filesystem commands to make a Hadoop directory and then copy the data in. Again, see the actual tutorial for those; copying them here isn’t helpful.
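The general shape of those filesystem commands is something like this (the directory and file names are placeholders, not the tutorial’s actual paths):

hadoop fs -mkdir /user/cloudera/&lt;new-dir&gt;
hadoop fs -put &lt;local-file&gt; /user/cloudera/&lt;new-dir&gt;/
hadoop fs -ls /user/cloudera/&lt;new-dir&gt;

The -mkdir and -put subcommands mirror their Unix counterparts, and the -ls at the end confirms the copy landed where you expect.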
Those commands run pretty quickly, and then it’s time to launch Hive.
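Launching the Hive shell is just a matter of typing its name at the prompt:

hive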
And now it gets into creating the structures for the data. Again, my head’s spinning, but I get the concept. Say you have log files in .csv or .txt or some other format, and you need to bring them in and be able to mine them, and others like them, for information. Without some sort of organizing step, that would be next to impossible and would always, or almost always, yield different query results.
There’s a script in the tutorial to create the table structures, along with commands to validate that they are there.
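For illustration only, a Hive table defined over delimited files generally looks like this, typed at the hive prompt (the table name, columns, and location are made up for this example, not taken from the tutorial’s script):

CREATE EXTERNAL TABLE example_logs (
  log_time STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/&lt;new-dir&gt;';

SHOW TABLES;

The EXTERNAL keyword means Hive lays a schema over files that already live in HDFS rather than taking ownership of them, which is exactly the “organization over raw files” idea the tutorial is driving at; SHOW TABLES is one way to validate the structures exist.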
And that pretty much sums up my experience with the quickstart tutorial today.