Showing posts with label kettle. Show all posts

Friday, November 8, 2013

Importing data from a CSV file into a MongoDB database using Kettle

In a previous post we saw how to import data into a MongoDB database from a CSV file using mongoimport. In this post we will import data into a MongoDB database from a CSV file using Kettle.

This YouTube video shows how to import data from a text file into MongoDB:
http://www.youtube.com/watch?v=Tgyrd1UiQhE

The screen below shows the configuration of the CSV input step.


The screen below shows the configuration of the MongoDB output step.



The screen below shows the transformation from the CSV file input to the MongoDB output after it was created and run.
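A transformation saved from Spoon can also be run without the GUI using Pan, the command-line runner that ships in the same data-integration directory. The sketch below is only an illustration: the transformation file name csv_to_mongodb.ktr and the PDI_DIR location are hypothetical and need to be adjusted to wherever the files actually live.

```shell
# Hypothetical locations: adjust both to your own setup.
KTR_FILE="${KTR_FILE:-$HOME/csv_to_mongodb.ktr}"   # transformation saved from Spoon (assumed name)
PDI_DIR="${PDI_DIR:-./data-integration}"           # directory that contains pan.sh

if [ -x "$PDI_DIR/pan.sh" ] && [ -f "$KTR_FILE" ]; then
    # Pan runs a single transformation; -level controls log verbosity.
    (cd "$PDI_DIR" && ./pan.sh -file="$KTR_FILE" -level=Basic)
    PAN_STATUS=$?
else
    echo "pan.sh or $KTR_FILE not found; skipping" >&2
    PAN_STATUS=127
fi
echo "pan exit status: $PAN_STATUS"
```

A non-zero exit status from Pan indicates that the transformation did not complete cleanly.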



Next we will connect to MongoDB to check that the data has indeed been imported. Connect using the mongo shell:

> use mydbdemo
switched to db mydbdemo
> db.projects.find().toArray()
[
        {
                "_id" : ObjectId("527d4e5e73ca0be1b2bc2504"),
                "ISV Name" : "Oracle",
                "Storage Brand" : "SVC",
                "Project" : "SVC Chubby nodes and Oracle",
                "Owner" : "Mayur"
        },
        {
                "_id" : ObjectId("527d4e5e73ca0be1b2bc2505"),
                "ISV Name" : "Oracle",
                "Storage Brand" : "IBM Tape",
                "Project" : "Oracle backup",
                "Owner" : "Shashank"
        },
        {
                "_id" : ObjectId("527d4e5e73ca0be1b2bc2506"),
                "ISV Name" : "Oracle",
                "Storage Brand" : "V7000",
                "Project" : "Oracle 12c on V7000",
                "Owner" : "Mayur"
        },
        {
                "_id" : ObjectId("527d4e5e73ca0be1b2bc2507"),
                "ISV Name" : "Oracle",
                "Storage Brand" : "DS8870",
                "Project" : "Easy Tier 5 for Oracle",
                "Owner" : "Mayur"
        }
]
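
For reference, a CSV file along the following lines would produce the four documents above. This is a reconstruction from the query output, not the original input file: the file name projects.csv, the comma delimiter and the header row are all assumptions.

```shell
# Recreate a CSV file matching the documents shown above.
# Assumptions: file name, comma delimiter, and a header row that names the fields.
cat > projects.csv <<'EOF'
ISV Name,Storage Brand,Project,Owner
Oracle,SVC,SVC Chubby nodes and Oracle,Mayur
Oracle,IBM Tape,Oracle backup,Shashank
Oracle,V7000,Oracle 12c on V7000,Mayur
Oracle,DS8870,Easy Tier 5 for Oracle,Mayur
EOF

# Sanity check: one header line plus four data rows.
wc -l < projects.csv
```

The header names become the field names of each MongoDB document, which is why the keys in the output contain spaces.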

We can also import data directly from a Microsoft Excel spreadsheet into MongoDB using Kettle.

Installing Pentaho Kettle on RedHat Linux

I downloaded Pentaho Kettle from http://sourceforge.net/projects/pentaho/files/Data%20Integration/4.4.0-stable/

[root@isvx3 ~]# java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
[root@isvx3 ~]#
[root@isvx3 ~]# cat /proc/version
Linux version 2.6.18-348.16.1.el5 (mockbuild@x86-012.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-54)) #1 SMP Sat Jul 27 01:05:23 EDT 2013
[root@isvx3 ~]#
[root@isvx3 ~]# uname -m
x86_64

Untarring pdi-ce-4.4.0-stable.tar created the data-integration directory:

[root@isvx3 Desktop]# cd data-integration/
[root@isvx3 data-integration]# ls
Carte.bat                    libswt
carte.sh                     Pan.bat
Data Integration 32-bit.app  pan.sh
Data Integration 64-bit.app  plugins
docs                         plugins_old
Encr.bat                     pwd
encr.sh                      README_INFOBRIGHT.txt
generateClusterSchema.sh     README_LINUX.txt
hs_err_pid10080.log          README_OSX.txt
hs_err_pid10405.log          README_UNIX_AS400.txt
hs_err_pid3841.log           run_kettle_cluster_example.bat
hs_err_pid4875.log           runSamples.sh
hs_err_pid7690.log           samples
Import.bat                   set-pentaho-env.bat
import-rules.xml             set-pentaho-env.sh
import.sh                    simple-jndi
Kitchen.bat                  Spoon.bat
kitchen.sh                   spoon.ico
launcher                     spoon.png
lib                          spoon.sh
libext                       ui


Running spoon.sh resulted in a SIGSEGV:

[root@isvx7 data-integration]# ./spoon.sh
/root/Desktop/data-integration
WARN  08-11 11:41:00,110 - Unable to load Hadoop Configuration from "file:///root/Desktop/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr". For more information enable debug logging.
INFO  08-11 11:41:00,139 - Spoon - Logging goes to file:///tmp/spoon_5011469f-48a5-11e3-bce7-cf39551697e4.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=30616, tid=48006614890816
#
# JRE version: 6.0_22-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode linux-amd64 )
# Problematic frame:
# C  0x0000000000000000
#
# An error report file with more information is saved as:
# /root/Desktop/data-integration/hs_err_pid30616.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
./spoon.sh: line 163: 30616 Aborted                 "$_PENTAHO_JAVA" $OPT $STARTUP -lib $LIBPATH "${1+$@}"
[root@isvx7 data-integration]#


I found the fix for this issue in a Stack Overflow post: http://stackoverflow.com/questions/15943531/jdk-fatal-error-when-launching-pentaho-spoon-on-centos

[root@isvx7 data-integration]# export OPT="-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null"
[root@isvx7 data-integration]# ./spoon.sh
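
The export above only lasts for the current shell session. One way to avoid retyping it is a small snippet such as the following, placed in a shell startup file or a wrapper script. This is a sketch, not part of the Kettle distribution: the variable name XUL_FIX is made up for illustration, and it assumes that spoon.sh picks up OPT from the environment, as the export above does.

```shell
# Hypothetical helper: add the XULRunner workaround to OPT exactly once,
# keeping any Java options that are already set.
XUL_FIX="-Dorg.eclipse.swt.browser.XULRunnerPath=/dev/null"

case " ${OPT:-} " in
    *" $XUL_FIX "*) ;;                   # already present, nothing to do
    *) OPT="${OPT:+$OPT }$XUL_FIX" ;;    # append, preserving existing options
esac
export OPT
echo "OPT=$OPT"

# Then start Spoon from the data-integration directory:
# ./spoon.sh
```

Running the snippet twice leaves OPT unchanged the second time, so it is safe to source from a startup file.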