
Load 8 Million Rows in 88 Seconds – NewSQL Speed

At some point, any data modernization project is going to require a load of legacy data. With an in-memory, distributed data store like SQLFire, customers often ask (as in this case) about load times, because they can be sitting on 50-100GB of data and don’t want to wait days. For those unfamiliar with NewSQL databases, this post should give you a good sense of how we loaded 8 million rows in 88 seconds. The test suggests we should be able to load roughly 40GB of data in about 1 hour.
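As a rough sanity check on that one-hour figure (a back-of-envelope sketch only; it assumes load time scales linearly with the in-memory footprint the two nodes report later in this post):

```python
# Back-of-envelope check: extrapolate the 8-million-rows-in-88-seconds result.
rows_loaded = 8_000_000
load_seconds = 88

# Combined in-memory footprint reported for the two nodes later in this post (MB).
footprint_mb = 636.55 + 730.27

# Assume load time scales linearly with in-memory data volume (a simplification).
target_gb = 40
est_seconds = (target_gb * 1024 / footprint_mb) * load_seconds
est_minutes = est_seconds / 60

print(f"{rows_loaded / load_seconds:,.0f} rows/sec")
print(f"Estimated time for {target_gb} GB: {est_minutes:.0f} minutes")
```

At roughly 91,000 rows per second, the two nodes' combined footprint of about 1.37GB loads in 88 seconds, so 40GB extrapolates to around 44 minutes, comfortably inside the one-hour estimate.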

For Java developers who want ideas about speeding up large volumes of calculations, transforms, or validations, you may want to consider a previous post, where SQLFire is used with Spring Batch.

With SQLFire, we take a multi-threaded approach to loading from a CSV file. Below, I’ve outlined the load strategy in 8 steps, with an explanation of why each step was done.

Please note:

  • This is based on SQLFire 1.0.3.
  • In this setup we have 2 VMs, each with 18GB of memory, sharing the same home directory but with separate disks for their own disk stores and log files.

1. Set up SQLFire in your PATH as shown below.

export PATH=/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/bin:$PATH

2. On VM1, start a locator as shown below.

sqlf locator start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -peer-discovery-address=w2-gst-cert-21 -peer-discovery-port=41111 \
  -client-bind-address=w2-gst-cert-21 -client-port=1527 \
  -dir=/w2-gst-cert-21a/users/papicella/locator &

3. On both VM1 and VM2, start a SQLFire server node as shown below (2 nodes in total).
Each VM has 18GB of memory; VM1 is where the locator process was started. The listings below show the scripts you need to run to start the servers.

[Thu Jul 12 21:24:47 papicella@:~/pas/demo/load-demo ] $ cat start-server-w2-gst-cert-20.sh
sqlf server start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -server-groups=MYGROUP -initial-heap=8024m -max-heap=8024m \
  -locators=w2-gst-cert-21[41111] \
  -client-bind-address=w2-gst-cert-20 -client-port=1527 \
  -dir=/w2-gst-cert-20a/users/papicella/server1 &

[Thu Jul 12 21:27:19 papicella@:~/pas/demo/load-demo ] $ cat start-server-w2-gst-cert-21.sh
sqlf server start \
  -J-javaagent:/export/w2-gst-cert-20a/users/swale/sqlfire/build-artifacts/linux/product-sqlf/lib/sqlfire.jar \
  -server-groups=MYGROUP -initial-heap=8024m -max-heap=8024m \
  -locators=w2-gst-cert-21[41111] \
  -client-bind-address=w2-gst-cert-21 -client-port=1528 \
  -dir=/w2-gst-cert-21a/users/papicella/server2 &

4. Create a disk store using SQL as shown below. (The original statement was lost; a minimal form, using a hypothetical store name STORE1, would be:)

create diskstore STORE1;

5. Set a global eviction policy for tables in the group “MYGROUP”.

call sys.set_eviction_heap_percentage_sg (85, 'MYGROUP');

6. Create a table as shown below.
We are using a separate disk store for persistence. Because persistence is enabled, any overflowed data is simply dropped from memory at eviction time (its value is already on disk) rather than written out then. The table writes its changes to the disk store asynchronously, so an insert only has to complete the in-memory write before moving on to the next change.

drop table emp;

create table emp
(empno integer NOT NULL primary key,
ename varchar(20),
hiredate date,
deptno int)
partition by primary key
server groups (MYGROUP)
eviction by lruheappercent evictaction overflow
-- 'STORE1' is a hypothetical name for the disk store from step 4; these trailing clauses are reconstructed from the description above
persistent 'STORE1' asynchronous;

7. Pre-create the buckets ready for loading data by running the SQL shown below.
If this isn’t done, SQLFire creates buckets automatically the first time an insert targets a bucket that doesn’t yet exist.

call sys.create_all_buckets('APP.EMP');

Note: Alternatively, you can rebalance the buckets after the load, as shown below.

call sys.rebalance_all_buckets();

8. Run a script to load the data; the output shows it took 88 seconds to load 8 million rows into the 2-node cluster.
In this example we load a CSV file named “EMP.dat”, which contains 8 million rows of data as follows and is 371MB on disk.


1,LkrpCkpGQpsUvZrFWAic,2012-02-29 05:35:10,2288856
2,gfbuiiWcUxoDOgeHuth,2012-01-10 12:57:18,1243600
3,GnR,2012-03-04 20:47:06,1533872
4,uxaGakuW,2012-03-17 20:57:34,2451659
5,coIrAvnPtvOZJirnea,2012-01-29 15:36:20,3731769
6,FJvnIjy,2012-03-24 07:37:59,4137455
7,rjnWsfbLldRtUSBDdZew,2012-01-24 13:09:22,3416189
8,DpbvPBkalbuZbBItgoJp,2012-03-28 13:41:56,6497992
9,jaba,2012-01-30 00:56:13,151525
10,EkDdiPeHBn,2012-04-02 12:05:25,630473
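If you want to reproduce a test file in the same shape, something like the following would work (a hypothetical generator, not the one used for this post; column widths and the date range are chosen to match the sample rows above):

```python
import csv
import random
import string
from datetime import datetime, timedelta

def generate_emp_dat(path, rows):
    """Write a CSV in the same shape as EMP.dat: empno,ename,hiredate,deptno."""
    start = datetime(2012, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for empno in range(1, rows + 1):
            # Random name of 3-20 letters, like the sample rows
            ename = "".join(random.choices(string.ascii_letters,
                                           k=random.randint(3, 20)))
            # Random timestamp in early 2012, formatted like the sample data
            hiredate = start + timedelta(seconds=random.randint(0, 120 * 86400))
            writer.writerow([empno, ename,
                             hiredate.strftime("%Y-%m-%d %H:%M:%S"),
                             random.randint(1, 7_000_000)])
```

Generating the full 8 million rows takes a few minutes in pure Python; the point here is the file shape, not generation speed.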

The “EMP.dat” file exists only on VM2, so we need to connect to that SQLFire server node and make it solely responsible for the load. To do that we set load-balance=false, and we set read-timeout=0 so the client doesn’t time out if the load takes longer than 5 minutes (the default).

loadEmp.sh Script

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
0 /* don't lock the table */,
6 /* number of threads */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */);
!

Note: Six threads is more than enough here; setting the thread count too high can have an adverse effect. Six proved to be the sweet spot for this multi-threaded load test.

[Thu Jul 12 19:03:23 papicella@:/w2-gst-cert-21a/users/papicella/data ] $ ./loadEMP.sh
sqlf version 10.4
sqlf> sqlf> sqlf> > > > > > > > > > > Statement executed.
ELAPSED TIME = 88324 milliseconds
[Thu Jul 12 19:04:58 papicella@:/w2-gst-cert-21a/users/papicella/data ] $

Finally, run a query as follows to confirm the data has indeed been loaded and to see the memory overhead of storing 8,000,000 rows on each of the two nodes.

sqlf> select sqlentity, memory, substr(id, 1, 50) "Id" FROM sys.memoryAnalytics -- SQLFIRE-PROPERTIES sizerHints=withMemoryFootPrint;
APP.EMP (Entry Size, Value Size, Row Count) | 832000000,204019274,8000000 | w2-gst-cert-21(10674)<v2>:2810/48871

APP.EMP (sqlfire,gemfire,others) | 241842808,229176368,196449984 (667469160 = 636.55 mb) | w2-gst-cert-21(10674)<v2>:2810/48871

APP.EMP (Entry Size, Value Size, Row Count) | 832000000,204019274,8000000 | w2-gst-cert-20(8550)<v1>:35007/36633

APP.EMP (sqlfire,gemfire,others) | 277524448,262689608,225528376 (765742432 = 730.27 mb) | w2-gst-cert-20(8550)<v1>:35007/36633

4 rows selected
ELAPSED TIME = 149673 milliseconds
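To total the table’s footprint across the two nodes from sys.memoryAnalytics output like the above (a quick parsing sketch; the pipe-separated layout is copied from the rows shown here):

```python
# Sum the per-node table footprint from sys.memoryAnalytics-style rows.
rows = [
    "APP.EMP (sqlfire,gemfire,others) | 241842808,229176368,196449984 (667469160 = 636.55 mb) | w2-gst-cert-21(10674)<v2>:2810/48871",
    "APP.EMP (sqlfire,gemfire,others) | 277524448,262689608,225528376 (765742432 = 730.27 mb) | w2-gst-cert-20(8550)<v1>:35007/36633",
]

total_bytes = 0
for row in rows:
    middle = row.split("|")[1]            # "241842808,229176368,196449984 (667469160 = 636.55 mb)"
    counts = middle.split("(")[0].strip() # "241842808,229176368,196449984"
    total_bytes += sum(int(n) for n in counts.split(","))

print(f"Total footprint: {total_bytes / 1024 / 1024:.2f} MB for 8,000,000 rows")
```

Across both nodes the table costs about 1.37GB of heap, i.e. roughly 170 bytes per row all-in.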

More Information

A. Row Overhead

Currently, SQLFire’s overhead for a persistent entry is around 100 bytes, which adds up to roughly 800MB for 8,000,000 rows. This needs to be factored in when sizing memory for tables that use persistence.
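The arithmetic behind that figure (a sketch; 100 bytes per entry is the approximate overhead quoted above):

```python
row_count = 8_000_000
overhead_per_row = 100  # approximate per-entry overhead, in bytes, for a persistent table

overhead_bytes = row_count * overhead_per_row
print(f"Overhead: {overhead_bytes:,} bytes (~{overhead_bytes / 1e6:.0f} MB)")
```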

B. Number of threads to load with

When using syscs_util.import_table_ex it is best not to use more than 6 threads for the load; adding more than 6 yields no further benefit and can hurt. The example below uses 6 threads to load the data into SQLFire.

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
0 /* don't lock the table */,
6 /* number of threads */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */);
!

C. Connecting as a client where the data file exists

In the example below we ensure that long-running load clients don’t time out. By default they time out after 5 minutes, so we set read-timeout=0 in the connect string. At the same time, we want the load performed only by the server where the data file exists, so we connect directly to that node and set load-balance=false to ensure the connection isn’t routed to a node that can’t see the file.

sqlf <<!
connect client 'w2-gst-cert-21:1528;load-balance=false;read-timeout=0';
call syscs_util.import_table_ex('APP' /* schema */,
'EMP' /* table */,
'/w2-gst-cert-21a/users/papicella/data/EMP.dat' /* file path as seen by server */,
',' /* field separator */,
0 /* don't lock the table */,
6 /* number of threads */,
NULL /* custom class for data transformation or NULL to use the default inbuilt Import class */);
!

About the Author: Pas Apicella has been in the IT industry for 16 years. He graduated from RMIT University Melbourne with a Bachelor of Computer Science. He started out writing PL/SQL web-based packages for an online internet directories site, then moved to Java, spending 11 years with Oracle in various roles. Pas has been with VMware since 2011 and works as a systems engineer on the vFabric Cloud Application Platform team.
