2005-02-10

The Thirty Million Question

Recently I went for an interview where I was asked this question: how do you load thirty million entries from a file into a database within 4 hours?

I was puzzled about its significance. I have been using postgresql, and loading 30 million entries is nothing special at all.


It turned out they were using MS SQL. I was not sure what the deal was with batch-loading 30 million entries into MS SQL, as I have minimal experience with that product. They cited a problem they had encountered: apparently, they could not load 30 million entries within 4 hours without resorting to a technique which they expected me to name. They also said there was a problem with the transaction log overflowing. Beats me, but I don't expect MS SQL to have such problems, especially not on the high-end database machine they had (24 CPUs and, I assume, an equally impressive disk system and memory).

I was not able to come up with the answer they were looking for, which was to break the data into sections and bulk-load each section from a separate session. Each session was to be initiated from a different computer, so as to spread the load. (Note, though, that the question was highly MSSQL-specific. As I demonstrate below, postgresql has no problem at all loading 30 million entries.)
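In postgresql terms, the shape of that technique would be something like the sketch below. The file names and the piece size are made up, and the table and database are the ones I use in the experiment later in this article; it is only meant to illustrate the idea of splitting the file and loading the pieces from separate sessions.

# Hypothetical sketch: split one big file into 4 pieces of 7.5 million lines each...
split -l 7500000 /tmp/data1.txt /tmp/piece_

# ...then load each piece from its own session. Here they are simply backgrounded
# on one machine; in their scenario, each would run from a different computer.
for f in /tmp/piece_*; do
    echo "copy transactions from '$f'" | sudo -u postgres psql thirtymillion &
done
wait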


Of course, the point of their question was not that loading 30 million entries is hard, but rather: how do you load a large amount of data in as short a time as possible?
 
So I was left wondering how loading from different sessions could speed up the loading process. Sure, I had read debates on the postgresql mailing lists about how loading from multiple sessions could reduce the time, but I had never paid much attention to them since the topic did not interest me at the time.
I did not know much about the subject and had no opinion on it, so I accepted their answer for the time being. But, ever the curious type, I decided to conduct my own experiment.
I set out to answer three questions:
  1. Is loading 30 million rows in 4 hours hard to accomplish with postgresql?
  2. Does postgresql have any problem loading 30 million rows in a single transaction?
  3. Does loading data from multiple sessions speed up the process?
For questions #1 and #2, the experiment was done on my laptop, a 1.3 GHz Pentium M with 512 MB RAM and a 4200 RPM disk, running the assorted programs I usually run: emacs, X, kde, etc.
This was by no means a dedicated, let alone powerful, database machine. The experiment ran while I continued doing what I usually do: web browsing, software development, chatting, etc. I did not expect consistent results between runs because of this other activity, nor did consistency matter much, since what I was looking for was an upper bound.

The experiment to answer question #3 was performed on a machine with two 3 GHz Pentium 4 CPUs, 512 MB RAM, and a RAID-5 array of 10K RPM SCSI disks. This was a dedicated database machine, and I expected to get more consistent results.

I expected that if multiple sessions turned out to speed up the process, the gain would depend on how many indexes the table has. Data loading itself is an I/O-bound activity, so, even with multiple sessions, the total write throughput would still be constrained by the maximum write rate of the disk system. Indexing, on the other hand, involves a significant amount of computation, so it would be limited by the available computing power. You could probably run more than n sessions on an n-CPU computer, since the CPUs would sit idle some of the time waiting for disk I/O. Yet, at some point, the overhead of context switching would become too big, so there is certainly a limit on the number of sessions. What is that limit? I don't know, but I suspect it depends very much on your system.
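If you wanted to see where that limit is for a particular system, watching the machine while a load runs would show whether the sessions are waiting on disk or on CPU. I did not do this systematically in the runs below, but the usual tools would be something like:

# Rough way to watch the bottleneck while a load is running (not part of the runs below).
vmstat 5      # high "wa" (I/O wait) suggests disk-bound; low "id" (idle) suggests CPU-bound
iostat -x 5   # extended per-disk statistics: utilization and queue lengths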

 

Questions #1 and #2

First, I wrote a simple script to generate the data:
#!/usr/bin/env ruby1.8
# /tmp/gen_data.rb
# Prints one tab-separated row per id from ARGV[0] to ARGV[1] inclusive:
# id, id+10000, a random amount below 500, and a description string.

ARGV[0].to_i.upto(ARGV[1].to_i){|x|
  puts "#{x}\t#{10000+x}\t#{rand 500}\tdescription-#{x}"
}
 
Sure, the data file they were importing would not be this simple. But that should not matter much, since the most important operation, reserving a row for the new data, is done the same number of times. Populating the rows mostly depends on the write throughput of your I/O system.
Then I generated 4 data files:
 
dede:~$ /usr/bin/time /tmp/gen_data.rb 1 30000000 > /tmp/data1.txt
333.07user 13.80system 6:29.38elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (5major+501minor)pagefaults 0swaps

dede:~$ /usr/bin/time /tmp/gen_data.rb 30000001 60000000 > /tmp/data2.txt
331.99user 13.95system 6:25.52elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (11major+477minor)pagefaults 0swaps

dede:~$ /usr/bin/time /tmp/gen_data.rb 60000001 90000000 > /tmp/data3.txt
322.55user 13.38system 6:13.09elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+488minor)pagefaults 0swaps

dede:~$ /usr/bin/time /tmp/gen_data.rb 90000001 120000000 > /tmp/data4.txt
324.78user 13.67system 7:19.02elapsed 77%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+488minor)pagefaults 0swaps

dede:~$ ls -lah /tmp/data*.txt
-rw-r--r--  1 ysantoso ysantoso 1.2G Feb  9 21:20 /tmp/data1.txt
-rw-r--r--  1 ysantoso ysantoso 1.2G Feb  9 21:45 /tmp/data2.txt
-rw-r--r--  1 ysantoso ysantoso 1.2G Feb  9 22:05 /tmp/data3.txt
-rw-r--r--  1 ysantoso ysantoso 1.3G Feb 10 22:30 /tmp/data4.txt

dede:~$ head -1 /tmp/data1.txt
0       10000   389     description-0

dede:~$ tail -1 /tmp/data1.txt
30000000 30010000 278 description-30000000

Then I created a database and a table in postgresql v. 7.4.6:
thirtymillion=> create table transactions (tid int4, mssinceunixepoch int4, amount int8, description text);
CREATE TABLE
thirtymillion=> \d transactions
      Table "public.transactions"
      Column      |  Type   | Modifiers
------------------+---------+-----------
 tid              | integer |
 mssinceunixepoch | integer |
 amount           | bigint  |
 description      | text    |

All the column names are fictitious.
I did not declare a primary key yet because I wanted to time data loading without any indexing; having a primary key implies having a unique index.
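(If all I wanted were the index rather than the constraint, creating a plain unique index after the load would be roughly equivalent in cost; the index name below is just an example.)

create unique index transactions_tid_idx on transactions (tid);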
 
thirtymillion=# copy transactions from '/tmp/data1.txt';
COPY
Time: 1639971.939 ms (27m20s)
So, 27m20s to load the first 30 million rows (the Time figures come from psql's \timing). That's not bad at all. But how much time does indexing take?
thirtymillion=> alter table transactions add primary key (tid);
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
ALTER TABLE
Time: 1374100.087 ms (22m54s)

The total for loading and indexing the first thirty million rows was 27m20s + 22m54s = 50m14s.
Good news! That was nowhere near 4 hours. Could this be a fluke?
Let's load the next three thirty-million-row datasets:
 
dede:~$ echo "copy transactions from '/tmp/data2.txt' "| /usr/bin/time sudo -u postgres psql thirtymillion
COPY
0.01user 0.00system 38:19.59elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2286minor)pagefaults 0swaps

dede:~$ echo "copy transactions from '/tmp/data3.txt' "| /usr/bin/time sudo -u postgres psql thirtymillion
COPY
0.02user 0.01system 44:00.94elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (12major+2274minor)pagefaults 0swaps

dede:~$ echo "copy transactions from '/tmp/data4.txt' "| /usr/bin/time sudo -u postgres psql thirtymillion
COPY
0.01user 0.01system 45:03.43elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (22major+2264minor)pagefaults 0swaps

Don't heed the 0%CPU figure; that is the CPU consumption of the client (psql), not the server.
As you can see, though, the figures, 38m20s, 44m01s, and 45m03s, are not that different from each other and, more importantly, are all well under 4 hours. Notice that the first dataset took significantly longer than the rest. I do not know exactly why, but my hypothesis is that the disk cache is used less effectively when the loading and indexing are done as separate passes. The hypothesis may well be wrong, but what matters is that I could now answer the first two questions:

  • Is loading 30 million rows in 4 hours hard to accomplish with postgresql? No.
  • Does postgresql have any problem loading 30 million rows in a single transaction? No.
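(For what it's worth, each COPY above already runs as a single transaction: either all 30 million rows go in, or none do. If you wanted the load and the indexing to be one explicit transaction, the shape would be roughly the following; I did not time this variant separately.)

begin;
copy transactions from '/tmp/data1.txt';
alter table transactions add primary key (tid);
commit;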

 

Question #3

I set up several data files, each containing 30 million entries, for four sub-experiments:
  1. Time separate loading and indexing, as in the previous experiment.
  2. Time combined loading and indexing.
  3. Time two concurrent loading-and-indexing sessions. The machine had two CPUs, so each CPU handled at most one session.
  4. Time six concurrent loading-and-indexing sessions. The idea was to simulate a condition of high CPU and I/O consumption.
A sample of data generation:
 
$ /usr/bin/time ./gen_data.rb 1 30000000 > data1.txt
180.55user 4.74system 3:05.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+531minor)pagefaults 0swaps

Now, let's time loading the first thirty million rows without any indexing:
 
$ echo "copy transactions from '/c1/tmp/30mil/data1.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion
COPY
0.00user 0.01system 12:26.37elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2015minor)pagefaults 0swaps

The time, 12m26s, was about half of the 27m20s on the laptop, which was as expected since loading should be an I/O-bound process: a 10K RPM disk should be able to write data roughly twice as fast as a 4200 RPM one.

Indexing it:
 
$ echo "alter table transactions add primary key (tid)" | /usr/bin/time psql thirtymillion
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
ALTER TABLE
0.00user 0.01system 8:55.03elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2024minor)pagefaults 0swaps

Total time for the first batch was: 12m26s + 8m55s = 21m21s.

As expected, indexing benefited from the more powerful CPU. So, could concurrent sessions reduce the total time simply by putting the second CPU to work? Before answering that, I timed loading into the table with its index already in place, i.e., combined loading and indexing:
 
$ echo "copy transactions from '/c1/tmp/30mil/data2.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion
COPY
0.00user 0.01system 15:00.04elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2025minor)pagefaults 0swaps

15m00s, noticeably faster than the 21m21s of doing the loading and indexing as separate steps.
Now let's time two concurrent loading-and-indexing sessions:
 
$ echo "copy transactions from '/c1/tmp/30mil/data3.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion & 
echo "copy transactions from '/c1/tmp/30mil/data4.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &

COPY
0.01user 0.01system 30:41.26elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+2016minor)pagefaults 0swaps
COPY
0.01user 0.01system 31:05.41elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+2017minor)pagefaults 0swaps
 
It took about 31 minutes to load 2*30 = 60 million rows, roughly twice as long as loading 30 million rows from a single session. The loading time seems to grow linearly with the total number of rows, no matter how many sessions share the work.
If the rate holds at 30 million rows per 15 minutes, then 6*30 million = 180 million rows should take about 6*15 min = 1h30min. Let's verify:
 
 
 
$ echo "copy transactions from '/c1/tmp/30mil/data5.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data6.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data7.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data8.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data9.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data10.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &

COPY
0.01user 0.01system 1:36:44elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2022minor)pagefaults 0swaps
COPY
0.01user 0.01system 1:38:23elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2023minor)pagefaults 0swaps
COPY
0.01user 0.01system 1:39:05elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2011minor)pagefaults 0swaps
COPY
0.01user 0.02system 1:39:09elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2016minor)pagefaults 0swaps
COPY
0.01user 0.01system 1:39:18elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2011minor)pagefaults 0swaps
COPY
0.01user 0.02system 1:39:20elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2020minor)pagefaults 0swaps 
 
Indeed! 180 million rows took about 1h39min, not far off the predicted 1h30min.
 
So, 12*30 million = 360 million rows should take about 12*15 min = 3 hours. Let's verify: twelve (!) concurrent bulk loads, with only 2 CPUs to run them:
$ echo "copy transactions from '/c1/tmp/30mil/data11.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data12.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data13.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data14.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data15.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data16.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data17.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data18.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data19.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data20.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data21.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &
echo "copy transactions from '/c1/tmp/30mil/data22.txt'"| sudo -u postgres /usr/bin/time psql thirtymillion &

COPY
0.01user 0.02system 3:29:18elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2018minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:32:57elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2016minor)pagefaults 0swaps
COPY
0.01user 0.01system 3:33:27elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2017minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:33:41elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2017minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:33:52elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2021minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:34:12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2016minor)pagefaults 0swaps
COPY
0.01user 0.01system 3:34:37elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2019minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:34:39elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2019minor)pagefaults 0swaps
COPY
0.01user 0.02system 3:35:08elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2019minor)pagefaults 0swaps
COPY
0.01user 0.01system 3:35:16elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2014minor)pagefaults 0swaps
COPY
0.01user 0.01system 3:35:19elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2012minor)pagefaults 0swaps
COPY
0.01user 0.01system 3:35:23elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2011minor)pagefaults 0swaps 
 
It's close. Predicted 3 hours, actual 3h35min. 
 
I didn't really pay attention to the CPU consumption during the runs, so I couldn't tell whether the slowdown was due to the process being CPU-bound or I/O-bound.
But I can now answer question #3:

  • Does loading data from multiple sessions speed up the process? No.
If anything, loading data from too many sessions actually slows it down.
 
 
By the end of the experiment, I had loaded more than 500 million rows into postgresql at a remarkably steady rate of 30 million rows per 15 minutes. I am not sure why they needed 4 hours to do the same. Even allowing for whatever referential integrity checks they may have, 4 hours is still 16 times 15 minutes.

I think they need to look closer at their DBA...