Eating An Elephant – How To Work With Huge Datasets In Oracle And SQL Server

Simplifying complex problems is difficult, but when you manage it, the rewards are definitely worth it.

How I single-handedly managed to migrate 55 Billion rows in 4000+ tables:

Last year, I created PowerPump, a PowerShell-based data copy server that continuously migrated 4000+ Oracle tables to SQL Server. It was no simple task, given that I had 55 billion rows (about 20 TB of data) to contend with, in varying table/column formats. The largest table had over 7 billion rows! With few resources to pore over the details, all of it had to be heavily automated.

Everything boils down to one core concept, and in this post I will explain the concept that drove the PowerPump engine.

The core concept here is – “Break it up“. How is the metaphorical elephant eaten? By chopping it into bite-size pieces. That is exactly how I moved the billions of rows, in a fashion that could resume if errors occurred.

I can promise you that this is one of those tips you will come back to, probably soon and probably often.

The Problem: UPDATE 100 Million rows or more:

If I had to do an UPDATE on 100 million rows in a table, there are many tricks to do so, but updating all 100 million rows with a single UPDATE would certainly be at the very bottom of my list of options, for various reasons. “Choking to death awaits anything that tries to eat an elephant in one piece”, so to speak.

In reality, I am likely to run into errors such as these after the UPDATE runs (possibly for several hours):

Oracle:

ORA-01555: snapshot too old: rollback segment number string with name “string” too small

SQL Server:

The log file for database ‘[database]’ is full. Back up the transaction log for the database to free up some log space.

Quite simply, these errors relate to the inability of the database to absorb the volume of a single transaction, typically because of size limits the DBAs have placed on rollback/undo segments or transaction log files to prevent runaway transactions. It is very likely not a mistake on the DBAs’ part that these are sized the way they are; your transaction is simply too big.
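
As an aside, and not part of the original migration: on SQL Server, a quick way to see how much of each transaction log a long-running statement is consuming is the log-space report below; on the Oracle side, undo consumption can be watched through views such as V$UNDOSTAT.

-- SQL Server: show the size of every database's transaction log and the percentage in use.
DBCC SQLPERF (LOGSPACE);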

“The right way to eat an elephant” – One bite at a time:

The trick at the top of my list for the above UPDATE is to break up the 100 million rows into chunks of, say, 100,000, which is far more manageable in terms of both transaction size and the time each piece takes to run. It also lets me resume the update from the last failure point.

Oracle “Analytic functions” to the rescue:

In Oracle, there is a class of functions called “analytic functions”; SQL Server calls them “windowing functions”. Since they are defined as part of the ISO SQL standard, they work essentially the same way in both Oracle and SQL Server, differing mainly in how extensive each implementation is.

I am using sys.obj$ to illustrate this in Oracle, but any table works:

SELECT
	ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
	CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/10) AS Batch_Number,
	ROWID AS RowID1
FROM
	sys.obj$

The above query breaks the rows into chunks of 10 (pretend each chunk is 100,000 rows) for illustration. Notice the BATCH_NUMBER column:

ROW_NUM BATCH_NUMBER ROWID1
1 1 AAAAASAABAAAAA+AAA
2 1 AAAAASAABAAAAA+AAB
3 1 AAAAASAABAAAAA+AAC
4 1 AAAAASAABAAAAA+AAD
5 1 AAAAASAABAAAAA+AAE
6 1 AAAAASAABAAAAA+AAF
7 1 AAAAASAABAAAAA+AAG
8 1 AAAAASAABAAAAA+AAH
9 1 AAAAASAABAAAAA+AAI
10 1 AAAAASAABAAAAA+AAJ
11 2 AAAAASAABAAAAA+AAK
12 2 AAAAASAABAAAAA+AAL
13 2 AAAAASAABAAAAA+AAM
14 2 AAAAASAABAAAAA+AAN
15 2 AAAAASAABAAAAA+AAO
16 2 AAAAASAABAAAAA+AAP
17 2 AAAAASAABAAAAA+AAQ
18 2 AAAAASAABAAAAA+AAR
19 2 AAAAASAABAAAAA+AAS
20 2 AAAAASAABAAAAA+AAT
21 3 AAAAASAABAAAAA+AAU
22 3 AAAAASAABAAAAA+AAV
23 3 AAAAASAABAAAAA+AAW
24 3 AAAAASAABAAAAA+AAX
25 3 AAAAASAABAAAAA+AAY
26 3 AAAAASAABAAAAA+AAZ
27 3 AAAAASAABAAAAA+AAa
28 3 AAAAASAABAAAAA+AAb
29 3 AAAAASAABAAAAA+AAc
30 3 AAAAASAABAAAAA+AAd
31 4 AAAAASAABAAAAA+AAe
…more rows

Break up the rows into ranges between a MIN and a MAX boundary

There is a reason for including ROWID, the pseudocolumn Oracle exposes on every table: we are going to use it to break our results into chunks of 100,000 rows (10 rows in this illustration):

WITH splits
AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
        CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/10) AS Batch_Number,
        ROWID AS RowID1
    FROM
        sys.obj$
)
SELECT
    Batch_Number,
    MIN(ROWID1) AS MinRowID,
    MAX(ROWID1) AS MaxRowID,
    COUNT(1) AS Batch_Row_Count
FROM
    splits
GROUP BY
    Batch_Number
ORDER BY
Batch_Number
BATCH_NUMBER MINROWID MAXROWID BATCH_ROW_COUNT
1 AAAAASAABAAAAA+AAA AAAAASAABAAAAA+AAJ 10
2 AAAAASAABAAAAA+AAK AAAAASAABAAAAA+AAT 10
3 AAAAASAABAAAAA+AAU AAAAASAABAAAAA+AAd 10
4 AAAAASAABAAAAA+AAe AAAAASAABAAAAA+AAn 10
5 AAAAASAABAAAAA+AAo AAAAASAABAAAAA+AAx 10
6 AAAAASAABAAAAA+AAy AAAAASAABAAAAA+AA7 10
7 AAAAASAABAAAAA+AA8 AAAAASAABAAAAA+ABF 10
8 AAAAASAABAAAAA+ABG AAAAASAABAAAAA+ABP 10
9 AAAAASAABAAAAA+ABQ AAAAASAABAAAAA+ABZ 10
10 AAAAASAABAAAAA+ABa AAAAASAABAAAAA+ABj 10
…more rows

Now, work with the “chunks”

At this point, I would loop through these chunks that have 10 rows each and process them in batches that are more manageable in size.

The Oracle pseudocode to update in chunks of 100,000 rows would look something like this:

DECLARE
    v_Chunk_Size NUMBER := 100000;
BEGIN
    FOR v_Range IN (
        WITH splits
        AS
        (
            SELECT
                ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
                CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/v_Chunk_Size) AS Batch_Number,
                ROWID AS RowID1
            FROM
                <<MyTable>>
            WHERE
                <<MyTable.MyCol>> <> <<Value>>
        )
        SELECT
            Batch_Number,
            MIN(ROWID1) AS MinRowID,
            MAX(ROWID1) AS MaxRowID,
            COUNT(1) AS Batch_Row_Count
        FROM
            splits
        GROUP BY
            Batch_Number
        ORDER BY
            Batch_Number
    ) LOOP

        DBMS_OUTPUT.PUT_LINE('Updating MIN to MAX ROWID range: ' || v_Range.MinRowID || ', ' || v_Range.MaxRowID);

        UPDATE <<MyTable>>
        SET <<MyCol>> = <<Value>>
        WHERE rowid >= v_Range.MinRowID
            AND rowid <= v_Range.MaxRowID
            AND <<MyCol>> <> <<Value>>; -- skip rows in the range that already hold the target value

        COMMIT;

    END LOOP;
END;

Essentially, we

  • Broke up the results into chunks of 100,000 rows using the ROW_NUMBER() function, driven by the v_Chunk_Size variable
  • Fetched the resulting chunks with their beginning and ending ROWID values
  • Looped through each range and updated the data in the table only within that range
  • Committed after updating each chunk

The chances of running into errors are much lower with this mechanism.

SQL Server equivalent “Windowing functions”:

The process for SQL Server is quite similar, but since SQL Server does not have a built-in ROWID column, a primary key column (such as an identity/ID column) needs to be used to control the ranges.

WITH splits
AS
(
       SELECT
              SalesOrderId,
              SalesOrderDetailID,
              ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) Row_Num,
              CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10.) AS Batch_Number
       FROM Sales.SalesOrderDetail sod
)
SELECT
    Batch_Number,
    MIN(SalesOrderDetailId) AS MinSalesOrderDetailId,
    MAX(SalesOrderDetailId) AS MaxSalesOrderDetailId,
    COUNT(1) AS Batch_Row_Count
FROM
    splits
GROUP BY
    Batch_Number
ORDER BY
    Batch_Number
Batch_Number MinSalesOrderDetailId MaxSalesOrderDetailId Batch_Row_Count
1 1 10 10
2 11 20 10
3 21 30 10
4 31 40 10
5 41 50 10
6 51 60 10
7 61 70 10
8 71 80 10
9 81 90 10
10 91 100 10
11 101 110 10
12 111 120 10
…more rows

The devil is in the details:

The key thing to remember with SQL Server is to force decimal (non-integer) division, as shown in the example above with “10.”. This is the same as saying “10.0”. Without the “.”, integer division truncates the result and produces uneven splits, which is not what you intend if you want accurate batches.

To show the difference, below are the SQL and results of two queries, one that uses the “.” and one that does not, with the “.” being the only difference:

First, without “.”

SELECT
       SalesOrderId,
       SalesOrderDetailID,
       ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) Row_Num,
       CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10) AS Batch_Number
FROM Sales.SalesOrderDetail sod
WHERE sod.SalesOrderID BETWEEN (43659 + 10) AND (43659 + 20)

Notice that Batch_Number starts at 0 and prematurely switches from 0 to 1 on row 10 instead of row 11.

SalesOrderId SalesOrderDetailID Row_Num Batch_Number
43669 110 1 0
43670 111 2 0
43670 112 3 0
43670 113 4 0
43670 114 5 0
43671 115 6 0
43671 116 7 0
43671 117 8 0
43671 118 9 0
43671 119 10 1
43671 120 11 1
43671 121 12 1
43671 122 13 1
43671 123 14 1
43671 124 15 1
43671 125 16 1
43672 126 17 1
43672 127 18 1
43672 128 19 1
43673 129 20 2
43673 130 21 2
43673 131 22 2
43673 132 23 2
43673 133 24 2
43673 134 25 2
43673 135 26 2
…more rows

Now, with “.” in “(ORDER BY SalesOrderDetailId)/10.”

Notice the same SQL with only the “.” being different, as in “(ORDER BY SalesOrderDetailId)/10.”:

SELECT
       SalesOrderId,
       SalesOrderDetailID,
       ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) Row_Num,
       CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10.) AS Batch_Number
FROM Sales.SalesOrderDetail sod
WHERE sod.SalesOrderID BETWEEN (43659 + 10) AND (43659 + 20)

Not only are the results different just from adding a decimal point, they are now also correct. Notice that Batch_Number changes exactly every 10 rows, as we intended.

SalesOrderId SalesOrderDetailID Row_Num Batch_Number
43669 110 1 1
43670 111 2 1
43670 112 3 1
43670 113 4 1
43670 114 5 1
43671 115 6 1
43671 116 7 1
43671 117 8 1
43671 118 9 1
43671 119 10 1
43671 120 11 2
43671 121 12 2
43671 122 13 2
43671 123 14 2
43671 124 15 2
43671 125 16 2
43672 126 17 2
43672 127 18 2
43672 128 19 2
43673 129 20 2
43673 130 21 3
43673 131 22 3
43673 132 23 3
43673 133 24 3
43673 134 25 3
43673 135 26 3
…more rows
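
Putting it together, here is a sketch of a SQL Server equivalent of the Oracle loop shown earlier. This is an illustration, not the exact PowerPump code: the #Chunks temp table is my own staging device, and the column being updated (ModifiedDate) and the value (GETDATE()) are stand-ins for whatever you actually need to change.

DECLARE @Chunk_Size int = 100000;

IF OBJECT_ID('tempdb..#Chunks') IS NOT NULL
    DROP TABLE #Chunks;

-- Stage the MIN/MAX boundary of every chunk, exactly like the windowing query above.
WITH splits
AS
(
    SELECT
        SalesOrderDetailID,
        CAST(CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailID) / (@Chunk_Size * 1.0)) AS int) AS Batch_Number
    FROM Sales.SalesOrderDetail
)
SELECT
    Batch_Number,
    MIN(SalesOrderDetailID) AS MinId,
    MAX(SalesOrderDetailID) AS MaxId
INTO #Chunks
FROM splits
GROUP BY Batch_Number;

DECLARE @Batch int = 1;
DECLARE @MaxBatch int = (SELECT MAX(Batch_Number) FROM #Chunks);
DECLARE @MinId int, @MaxId int;

WHILE @Batch <= @MaxBatch
BEGIN
    SELECT @MinId = MinId, @MaxId = MaxId
    FROM #Chunks
    WHERE Batch_Number = @Batch;

    -- One small transaction per chunk keeps the log from filling up.
    BEGIN TRANSACTION;

    UPDATE Sales.SalesOrderDetail
    SET ModifiedDate = GETDATE()          -- placeholder column and value
    WHERE SalesOrderDetailID BETWEEN @MinId AND @MaxId;

    COMMIT TRANSACTION;

    SET @Batch += 1;
END;

Because each chunk is its own transaction, a failure only affects the current chunk, and the process can restart from the last committed batch.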

INSERT and DELETE operations are handled very similarly. Always keep the size of the transaction small, managed, and tracked. With SQL Server you could also use the TOP clause to limit the rows affected by DELETEs and UPDATEs.

SQL Server – DELETE in chunks using TOP clause:

Although I had a little blurb about this above, Sandra, my friend from the Charlotte SQL Server user group, pointed out the need to reiterate that the TOP clause can be used to DELETE in chunks. Below is her exact note. I was hoping to write a post on this myself, but Vuk Memarovic’s post that she references does a good job of explaining how to do it.

SQL Server has a nice way to do simple updates/deletes in chunks, which could be re-written to use a CTE or update-able view. Would this work with Oracle?: delete from t1 where a in (select top (10000) a from t1 order by a); Thanks to Vuk Memarovic for sending me this link: http://web.archive.org/web/20100212155407/http://blogs.msdn.com/sqlcat/archive/2009/05/21/fast-ordered-delete.aspx
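
For reference, here is a minimal sketch of how that statement can be wrapped in a loop so it keeps going until nothing is left to delete. The table t1, the column a, and the 10,000-row batch size are the placeholders from Sandra’s example:

DECLARE @Deleted int = 1;

-- Keep deleting 10,000-row batches until a pass removes nothing.
WHILE @Deleted > 0
BEGIN
    DELETE FROM t1
    WHERE a IN (SELECT TOP (10000) a FROM t1 ORDER BY a);

    SET @Deleted = @@ROWCOUNT;
END;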

Oh, and to answer Sandra’s question: Oracle does not have a TOP clause. There are other ways to achieve the same effect, but the TOP clause itself is not supported.
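
If you want something in the same spirit on the Oracle side, one common pattern (again just a sketch, with t1 as a placeholder table) is to cap each DELETE with ROWNUM and loop until a pass deletes nothing:

BEGIN
    LOOP
        -- Delete at most 10,000 rows per pass; ROWNUM caps the batch size.
        DELETE FROM t1
        WHERE ROWNUM <= 10000;

        EXIT WHEN SQL%ROWCOUNT = 0;

        COMMIT;
    END LOOP;

    COMMIT;
END;
/

Unlike the TOP (10000) ... ORDER BY version, this does not delete in any particular order; it simply limits how many rows each transaction touches. On recent Oracle releases you could also use FETCH FIRST n ROWS ONLY in a subquery for a closer equivalent.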

What about the 55 billion rows migration?

The concept I used for the migration of rows from Oracle to SQL Server is quite similar. PowerPump has tens of thousands of lines of code, but all of it is built around this simple concept: break it up.

  1. Save the list of tables to migrate to a table
  2. For each table, break up the rows into chunks of a certain size (as we did above)
  3. Store the chunk details in another “TableChunk” table (a hypothetical sketch follows below)
  4. Loop through the chunks to select each subset from the source and copy it to the target
  5. Keep statuses so the process can resume from where it left off (on failures)
  6. The size of the chunk is the size of the transaction

That’s it!

Obviously, this is an oversimplification of the actual process; in reality, you would also drive something like this with a high degree of parallelism.
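
For illustration only, here is a hypothetical sketch of what the “TableChunk” table from step 3 might look like on the SQL Server side. The names and types are my guesses, not PowerPump’s actual schema; the point is simply that each chunk carries its boundaries and a status, so a restart can skip everything already marked DONE.

-- Hypothetical chunk-tracking table (illustrative names, not PowerPump's real schema).
CREATE TABLE dbo.TableChunk
(
    TableName       sysname       NOT NULL,
    Batch_Number    int           NOT NULL,
    MinBoundary     varchar(100)  NOT NULL,  -- first ROWID/PK value in the chunk
    MaxBoundary     varchar(100)  NOT NULL,  -- last ROWID/PK value in the chunk
    Batch_Row_Count bigint        NULL,
    Status          varchar(20)   NOT NULL DEFAULT ('PENDING'),  -- PENDING / RUNNING / DONE / ERROR
    LastUpdated     datetime2     NOT NULL DEFAULT (SYSDATETIME()),
    CONSTRAINT PK_TableChunk PRIMARY KEY (TableName, Batch_Number)
);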

Caution/Recommendation:

When breaking up and updating/deleting rows in chunks, it is best to align the column on which the chunks are split with the clustered index at best, or a non-clustered index at worst. Without a supporting index to guide the “range”-based updates, a full-table scan will be done for the fetch as well as the update of every chunk!
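
If the column you split on is not already indexed, it is usually worth creating a supporting index before kicking off the chunked run. A hypothetical example, with placeholder names:

-- Hypothetical supporting index on the column used to define the chunk ranges.
CREATE NONCLUSTERED INDEX IX_MyTable_ChunkCol
    ON dbo.MyTable (ChunkCol);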
