Simplifying complex problems is difficult, but when done well, the rewards are worth it.
How I single-handedly managed to migrate 55 Billion rows in 4000+ tables:
Last year, I created PowerPump, a PowerShell-based data copy server that continuously migrated 4000+ Oracle tables to SQL Server. It was no simple task given that I had 55 billion rows (about 20 TB of data) to contend with, in varying table/column formats. The largest table had over 7 billion rows in it! With few resources to pore over details, all of it had to be extremely automated.
Everything boils down to one core concept, and in this post I will explain the concept that drove the PowerPump engine.
The core concept here is "Break it up". How is the metaphorical elephant eaten? By chopping it into bite-size pieces. That is exactly how I moved the billions of rows, in a fashion that could be resumed if errors occurred.
I can promise you that this is one of those tips you will return to, probably soon and probably often.
The Problem: UPDATE 100 Million rows or more:
If I had to do an UPDATE on 100 Million rows in a table, there are many tricks to do so, but updating all 100 Million rows with a single UPDATE statement would certainly be at the very bottom of my list of options, for various reasons. "Certain death awaits anything that tries to swallow an elephant whole", so to speak.
In reality, I am likely to run into errors such as these after the UPDATE runs (possibly for several hours):
ORA-01555: snapshot too old: rollback segment number string with name “string” too small
The log file for database ‘[database]’ is full. Back up the transaction log for the database to free up some log space.
Quite simply, these errors relate to the inability of the corresponding databases to handle the volume of the transaction, often because of physical size limits that DBAs impose on transaction log files to prevent runaway transactions. It is also very likely not a mistake on the DBAs' part that the log is sized the way it is; your transaction is simply too big.
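To make the anti-pattern concrete, this is the shape of statement I am talking about (the table and column names here are made up purely for illustration). On a 100-million-row table, this is one enormous transaction:

-- Hypothetical example of the single-statement approach:
-- one UPDATE, one giant transaction, one giant undo/log footprint.
UPDATE big_table
SET    status_col = 'MIGRATED'
WHERE  status_col <> 'MIGRATED';
COMMIT;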
“The right way to eat an elephant” – One bite at a time:
One of the tricks at the top of my list for the above UPDATE is to break up the 100 Million rows into chunks of, say, 100,000, which is more manageable in terms of both the transaction size and the time it takes to run. I can also resume my update from the last failure point.
Oracle “Analytic functions” to the rescue:
In Oracle, there is a class of functions called "analytic functions". They are called "windowing functions" in SQL Server. Since they are defined as part of the ISO SQL standard, they behave the same in both Oracle and SQL Server, differing mainly in how extensive each implementation is.
I am using sys.obj$ to illustrate this in Oracle, but any table works:
SELECT ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
       CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/10) AS Batch_Number,
       ROWID AS RowID1
FROM   sys.obj$
The above query breaks the rows into chunks of 10 (pretend each chunk is 100k rows) for illustration. Notice the BATCH_NUMBER column.
Break up the rows into ranges within a MIN and MAX boundary
There is a reason for including ROWID, Oracle's built-in pseudo column present on every table. We are going to use it to break up our results into chunks of 100k (10 rows in this illustration):
WITH splits AS (
    SELECT ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
           CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/10) AS Batch_Number,
           ROWID AS RowID1
    FROM   sys.obj$
)
SELECT Batch_Number,
       MIN(RowID1) AS MinRowID,
       MAX(RowID1) AS MaxRowID,
       COUNT(1)    AS Batch_Row_Count
FROM   splits
GROUP  BY Batch_Number
ORDER  BY Batch_Number
Now, work with the “chunks”
At this point, I would loop through these chunks that have 10 rows each and process them in batches that are more manageable in size.
The Oracle pseudo code would look something like this to update in chunks of 100,000 rows:
DECLARE
    v_Chunk_Size NUMBER := 100000;
BEGIN
    FOR v_Range IN (
        WITH splits AS (
            SELECT ROW_NUMBER() OVER (ORDER BY rowid) AS Row_Num,
                   CEIL(ROW_NUMBER() OVER (ORDER BY rowid)/v_Chunk_Size) AS Batch_Number,
                   ROWID AS RowID1
            FROM   <<MyTable>>
            WHERE  <<MyTable.MyCol>> <> <<Value>>
        )
        SELECT Batch_Number,
               MIN(RowID1) AS MinRowID,
               MAX(RowID1) AS MaxRowID,
               COUNT(1)    AS Batch_Row_Count
        FROM   splits
        GROUP  BY Batch_Number
        ORDER  BY Batch_Number
    )
    LOOP
        DBMS_OUTPUT.PUT_LINE('Updating MIN to MAX ROWID range: '
                             || v_Range.MinRowID || ', ' || v_Range.MaxRowID);

        UPDATE <<MyTable>>
        SET    <<MyCol>> = <<Value>>
        WHERE  rowid >= v_Range.MinRowID
        AND    rowid <= v_Range.MaxRowID;

        COMMIT;
    END LOOP;
END;
- Broke up the results into chunks of 100,000 rows using the ROW_NUMBER() function, driven by the v_Chunk_Size parameter
- Fetched the resulting chunks with the beginning and ending ROWID values
- Looped through each range and updated data in the table only within that range
- Committed the results after updating each chunk
The chances of running into errors are much lower using the above mechanism.
SQL Server equivalent “Windowing functions”:
The process for SQL Server is quite similar, but since SQL Server does not have a built-in ROWID column, a primary key column such as an ID column needs to be used to control the ranges.
WITH splits AS (
    SELECT SalesOrderId,
           SalesOrderDetailID,
           ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) AS Row_Num,
           CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10.) AS Batch_Number
    FROM   Sales.SalesOrderDetail sod
)
SELECT Batch_Number,
       MIN(SalesOrderDetailId) AS MinSalesOrderDetailId,
       MAX(SalesOrderDetailId) AS MaxSalesOrderDetailId,
       COUNT(1) AS Batch_Row_Count
FROM   splits
GROUP  BY Batch_Number
ORDER  BY Batch_Number
Devil is in the details:
The key thing to remember with SQL Server is to force non-integer division by using a decimal, as shown in the above example with "10." (the same as writing "10.0"). Without the ".", integer division truncates before CEILING is applied and produces uneven splits. That is not the result you intend if you want accurate chunks.
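If you want to see the effect in isolation, a quick one-liner (plain arithmetic, not tied to any table) makes the difference obvious:

-- Integer division truncates before CEILING is applied,
-- while dividing by "10." forces decimal arithmetic.
SELECT CEILING(5 / 10)  AS Int_Division,      -- returns 0
       CEILING(5 / 10.) AS Decimal_Division;  -- returns 1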
To show you the difference, I have included the SQL and results of two queries, one that uses "." and one that does not, with "." being the only difference:
First, without “.”
SELECT SalesOrderId,
       SalesOrderDetailID,
       ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) AS Row_Num,
       CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10) AS Batch_Number
FROM   Sales.SalesOrderDetail sod
WHERE  sod.SalesOrderID BETWEEN (43659 + 10) AND (43659 + 20)
Notice that the Batch_Number prematurely switches from 0 to 1 on row 10 instead of row 11.
Now, notice the same SQL with only the "." being different, as in "(ORDER BY SalesOrderDetailId)/10."
SELECT SalesOrderId,
       SalesOrderDetailID,
       ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId) AS Row_Num,
       CEILING(ROW_NUMBER() OVER (ORDER BY SalesOrderDetailId)/10.) AS Batch_Number
FROM   Sales.SalesOrderDetail sod
WHERE  sod.SalesOrderID BETWEEN (43659 + 10) AND (43659 + 20)
Not only are the results different just from adding a decimal point, they are now also correct. Notice that the Batch_Number changes exactly after every 10 rows, as we intended.
INSERT/DELETE operations are very similar. Always keep the size of the transaction small, managed and tracked. With SQL Server, you could also use the TOP clause to limit rows for DELETEs and UPDATEs.
SQL Server – DELETE in chunks using TOP clause:
Although I had a little blurb above, Sandra, my friend from the Charlotte SQL Server user group, pointed out the need to reiterate that the TOP clause can be used to DELETE in chunks. Below is her exact note. I was hoping to write a post on it myself, but Vuk Memarovic's post that she references does a good job of explaining how to do this.
SQL Server has a nice way to do simple updates/deletes in chunks, which could be re-written to use a CTE or update-able view. Would this work with Oracle?: delete from t1 where a in (select top (10000) a from t1 order by a); Thanks to Vuk Memarovic for sending me this link: http://web.archive.org/web/20100212155407/http://blogs.msdn.com/sqlcat/archive/2009/05/21/fast-ordered-delete.aspx
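To tie her note back to the chunking theme, here is a rough sketch of how that ordered delete could be wrapped in a loop using a CTE. The table t1 and column a come from her snippet; the WHILE loop and the chunk size are my own illustration, not from the referenced post:

-- Delete in chunks of 10,000 ordered rows until nothing is left.
WHILE 1 = 1
BEGIN
    ;WITH del AS (
        SELECT TOP (10000) a
        FROM   dbo.t1
        ORDER  BY a
    )
    DELETE FROM del;

    IF @@ROWCOUNT = 0 BREAK;   -- stop once no rows remain
END;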
Oh, and to answer Sandra's question: Oracle does not have a TOP clause. There are other ways to achieve the same thing, but the TOP clause itself is not supported.
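One of those other ways is ROWNUM. A rough PL/SQL sketch of a chunked delete follows; the table name and chunk size are illustrative, and note that unlike the TOP ... ORDER BY version, ROWNUM does not give you an ordered delete:

BEGIN
    LOOP
        -- Delete up to 10,000 rows at a time (add your own predicate).
        DELETE FROM t1
        WHERE  ROWNUM <= 10000;

        EXIT WHEN SQL%ROWCOUNT = 0;  -- stop when nothing was deleted
        COMMIT;                      -- keep each chunk a small transaction
    END LOOP;
    COMMIT;
END;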
What about the 55 billion rows migration?
The concept I used for the migration of rows from Oracle to SQL Server is quite similar. PowerPump has tens of thousands of lines of code, but all of it revolves around this simple concept: break it up.
- Save the list of tables to migrate to a table
- For each table, break up the rows into chunks of certain size (as we did above)
- Store the chunk details in another “TableChunk” table
- Loop through chunks to select subset from source and copy to target
- Keep statuses to resume the process from where it left off (on failures)
- Size of the chunk is the size of the transaction
Obviously, this is an oversimplification of the actual process; in reality you would drive something like this with a high degree of parallelism.
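To give the list above a little more shape, here is a minimal sketch of what the control tables behind such a process might look like. The actual PowerPump schema is not shown in this post, so apart from the "TableChunk" name mentioned above, these table and column names are assumptions:

-- Hypothetical sketch of the tracking tables; names are assumptions.
CREATE TABLE dbo.MigrationTable (
    TableName   SYSNAME     NOT NULL PRIMARY KEY,        -- source table to migrate
    Status      VARCHAR(20) NOT NULL DEFAULT 'Pending'
);

CREATE TABLE dbo.TableChunk (
    TableName   SYSNAME      NOT NULL,
    ChunkNumber INT          NOT NULL,                   -- Batch_Number from the split query
    MinKeyValue VARCHAR(100) NOT NULL,                   -- MIN ROWID/PK of the chunk
    MaxKeyValue VARCHAR(100) NOT NULL,                   -- MAX ROWID/PK of the chunk
    Status      VARCHAR(20)  NOT NULL DEFAULT 'Pending', -- Pending/Copied/Failed
    PRIMARY KEY (TableName, ChunkNumber)
);
-- The copy loop picks the next 'Pending' chunk, copies that key range from
-- source to target, then marks it 'Copied', so a restart simply resumes
-- from the first chunk that is not yet 'Copied'.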
When breaking up and updating/deleting rows in chunks, it is best to align the value on which the chunks are split with the clustered index column at best, or a non-clustered index column at worst. Without supporting indexes to guide the range-based updates, a full-table scan will be done during the fetch as well as the update for every chunk!