Oracle – Compare Text Files With 10 Million Lines In 10 Minutes Using 10 Lines Of Code – EXTERNAL Tables

The goal of this post is to show how to compare two text files that may have millions of lines and quickly find out which lines differ, using Oracle EXTERNAL tables.

As an Oracle professional, “external tables” are something you may already be familiar with, or maybe you have heard about them but not used them. Let us see how taking 10 minutes to read this can help you compare two 10-million-line text files in 10 minutes or less using about 10 lines of code. 😉

What is an external table?

It is Oracle’s way of making an OS file act like a table. How that table is structured is controlled by the external table’s definition, which spells out how the file should be interpreted.

There are myriad uses for external tables. In fact, once you have used them, you will soon not be able to imagine life without them.

Typically, you use them to load CSV or fixed width files into the Oracle database.

Steps to create an external table

The steps involved to create and use an external table are simple:

Create a DIRECTORY object

For the location where the files will reside, create an Oracle DIRECTORY object

CREATE OR REPLACE DIRECTORY data_dir
AS '/data/oracle/db/mydb/utl/';
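
If the external table will be created or queried by a schema other than the one that owns the directory, that schema needs privileges on it. A one-liner, assuming a hypothetical user named APP_USER (WRITE is needed because the access driver writes its log and bad files there):

GRANT READ, WRITE ON DIRECTORY data_dir TO app_user;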

Create an EXTERNAL table

Create the external table against a data file in the above DIRECTORY:

CREATE TABLE external_table (entire_row_as_col VARCHAR2 (4000))
ORGANIZATION EXTERNAL
   (TYPE oracle_loader
     DEFAULT DIRECTORY data_dir
     ACCESS PARAMETERS
     (
        RECORDS DELIMITED BY NEWLINE SKIP 1     --Don’t need the header row
        FIELDS TERMINATED BY '~'                --Comma is default if unspecified
     )
     LOCATION ('My_Extract.20130630.0.274.csv')
   );

Start querying your text file!

SELECT * FROM external_table;

That was slightly more than 10 lines but you see the point, right?
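
One optional tweak for this quick-and-dirty use case: the default REJECT LIMIT on an external table is 0, so a single unloadable line (for example, one longer than 4000 characters) fails the whole query. Raising the limit lets the query skip such lines and keep going (the rejected lines end up in a bad file in the directory):

ALTER TABLE external_table REJECT LIMIT UNLIMITED;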

A nonconformist approach to external tables:

Warning: Please use this approach only for quick and temporary solutions.

The web is chock-full of examples of the conventional ways of using external tables. Those ways are mostly not flexible, and sometimes they are time-consuming to set up.

Some of the disadvantages of the conventional approach are:

  • Filename and format cannot change once defined
  • Time-consuming to set up
  • Dealing with errors is not straightforward
  • The interpretation of columns cannot be refined over time

There are other disadvantages, but they are outside the scope of this discussion.

The approach for CSV data (using Regular Expressions):

The nonconformist approach is this: each complete line in the file becomes a row in the external table, with a single column holding the entire line. Essentially, the rows are very wide. If you did not notice, the external table example above did exactly this by defining a single VARCHAR2(4000) column named “entire_row_as_col” to hold an entire line’s worth of data.

This immediately gives us a few advantages:

  • Defining the external table is easy
  • ..not to mention, fast!
  • Any file can be boxed into this external table format (you just swap in a different filename; see the one-liner after this list)
  • Interpretation, like data conversion, can be done as we please
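
Pointing the same external table at a different file is a one-line change (a sketch, assuming the new file, with a made-up name here, already sits in the same directory):

ALTER TABLE external_table LOCATION ('Another_Extract.20130731.csv');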

One might say the complexity just moved from the external table definition to the query. With regular expression support, the answer is an emphatic “No”. If the columns in the text file had a “comma” separator, your query would look like this:

WITH my_csv_table
AS
(
    SELECT 'SOMESTRING,NEXTSTRING,THIRDSTRING' AS entire_row_as_col
    FROM DUAL
)
SELECT REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 1) col_one,
       REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 2) col_two,
       REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 3) col_three
FROM my_csv_table t

The above is a runnable sample you could try. For the external table we created, the same query runs directly against the table:

SELECT REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 1) col_one,
       REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 2) col_two,
       REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 3) col_three
FROM external_table t

The above example uses a comma as the separator and illustrates how easy it is to split strings using the regular expression substring function. The expression would be slightly different for other separators, but no less simple. The regular expression ‘[^,]+’ means: give me one or more characters (+) that are not (^) in the list (,), where the list is enclosed in square brackets ([ ]). The last parameter of REGEXP_SUBSTR above is the occurrence, which picks out the Nth match.
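
One caveat worth noting: because ‘[^,]+’ requires at least one character, empty fields are skipped and the values after them shift left. If a line can contain empty (NULL) columns, one workaround on 11g and later (a sketch, not part of the original example) is to anchor each field on its trailing separator and pull out the captured group via the sixth (subexpression) argument of REGEXP_SUBSTR:

SELECT REGEXP_SUBSTR('A,,C', '([^,]*)(,|$)', 1, 1, NULL, 1) col_one,
       REGEXP_SUBSTR('A,,C', '([^,]*)(,|$)', 1, 2, NULL, 1) col_two,    --Returns NULL for the empty field
       REGEXP_SUBSTR('A,,C', '([^,]*)(,|$)', 1, 3, NULL, 1) col_three
FROM DUAL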

The approach for fixed-width data (without Regular Expressions):

The approach is not much harder if the file is in a fixed-width format. Assuming, for simplicity, that each column in our file takes exactly 15 characters, our query would become:

WITH my_fixed_width_table
AS
(
--
--15-character column fixed width on all columns
--
    SELECT 'SOMESTRING     NEXTSTRING     THIRDSTRING    ' AS entire_row_as_col
    FROM DUAL
)
SELECT TRIM(SUBSTR(t.entire_row_as_col, 1, 15)) col_one,
       TRIM(SUBSTR(t.entire_row_as_col, 16, 15)) col_two,
       TRIM(SUBSTR(t.entire_row_as_col, 31, 15)) col_three
FROM my_fixed_width_table t

Again, the above is a runnable sample you could try. For the external table we created, the same query runs directly against the table:

SELECT TRIM(SUBSTR(t.entire_row_as_col, 1, 15)) col_one,
       TRIM(SUBSTR(t.entire_row_as_col, 16, 15)) col_two,
       TRIM(SUBSTR(t.entire_row_as_col, 31, 15)) col_three
FROM external_table t

Maybe we are only interested in a couple of columns, like a key column and a specific data column. All of a sudden, we have a jump start on querying the text file. The order of rows in the file is no longer an issue.

How to compare two sets of data?

Earlier, I wrote a post on how to compare two sets of data to find the differences. It is a quick read:

https://sqljana.wordpress.com/2016/12/22/oracle-or-sql-server-find-row-level-data-differences-using-minusexcept-100-level/
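
Applied to the external table approach here, the comparison itself is just a pair of MINUS queries. A sketch, assuming two external tables named file_a_ext and file_b_ext, each defined like the one above but pointing at its own file:

--Lines present in file A but missing (or different) in file B
SELECT entire_row_as_col FROM file_a_ext
MINUS
SELECT entire_row_as_col FROM file_b_ext;

--Lines present in file B but missing (or different) in file A
SELECT entire_row_as_col FROM file_b_ext
MINUS
SELECT entire_row_as_col FROM file_a_ext;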

How to do this in SQL Server?

SQL Server introduced external tables in SQL Server 2016. With some minor syntax changes, you should be able to do the same as in Oracle. If you are on an older version of SQL Server, OPENROWSET with the BULK option and a format file should work quite nicely.

Look like a “Data Magician”!

Did your boss just ask you to compare two fixed-width text files with a few million lines of text, based on a key (the first 10 characters) and the 10th column (which spans characters 24 to 39), to see which rows differ and produce a report? It is a piece of cake that should take you only a few minutes (a sketch follows the steps below).

  • Define two external tables with one column representing the complete row of text.
  • Next, define a view on top of the two external tables with a key column and the data column
  • Join the two views on the key and compare the data column
  • Export the results to produce the report
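
Here is a minimal sketch of those steps. The file names (FileA.txt, FileB.txt), table names and view names are all made up for illustration; the '~' terminator is simply a character that never appears in the data, so each whole line lands in the single wide column:

CREATE TABLE file_a_ext (entire_row_as_col VARCHAR2 (4000))
ORGANIZATION EXTERNAL
   (TYPE oracle_loader
     DEFAULT DIRECTORY data_dir
     ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE FIELDS TERMINATED BY '~')
     LOCATION ('FileA.txt')
   );

CREATE TABLE file_b_ext (entire_row_as_col VARCHAR2 (4000))
ORGANIZATION EXTERNAL
   (TYPE oracle_loader
     DEFAULT DIRECTORY data_dir
     ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE FIELDS TERMINATED BY '~')
     LOCATION ('FileB.txt')
   );

--One view per file exposing just the key (characters 1-10) and the data column (characters 24-39)
CREATE OR REPLACE VIEW file_a_vw AS
SELECT TRIM(SUBSTR(entire_row_as_col, 1, 10))  AS key_col,
       TRIM(SUBSTR(entire_row_as_col, 24, 16)) AS data_col
FROM file_a_ext;

CREATE OR REPLACE VIEW file_b_vw AS
SELECT TRIM(SUBSTR(entire_row_as_col, 1, 10))  AS key_col,
       TRIM(SUBSTR(entire_row_as_col, 24, 16)) AS data_col
FROM file_b_ext;

--The report: keys whose data differs, plus keys present on only one side
SELECT NVL(a.key_col, b.key_col) AS key_col,
       a.data_col AS file_a_value,
       b.data_col AS file_b_value
FROM file_a_vw a
FULL OUTER JOIN file_b_vw b
  ON a.key_col = b.key_col
WHERE a.key_col IS NULL
   OR b.key_col IS NULL
   OR DECODE(a.data_col, b.data_col, 0, 1) = 1;

Spool or export the output of the last query and the report is done.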

You will look like a data magician. I bet that your predecessor took a lot longer to produce this report!
