The goal of this post is to compare two text files that may have millions of lines and compare them to quickly find out which lines differ using Oracle EXTERNAL tables.
As an Oracle professional, “external tables” are something you may already be familiar with or maybe you have heard about them but not used them. Let us see how taking 10 minutes to read this can help you compare two 10 million line text files in 10 minutes or less using about about 10 lines of code. 😉
What is an external table?
It is Oracle’s way of making an OS file act like a table. How that table is structured is controlled by the definition of the external table which spells out how the file should be interpreted.
There are myriad uses for external tables. In fact, as an Oracle professional, one day soon, you cannot imagine life without them once you have used them.
Typically, you use them to load CSV or fixed width files into the Oracle database.
Steps to create an external table
The steps involved to create and use an external table are simple:
Create a DIRECTORY object
For the location where the files will reside, create an Oracle DIRECTORY object
CREATE OR REPLACE DIRECTORY data_dir AS '/data/oracle/db/mydb/utl/';
Create an EXTERNAL table
Create the external table against a data file in above DIRECTORY
CREATE TABLE external_table (entire_row_as_col VARCHAR2 (4000)) ORGANIZATION EXTERNAL (TYPE oracle_loader DEFAULT DIRECTORY data_dir ACCESS PARAMETERS ( RECORDS DELIMITED BY NEWLINE SKIP 1 --Don’t need the header row FIELDS TERMINATED BY '~' --Comma is default if unspecified ) LOCATION ('My_Extract.20130630.0.274.csv') );
Start querying your text file!
SELECT * FROM external_table;
That was slightly more than 10 lines but you see the point, right?
A nonconformist approach to external tables:
Warning: Please use this approach only for quick and temporary solutions.
The web is chock-full with examples on the conventional ways of using external tables. Those ways are mostly not flexible and sometimes they are time-consuming to setup.
Some of the disadvantages of the conventional approach are
- Filename and format cannot change once defined
- Time consuming to setup
- Dealing with errors is not straight forward
- Cannot refine meaning of column interpretation over time
There are other disadvantages but they are not discussed as they are not in the context of this discussion.
The approach for CSV data (using Regular Expressions):
The non-conformist approach is – each complete line in the file becomes a row in the external table with a single column. Essentially, the rows are very wide. If you did not notice, the above external table example did this by defining a single VARCHAR2(4000) column named “entire_row_as_col” to hold an entire line’s worth of data.
This immediately gives us a few advantages:
- Defining the external table is easy
- ..not to mention, fast!
- Any file could be boxed into this external table format (just need to replace filename)
- Interpretation, like data conversion, can be done as we please
One would say, the complexity just moved away from the external table definition to the query. The answer is an emphatic “No” with regular expression support. If the columns in the text file had a “comma” separator, your query would look like this:
WITH my_csv_table AS ( SELECT 'SOMESTRING,NEXTSTRING,THIRDSTRING' AS entire_row_as_col FROM DUAL ) SELECT REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 1) col_one, REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 2) col_two, REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 3) col_three FROM my_csv_table t
The above is a runnable sample you could try but for the external table we created, the query is simpler:
SELECT REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 1) col_one, REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 2) col_two, REGEXP_SUBSTR(t.entire_row_as_col, '[^,]+', 1, 3) col_three FROM external_table t
The above example has comma as a separator and illustrates how easy it was to split strings using the regular expression substring feature. The expression would be slightly different for other separators but nonetheless, simple. The above regular expression ‘[^,]+’ means – give me one or more characters (+) that are not (^) in list (,) where the list is enclosed in square brackets ([ ]). The last parameter of REGEXP_SUBSTR is to find the Nth occurrence.
The approach for fixed-width data (without Regular Expressions):
The approach is not too difficult if the file is a fixed width format file either. Assuming that our file is a fixed width file with each column taking exactly 15 characters (for simplicity), our query would become:
WITH my_fixed_width_table AS ( -- --15-character column fixed width on all columns -- SELECT 'SOMESTRING NEXTSTRING THIRDSTRING ' AS entire_row_as_col FROM DUAL ) SELECT TRIM(SUBSTR(t.entire_row_as_col, 1, 15)) col_one, TRIM(SUBSTR(t.entire_row_as_col, 16, 15)) col_two, TRIM(SUBSTR(t.entire_row_as_col, 31, 15)) col_three FROM my_fixed_width_table t
Again, the above is a runnable sample you could try but for the external table we created, the query is simpler:
SELECT TRIM(SUBSTR(t.entire_row_as_col, 1, 15)) col_one, TRIM(SUBSTR(t.entire_row_as_col, 16, 15)) col_two, TRIM(SUBSTR(t.entire_row_as_col, 31, 15)) col_three FROM external_table t
Maybe, we are only interested in a couple of columns like a key column and a specific data column. Now, all of a sudden, we got a jump start at querying the text file. The order or rows in the file is no longer an issue.
How to compare two sets of data?
Earlier, I wrote a post on how to compare two sets of data to find the differences. It is a quick read
How to do this in SQL Server?
SQL Server introduced EXTERNAL TABLE in SQL 2016. With some minor syntax changes, you should be able to do the same as Oracle. If you are on an older version of SQL Server, OPENROWSET with a FORMAT file option should work quite nicely.
Look like a “Data Magician”!
Did your boss just ask you to compare two fixed-width text files with a few million lines of text based on a key (the first 10 characters) and the 10th column which spans from character 24 to 39 to see which rows have a difference and produce a report? It is a piece of cake that should take you only a few minutes.
- Define two external tables with one column representing the complete row of text.
- Next, define a view on top of the two external tables with a key column and the data column
- Join the two views on the key and compare the data column
- Export the results to produce the report
You will look like a data magician. I bet that your predecessor took a lot longer to produce this report!