Wednesday, August 19, 2015

Merging data into an MS Access table

One of my favorite commands in T-SQL is the MERGE statement, which lets you merge a table expression into a table, INSERTing records when they don't exist and UPDATEing them when they do, all in a single command. Unfortunately, it's a little verbose, since you still have to write out the full INSERT and UPDATE clauses that you would have used in place of the MERGE statement.
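For illustration, here's a minimal MERGE (the table and column names are hypothetical) showing how the INSERT and UPDATE branches still have to be spelled out in full:

```sql
MERGE INTO Target AS t
USING (SELECT id, c1, c2 FROM Staging) AS s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.c1 = s.c1, t.c2 = s.c2
WHEN NOT MATCHED THEN
    INSERT (id, c1, c2) VALUES (s.id, s.c1, s.c2);
```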

If you're using MS Access, you don't have MERGE. If you want to merge data into an MS Access table, you need to write the INSERT and UPDATE statements separately, and you'll have to qualify your INSERT statement in some way to prevent INSERTing keys that already exist. There are two alternatives that have some big drawbacks:

  1. You could run a DELETE beforehand to get rid of the keys that you're inserting; then you would just have to write the INSERT statement. This will cause problems if you have defined relationships on the table you're merging into: the DELETE will either fail or cascade in an inappropriate way.
  2. You could INSERT just the keys that are missing and then run the UPDATE statement to fill in the rest of the columns. This requires you to define all the non-key columns as nullable. Nullable columns cause all kinds of issues, so I prefer to avoid them unless NULLs are truly required.
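Sketched in Access SQL against a hypothetical Target/Source pair keyed on id, the two alternatives look like this:

```sql
-- Alternative 1: DELETE the incoming keys, then INSERT everything.
DELETE FROM Target
WHERE id IN (SELECT id FROM Source);

INSERT INTO Target (id, c1, c2)
SELECT id, c1, c2 FROM Source;

-- Alternative 2: INSERT only the missing keys, then UPDATE the
-- non-key columns (which must therefore be nullable).
INSERT INTO Target (id)
SELECT s.id
FROM Source AS s LEFT JOIN Target AS t ON s.id = t.id
WHERE t.id IS NULL;

UPDATE Target INNER JOIN Source ON Target.id = Source.id
SET Target.c1 = Source.c1, Target.c2 = Source.c2;
```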

I don't think there's a pure SQL solution to this problem, but there is a solution if you use ADO to merge the records. ADO Recordsets support a Seek operation, which will efficiently locate a record when you provide a key. Merging can be achieved by using Seek to search for each key you are merging, and if the key is not found, adding a new record.

I've written up an implementation of this technique in Python. Before you can use Seek, you must set the Index property to the name of the unique index you are seeking on, and you must use a server-side cursor. By default, MS Access names a table's primary key "PrimaryKey", so this code assumes the index you are using has that name.

import logging, os.path
from win32com.client.gencache import EnsureDispatch
from win32com.client import constants as c

log = logging.getLogger(__name__)

class AccessDb(object):
    def __init__(self, dbpath):
        self.dbpath = os.path.abspath(dbpath)

    def __enter__(self):
        self.db = EnsureDispatch('ADODB.Connection')
        self.db.Open('Provider=Microsoft.ACE.OLEDB.12.0;'
                     'Data source=%s' % self.dbpath)
        return self

    def __exit__(self, etyp, einst, etb):
        self.db.Close()

    def bulkload(self, table, data, key=None):
        """Bulk merge data into a table.

        Args:
            table (str): Name of table where data is to be loaded.
            data: Data rows to be merged in (sequence of sequences).
            key (int sequence): List of indexes of key columns. If not
                provided, all rows will be appended.
        """
        rs = EnsureDispatch('ADODB.Recordset')
        if key:
            # Seek requires naming the index to search on and using a
            # server-side cursor.
            rs.Index = 'PrimaryKey'
            rs.CursorLocation = c.adUseServer

        # Seek also requires the recordset to be opened with
        # adCmdTableDirect.
        rs.Open(table, self.db, c.adOpenKeyset, c.adLockOptimistic,
                c.adCmdTableDirect)
        try:
            inserted = updated = 0
            for record in data:
                if key:
                    # Position on the existing record; if the key is
                    # not found, start a new record instead.
                    rs.Seek([record[i] for i in key], c.adSeekFirstEQ)
                    if rs.EOF:
                        rs.AddNew()
                else:
                    rs.AddNew()

                for index, field in enumerate(record):
                    rs.Fields(index).Value = field

                if rs.EditMode == c.adEditAdd:
                    inserted += 1
                else:
                    updated += 1
                rs.Update()
        finally:
            rs.Close()
        log.info('%s: %d records inserted, %d updated',
                 table, inserted, updated)

Example usage:

# TestTable is defined as:
#     (id int PRIMARY KEY,
#      c1 VARCHAR(5),
#      c2 VARCHAR(5));
with AccessDb('test.accdb') as db:
    db.bulkload('TestTable',
                [(1, 'a', 'b'),
                 (2, 'c', 'd'),
                 (3, 'e', 'f')], key=[0])

Wednesday, August 5, 2015

Providing a parameterized list to an ad-hoc SQL query

MS SQL Server, like all SQL RDBMSes, is optimized to deal with sets of data and performs poorly when fed input one value at a time. Often you're in a situation where you need to provide a parameterized query with a list of values as input. Perversely, no matter what language you're using, most APIs only let you parameterize atomic values. You're left with a choice between dynamic SQL, which is a security risk, or fetching records from the database one at a time, which carries a massive performance penalty.

Stored procedures allow table-valued parameters, but you have to define the schema of the table in a separate database object, and only a few connectivity APIs support calling such stored procedures anyway. Another method is writing a user-defined function that splits a delimited string on commas, which performs poorly if you write it in T-SQL; the alternative is the CLR route, with all of its associated deployment issues.
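To show what the table-valued parameter route entails, here's a hedged sketch (the type and procedure names are hypothetical; table1 and its columns are from the example below):

```sql
-- The table's schema must live in a separate type object.
CREATE TYPE dbo.CriteriaList AS TABLE
    (field1 varchar(12), field2 int);
GO
CREATE PROCEDURE dbo.GetMatches @criteria dbo.CriteriaList READONLY
AS
SELECT c.field1, c.field2, t.field3
FROM @criteria AS c
JOIN table1 AS t
    ON t.field1 = c.field1 AND t.field2 = c.field2;
```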

One alternative I've found is providing the criteria in an XML document. SQL Server has very good XML support that is often overlooked, but is useful in many situations like this.

DECLARE @xml xml = N'
    <row field1="abcd" field2="261" />
    <row field1="dfke" field2="378" />
    <row field1="cirp" field2="381" />
    <row field1="vkdp" field2="381" />
    <row field1="boer" field2="381" />
    <row field1="npmr" field2="2638" />
    <row field1="qpci" field2="2638" />
    <row field1="biep" field2="2638" />
';

SELECT q.field1, q.field2, table1.field3
FROM (
    SELECT
        n.value('@field1', 'varchar(12)') AS field1,
        n.value('@field2', 'int') AS field2
    FROM @xml.nodes('//row') x(n)) q
LEFT JOIN table1
ON q.field1 = table1.field1 AND
    q.field2 = table1.field2;

Since the XML document is provided to the query in a string, it's parameterizable. It's completely flexible in the kind of data that you can provide; the schema is defined and validated when you shred the XML using T-SQL's XML methods.
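On the client side, you can build that string safely with any XML library, which also takes care of escaping attribute values. A minimal sketch in Python using the standard library (the helper name is mine):

```python
# Serialize rows of criteria into a fragment of <row ... /> elements,
# matching the shape the query above shreds with @xml.nodes('//row').
import xml.etree.ElementTree as ET

def criteria_xml(rows, fields):
    """Build <row/> elements from rows, with XML escaping handled."""
    root = ET.Element('rows')
    for row in rows:
        # One attribute per field; ElementTree escapes the values.
        ET.SubElement(root, 'row',
                      {f: str(v) for f, v in zip(fields, row)})
    return ''.join(ET.tostring(r, encoding='unicode') for r in root)

doc = criteria_xml([('abcd', 261), ('dfke', 378)],
                   ['field1', 'field2'])
# doc is now an ordinary string, ready to bind as a query parameter.
```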

XML is often associated with poor performance, but those concerns are overblown here: they matter for documents that run to tens of thousands of lines, not for the few hundred rows of input you'd send into a parameterized query. If you are providing a lot of data, though, you can shred the XML into a temporary table first so that the join can use indexes and a primary key.
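That last point can be sketched like this (the temp-table and index names are illustrative):

```sql
-- Shred the XML once into a temp table...
SELECT n.value('@field1', 'varchar(12)') AS field1,
       n.value('@field2', 'int') AS field2
INTO #criteria
FROM @xml.nodes('//row') x(n);

-- ...index it so the join can seek instead of scan...
CREATE UNIQUE CLUSTERED INDEX ux_criteria
    ON #criteria (field1, field2);

-- ...then join as before.
SELECT c.field1, c.field2, table1.field3
FROM #criteria AS c
LEFT JOIN table1
    ON c.field1 = table1.field1 AND c.field2 = table1.field2;
```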