Wednesday, October 28, 2015

Concatenating in T-SQL without abusing XML

Sometimes when you're writing an aggregate query in T-SQL you want to concatenate strings in a group together, instead of applying an operation like min, max, or count. The common suggestion is to use FOR XML PATH in a magical way that concatenates for a reason that no one understands (unless you actually take the time to learn the T-SQL XML functionality, which I highly recommend).

I'm going to demonstrate an alternate approach that involves the use of a PIVOT query. In order to use this method, you have to define the maximum number of values that will be combined. This may or may not be a drawback. (In my experience, it's actually a useful sanity check.) If you do it this way, you have a lot more flexibility in join delimiters and the syntax is more straightforward.

-- Prepare example tables. @result is an ordinary data table, and @msg is a
-- related table that stores comments for the records in @result.
    (record_id int IDENTITY(1, 1) PRIMARY KEY,
    col1 int, col2 int);
    (msg_id int IDENTITY(1, 1) PRIMARY KEY,
    record_id int NOT NULL,
    msg nvarchar(300) NOT NULL);
INSERT INTO @result (col1, col2)
VALUES (1, 2), (2, 4), (3, 6), (4, 8);
INSERT INTO @msg (record_id, msg)
    (2, N'first message'),
    (2, N'second message'),
    (4, N'another message');
-- Use pivoting to concatenate messages for the records together into a
-- single column. A CTE is used to create a temporary result set that joins
-- @result and @msg and includes a column that indicates the message
-- sequence. A pivot query flattens the result set by concatenating the
-- messages together.
    (SELECT [@result].record_id, [@result].col1, [@result].col2,
        'Msg' + CONVERT(varchar, ROW_NUMBER() OVER
            (PARTITION BY [@msg].record_id
            ORDER BY [@msg].msg_id)) AS rn
    FROM @result
    LEFT JOIN @msg
    ON [@result].record_id = [@msg].record_id)
SELECT record_id, col1, col2,
    COALESCE(Msg1, '') +
    COALESCE(', ' + Msg2, '') +
    COALESCE(', ' + Msg3, '') AS msgs
PIVOT (MAX(msg) FOR rn IN (Msg1, Msg2, Msg3)) p;
-- Results:
-- +-----------+------+------+-------------------------------+
-- | record_id | col1 | col2 | msgs                          |
-- +-----------+------+------+-------------------------------+
-- | 1         | 1    | 2    |                               |
-- | 2         | 2    | 4    | first message, second message |
-- | 3         | 3    | 6    |                               |
-- | 4         | 4    | 8    | another message               |
-- +---------------------------------------------------------+

Wednesday, August 19, 2015

Merging data into an MS Access table

One of my favorite commands in T-SQL is the MERGE statement, which allows you to merge a table expression into a table, INSERTing records if they don't exist, or UPDATEing records when they do in a single command. Unfortunately, it's a little verbose, since you still have to write the full INSERT and UPDATE statements that you would be using instead of the MERGE statement.

If you're using MS Access, you don't have MERGE. If you want to merge data into an MS Access table, you need to write the INSERT and UPDATE statements separately, and you'll have to qualify your INSERT statement in some way to prevent INSERTing keys that already exist. There are two alternatives that have some big drawbacks:

  1. You could run a DELETE beforehand to get rid of the keys that you're inserting; then you would just have to write the INSERT statement. This will cause problems if you have defined relationships on the table you're merging into: the DELETE will either fail or cascade in an inappropriate way.
  2. You could INSERT just the keys that are missing and then run the UPDATE statement to fill in the rest of the columns. This requires you to define all the non-key columns as nullable. Nullable columns cause all kinds of issues, so I prefer to avoid them unless NULLs are truly required.

I don't think there's a pure SQL solution to this problem, but there is a solution if you use ADO to merge the records. ADO Recordsets support a Seek operation, which will efficiently locate a record when you provide a key. Merging can be achieved by using Seek to search for each key you are merging, and if the key is not found, adding a new record.

I've written up an implementation of this technique in Python. Before you can use Seek, you must specify the unique index name you are using in the Index property, and you must use a server-side cursor. By default, MS Access will name a table's primary key "PrimaryKey", so this code assumes that the key you are using has this name.

import logging, os.path
from win32com.client.gencache import EnsureDispatch
from win32com.client import constants as c

log = logging.getLogger(__name__)

class AccessDb(object):
    def __init__(self, dbpath):
        self.dbpath = os.path.abspath(dbpath)

    def __enter__(self):
        self.db = EnsureDispatch('ADODB.Connection')
                     'Data source=%s' % self.dbpath)
        return self

    def __exit__(self, etyp, einst, etb):

    def bulkload(self, table, data, key=None):
        """Bulk merge data into a table.

            table (str): Name of table where data is to be loaded.
            data: Data rows to be merged in (sequence of sequences).
            key (int sequence): List of indexes of key columns. If not
                provided, all rows will be appended.
        rs = EnsureDispatch('ADODB.Recordset')
        if key:
            rs.Index = 'PrimaryKey'
            rs.CursorLocation = c.adUseServer

        rs.Open(table, self.db, c.adOpenKeyset, c.adLockOptimistic,
            for record in data:
                if key:
                    rs.Seek([record[i] for i in key], c.adSeekFirstEQ)
                    if rs.EOF:

                for index, field in enumerate(record):
                    rs.Fields(index).Value = field


            if rs.EditMode == c.adEditAdd:


Example usage:

# TestTable is defined as:
#     (id int PRIMARY KEY,
#      c1 VARCHAR(5),
#      c2 VARCHAR(5));
with AccessDb('test.accdb') as db:
                [(1, 'a', 'b'),
                 (2, 'c', 'd'),
                 (3, 'e', 'f')], key=[0])

Wednesday, August 5, 2015

Providing a parameterized list to an ad-hoc SQL query

MS SQL Server, like all SQL RDBMSes, is optimized to deal with sets of data and performs poorly when feed input one value at a time. Often you are in a situation where you need to provide a parameterized query with a list of values as input. Perversely, no matter what language you're using, most APIs only let you parameterize on atomic values. You're left with a choice between using dynamic SQL, which is a security risk, or fetching records from the database one at a time, which has a massive performance penalty.

Stored procedures allow table parameters, but you have to define the schema of the table in a separate database object, and only a few connectivity APIs support calling such stored procedures, anyway. Another method is writing a user defined function to "split" on commas, which will either have poor performance if you write it in T-SQL, or you can go the CLR route and deal with all of its associated issues.

One alternative I've found is providing the criteria in an XML document. SQL Server has very good XML support that is often overlooked, but is useful in many situations like this.

DECLARE @xml xml = N'
    <row field1="abcd" field2="261" />
    <row field1="dfke" field2="378" />
    <row field1="cirp" field2="381" />
    <row field1="vkdp" field2="381" />
    <row field1="boer" field2="381" />
    <row field1="npmr" field2="2638" />
    <row field1="qpci" field2="2638" />
    <row field1="biep" field2="2638" />

SELECT q.field1, q.field2, table1.field3
        n.value('@field1', 'varchar(12)') AS field1,
        n.value('@field2', 'int') AS field2
    FROM @xml.nodes('//row') x(n)) q
LEFT JOIN table1
ON q.field1 = table1.field1 AND
    q.field2 = table1.field2;

Since the XML document is provided to the query in a string, it's parameterizable. It's completely flexible in the kind of data that you can provide; the schema is defined and validated when you shred the XML using T-SQL's XML methods.

XML is often associated with poor performance, but these concerns are overblown unless you're dealing with documents that are tens of thousands of lines or longer, not when you're sending a few hundred lines of input into a parameterized query. But if you are providing a lot of data, you might want to shred the XML into a temporary table so that you can use indexes and a primary key.

Saturday, July 4, 2015

Preserving the current working directory while impersonating with runas

It's pretty common knowledge that you can use the netonly switch of runas to run a program while impersonating a domain user even if the computer you're on is not joined to the domain. E.g.:

runas /netonly /u:domain\user notepad.exe

The command will use your credentials to run the program, but the domain user's credentials when authenticating over the network. It's really handy.

Recently I tried to do this with a program that is sensitive to the current working directory. Unfortunately, runas will set the working directory to %WINDIR%\System32 when impersonating. I discovered that you can work around this issue with the following command line.

runas /netonly /u:domain\user "cmd /c \"cd \"%CD%\" ^& program"

There are few details involved in this command:

  1. The %CD% environment variable expands to the current working directory. Note that this variable expansion takes place before the entire command begins to execute, so it's expanded to the current working directory of the runas command.
  2. To escape quotes in the runas command parameter, you use \". This is only unusual in the sense that all Windows command line programs escape special characters in a different way, and runas is unique in having an ordinary way of escaping quotes.
  3. To run multiple commands in one line, you use an ampersand (i.e. command1 & command2). Since we're passing this as an argument to cmd /c, we need to escape the &. cmd uses the ^ (the caret) to escape ampersands and other special characters.

P.S. Knowing about %CD% is very useful if you use the command line in Windows a lot. For example, if you want to copy the current directory to the clipboard, you can use echo %CD% | clip.

Unrelated but also useful: if you want to copy the full path of some files in the current directory, you can use the command dir /b /s *.txt | clip.

Thursday, July 2, 2015

Removing nondigits from a column in a table using T-SQL

While T-SQL admittedly isn't the greatest language there is for text processing, you shouldn't dismiss it out of hand for simple tasks.

To remove non-digits from a column, you can use PATINDEX and STUFF in a loop. If you're careful to do this in a set-based way, T-SQL should be efficient enough in most situations.

    WITH q AS
        (SELECT table_id, PATINDEX('%[^0-9.]%', column) AS n
        FROM table)
    UPDATE table
    SET column = STUFF(column, q.n, 1, '')
    FROM q
    WHERE table_id = q.table_id AND q.n != 0;


This code should be easy enough to follow. It removes the first nondigit in column in each row of the table. Once it goes through an iteration without finding any nondigits, it exits the loop. This will iterate the maximum number of nondigits in a single field in the table, but it's still inefficient because it runs PATINDEX against each row in the table at each iteration.

To solve that problem, we can maintain a column that specifies whether the column has been scrubbed yet. In my example, "NULL" represents unknown and should be the default value for this field.

DECLARE @done bit = 0;
WHILE @done = 0 BEGIN
    WITH q AS
        (SELECT table_id, PATINDEX('%[^0-9.]%', column) AS n
        FROM table
        WHERE COALESCE(column_scrubbed, 0) = 0)
    UPDATE table
    SET column = STUFF(column, q.n, 1, N''),
        column_scrubbed = 0
    FROM q
    WHERE table_id = q.table_id AND q.n != 0;

    IF @@ROWCOUNT = 0 SET @done = 1;

    -- if column_scrubbed is still NULL, then the PATINDEX
    -- must have given 0
    UPDATE table
    SET column_scrubbed = CASE
        WHEN column_scrubbed IS NULL THEN 1
        ELSE NULLIF(column_scrubbed, 0)

It's a sloppy design to add a column to the schema just to make this operation more efficient. The best approach, in my opinion, is to store intermediate results in a table-valued variable and apply it to the actual table at the end.

-- Find nondigit character in a string
DECLARE @pattern varchar(30) = '%[^0-9.]%';

-- Maps [table_id]s to scrubbed values
    (table_id int PRIMARY KEY,
    column nvarchar(200) NOT NULL,
    scrubbed bit NOT NULL);

-- Initialize current value of [column] for each [table_id]
INSERT INTO @t (table_id, column, scrubbed)
SELECT table_id, column, 0 FROM table;

-- Iterate to remove non-digits from @t.[column], ending when [scrubbed]
-- is set for all rows.
    (SELECT table_id, PATINDEX(@pattern, column) AS n
    FROM @t
    WHERE scrubbed = 0) q
ON [@t].table_id = q.table_id
UPDATE SET column = STUFF(column, q.n, 1, N'')
UPDATE SET scrubbed = 1;

-- Apply the results to permanent table
UPDATE table
SET column =
    (SELECT [@t].column FROM @t
    WHERE [@t].table_id = table.table_id);

Monday, June 1, 2015

Working with time values in Google Sheets Apps Script

Google Sheets stores time internally relative to an epoch of 30 Dec 1899 00:00:00 using the time zone set in the spreadsheet, which is retrievable using getSpreadsheetTimeZone.

Google Apps Script uses JavaScript, which supports datetime values using either local time (with no timezone info attached), or datetimes relative to the Unix epoch 1 Jan 1970 00:00:00+0000.

Working with these two time formats using Google Apps Script code is, predictably, a huge hassle. It doesn't help that the JavaScript Date API is carelessly designed.

If you're working with dates or datetimes (values that have both a date and a time) in Google Sheets, you just create the appropriate JavaScript Date object in local time and use it in a call to setValue(s). The problem is when you're trying to work with cells that contain times only, without a date component.

The first thing you need to do when working with time is get the Google Sheets epoch in JavaScript Date (localtime) format.

function getEpoch() {
    return new Date(Utilities.formatDate(
        new Date('30 Dec 1899 00:00:00'),
        'd MMM yyyy HH:mm:ss Z'));

Then to create a time in Google Sheets format, you would get the epoch using this function and add the appropriate number of milliseconds to get to the time you want.

function makeTime(hour, minute, second) {
    return new Date(getEpoch().getTime() +
        (hour * 3600 + minute * 60 + second) * 1000);

To convert it the other way, you need to get the millisecond offset from midnight and add it to an arbitrary date, because JavaScript Dates must included a date component. I'm using the Unix epoch in local time.

function getTime(v) {
    var u = new Date(v.getTime() - getEpoch().getTime());
    return new Date(1970, 0, 1, u.getUTCHours(), u.getUTCMinutes(),
                    u.getUTCSeconds(), u.getUTCMilliseconds());

Monday, May 25, 2015

Simple profiling in VBA

When you are optimizing for performance, the best approach is concentrate on the code where the most time is being spent. Even if you have a lot of experience, it's usually nearly impossible to guess where these hotspots are, and there's no reason to even attempt that when it's so easy to determine exactly what needs to be optimized by profiling your code.

VBA, of course, doesn't have a built in profiler. But it's really simple to roll your own, since fundamentally the only thing a profiler does is store the elapsed time between events in a data structure.

The simple implementation I made uses a static dictionary variable to store all its data. This is mostly for performance reasons: lookups in a dictionary are very fast. You don't want any part of the profiling code to run slowly, because that would greatly distort your results. You could also do this with a module-level variable, but I felt like having most of the implementation in a single function simplified the API and didn't sacrifice too much readability.

To use this profiler, add a call to Profile paFunctionEnter, "FunctionNameHere" to the start of the functions and subs that you want to measure, and a call to Profile paFunctionExit to all exit points. Since you have to mark the start and end points yourself, you will probably want to avoid adding profiling to trivial functions, which will just clutter up your output. You also might want to add profiling in the middle of a function; I've adopted the convention of prepending an ampersand to internal paFunctionEnters to distinguish them from actual functions in the output.

Enum ProfileAction
End Enum

Enum ProfileFormat
    pfSelfTime = 0
    pfTotalTime = 1
    pfCallCount = 2
End Enum

Public Function Profile(a As ProfileAction, ParamArray args())
    Static sProf As Object
    Dim stack As String
    Dim elapsed As Single
    Dim calls As Variant
    If a <> paStart And sProf Is Nothing Then Exit Function
    If a = paStart Then
        Set sProf = CreateObject("Scripting.Dictionary")
        stack = args(0)
        sProf("_stack") = stack
        sProf("_t") = Timer
        sProf("_calls:" & stack) = 1
        Exit Function
    ElseIf a = paEnd Then
        Set sProf = Nothing
        Exit Function
    End If
    elapsed = Timer - sProf("_t")
    stack = sProf("_stack")
    Select Case a
        Case paFunctionEnter
            sProf(stack) = sProf(stack) + elapsed
            sProf("_stack") = stack & "/" & args(0)
            sProf("_parent:" & sProf("_stack")) = stack
            stack = sProf("_stack")
            calls = sProf("_calls:" & stack)
            If IsNull(calls) Then
                sProf("_calls:" & stack) = 1
                sProf("_calls:" & stack) = calls + 1
            End If
        Case paFunctionExit
            sProf(stack) = sProf(stack) + elapsed
            stack = sProf("_parent:" & stack)
            sProf("_stack") = stack
        Case paShowResults
            ProfileResults sProf, CLng(args(0))
    End Select
    sProf("_t") = Timer
End Function

Private Sub ProfileResults(prof As Object, fmt As ProfileFormat)
    Dim k As Variant
    Dim k2 As Variant
    Dim fmt_label As String
    For Each k In prof.Keys
        If Left(k, 1) <> "_" Then
            prof("_self:" & k) = prof(k)
            prof("_total:" & k) = 0!
            For Each k2 In prof.Keys
                If Left(k2, Len(k)) = k Then
                    prof("_total:" & k) = prof("_total:" & k) + prof(k2)
                End If
        End If
    fmt_label = Array("_self:", "_total:", "_calls:")(fmt)

    For Each k In prof.Keys
        If Left(k, Len(fmt_label)) = fmt_label Then
            Debug.Print Mid(k, Len(fmt_label) + 1) & ": " & prof(k)
        End If
End Sub

Example usage:

' Add calls to Profile paFunctionEnter and Profile paFunctionExit
' to the functions and subs that you are measuring.
Public Sub ProfilingExample()
    Dim x As Variant
    Profile paStart, "Test"
    x = SlowFunction(arg1, arg2, arg3)
    Profile paFunctionExit ' exit top level "Test" pseudofunction
    ' Profile paShowResults, pfCallCount
    ' Profile paShowResults, pfSelfTime
    Profile paShowResults, pfTotalTime
    Profile paEnd
End Sub

Here's what the output looks like. Since I called paShowResults with pfTotalTime, the values are the total seconds spent in each function. You can also get the number of calls and the "self" time, which doesn't include the time spent in sub-calls that are also profiled.

Test: 1.382813
Test/Filtered: 1.382813
Test/Filtered/GetScratchWorksheet: 0
Test/Filtered/CopyValues: 1.179688
Test/Filtered/CopyValues/FirstRow: 0
Test/Filtered/CopyValues/FirstCol: 0
Test/Filtered/CopyValues/SqueezeRange: 1.179688
Test/Filtered/CopyValues/SqueezeRange/FirstRow: 0
Test/Filtered/CopyValues/SqueezeRange/FirstCol: 0
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper: 1.179688
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea: 1.148438
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/LastCol: 0
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&RowCollision: 0.6015625
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&RowCollision/LastCol: 0.234375
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&RowCollision/LastRow: 0.2578125
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/LastRow: 0
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&ColumnCollision: 0.515625
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&ColumnCollision/LastRow: 0.125
Test/Filtered/CopyValues/SqueezeRange/SqueezeRangeHelper/&ShiftArea/&ColumnCollision/LastCol: 0.203125
Test/Filtered/CopyValues/LastRow: 0
Test/Filtered/CopyValues/LastCol: 0

Thursday, May 14, 2015

Convert Excel column number to letters and vice versa

Since I haven't written on here for a while, I thought I'd put up a quick post with some code to convert an Excel column letter reference to a number or the reverse: convert column letters to a number. This is a good candidate for your Excel utility module.

This function uses recursion, but there isn't any risk of stack overflow because there will at most log26(n) recursions, where n is the column number. It also uses the relatively unknown VBA \ operator, which does integer division (rounding down). Input validation is left as an exercise. I don't do any because I only ever call this function with constant values, which are pretty hard to get wrong.

' converts column letter to number or vice versa
Public Function Col(c As Variant) As Variant
    Dim x As Variant
    If VarType(c) = vbLong Or VarType(c) = vbInteger Then
        x = Chr(Asc("A") + (c - 1) Mod 26)
        If c > 26 Then
            Col = Col((c - 1) \ 26) & x
            Col = x
        End If
    ElseIf VarType(c) = vbString Then
        x = Asc(Right(c, 1)) - Asc("A") + 1
        If Len(c) > 1 Then
            Col = x + Col(Left(c, Len(c) - 1)) * 26
            Col = x
        End If
        Col = Null
    End If
End Function

Excel versions 2007 and later support 16,384 columns, so a good test is with column 16,384, which you can verify is XFD by opening a new Workbook and pressing Ctrl+Right.

' Executed in the immediate window


Friday, May 8, 2015

Quickly set an Excel range to a literal array of values

You might know already that the best way to set the values of a Range in Excel is use a Variant array containing the desired values, keeping in mind that the most efficient way to write VBA code is to minimize calls to the Excel API, so, of course, setting the cells one at a time is completely out of the question.

' Method one (BAD): set the cells one at a time.
Range("A1") = "a"
Range("B1") = "b"
Range("C1") = "c"
Range("A2") = 1
Range("B2") = 2
Range("C2") = 3

' Method two (BETTER): set the cells using a Variant array
Dim x(1 To 2, 1 To 3)
x(1, 1) = "a"
x(1, 2) = "b"
x(1, 3) = "c"
x(2, 1) = 1
x(2, 2) = 2
x(2, 3) = 3
Range("A1:C2") = x

Since there (seems to be) no way to create a literal 2D array in VBA, you might think that this slightly verbose style is the only option. But there is an alternative to do the same thing, which takes advantage of Excel array constant notation:

[A1:C2] = [{"a", "b", "c"; 1, 2, 3}]

You can use a similar technique to set formulas for several columns in a single API call.

[OFFSET(D2:E2,0,0,COUNTA(A:A)-1)] = Array( _
    "=VLOOKUP($A2,Sheet2:$A:$C,2,FALSE)", _

The offset formula creates a range starting at D2:E2 and extends the row count using the number of non-blank cells in column A. This won't work so well if each of your table columns may have blank cell gaps, so it's a bit more brittle than the previous trick, but is still a useful technique if you have control over the data that you're processing.

Thursday, May 7, 2015

Excel get autofilter row count

After having this question myself and doing a little bit of searching, I've found that there's a lot of misinformation out there, so I'm going to post a reliable way to get the number of visible records in an Excel autofiltered worksheet.

Public Function GetRecordCount(ws As Worksheet) As Long
    GetRecordCount = ws.AutoFilter.Range.Columns(1).SpecialCells( _
        xlCellTypeVisible).Count - 1
End Function

This is the most common incorrect solution to this problem I've come across:

' This code is incorrect.
row_count = filter.Range.SpecialCells(xlCellTypeVisible).Rows.Count

The Rows property does not work correctly on Ranges that have multiple Areas, such as a range returned from the SpecialCells method. The property only returns the rows in the first Area of the Range, so you will only get a row count of the first contiguous range of the filtered data. E.g., if rows 1, 2, 5, 6, 7, and 8 are visible, the result would be 2, since rows 1 and 2 comprise the first contiguous range.

The other approach that I've seen uses a for loop to count rows. As I've discussed before, in VBA you always want make as few calls to the Excel API as possible. You never want to operate on an Excel worksheet one row at a time when the API exposes a method to get the same result using a single call, because operating one row or cell at a time will kill your macro performance when you are processing a lot of data.

Friday, May 1, 2015

Excel changing data validation based on cell value

In Excel, it doesn't seem like you can use a different type of list validation based on changing values in another column. One common workaround is to just define completely different validation rules for different ranges in the same column. This is a mistake, because if you sort or copy the cells, the data validation does not travel with them and you end up with validation rules being applied to the wrong cells.

There is a trick, though. You need to use a formula to define the range that will give you the list values that the validation applies to. For example, if column B defines which validation rule to apply, and you have the list values for each validation rule in the lookup tab, with each column B value in the range A1:D1, and the possible cell values in a list below each B value, you could use a formula like this as the list validation formula, which you are applying to the entire column A.

Experienced Excel users might be questioning my use of the CHOOSE function here, and wondering why I didn't use OFFSET. Usage of OFFSET (and related functions like INDIRECT) should always be avoided if possible. Since Excel's calculation engine cannot determine the cells that an OFFSET formula depends on, every formula that uses OFFSET will always be recalculated when anything in the workbook changes. This can have a very negative impact on spreadsheet performance. However, if you have a large number of different validation rules, the OFFSET version might be preferable, and would look like this.

Tuesday, April 28, 2015

SSH tunneling trick: using localhost subdomains

Here's a little trick I just discovered which is useful if you do a lot of tunneling over ssh.

You can actually bind your local port tunnels to their own local address, instead of just using the default of localhost.

The first step is to define a localhost subdomain in /etc/hosts. (On Windows, this file is located at %windir%\System32\drivers\etc\hosts). Bind it to a local IP address other than (you have free reign over the last three numbers, as long as the IP starts with 127).

Next, create your ssh tunnel.

ssh -L 1

Now try opening in a browser on your local machine. The entire page will not load, because loads images and other assets from domains other than, but enough should load to let you know that the tunnel is working.

This technique is useful for remembering your ssh tunnels if you have a lot of them (like me). It's a lot easier to remember than it is to remember http://localhost:6103. It's also useful if you need to tunnel a program that is finicky about which port it connects to.

If you're having trouble getting this to work, make sure you're not running a local web server that has bound itself to port 80 on all of your 127.* addresses. If you are, either disable your web server, or try this experiment using a different service, such as RDP (port 3389) or SQL Server (port 1433).

1I'm actually using PuTTY, but have translated it to the regular SSH command for wider applicability. The same forwarding will work in PuTTY, even though it looks like it wouldn't because of the narrow local port text field in the settings.

Friday, April 24, 2015

AddToArray VBA function

I'm going to discuss a function to add an element to a VBA Variant array: AddToArray.

In VBA, arrays have a fixed size. If you've already sized an array, while it's possible to resize it to add new items, doing so requires a couple lines of code with complicated indexing operations.

' Add another element to x.
ReDim Preserve x(UBound(x) + 1)
x(UBound(x)) = "d"

Most people might not think that this is terrible in terms of complexity, but it's a common enough operation that I think it would be good to add a Sub to our utility module that adds an element to an array in a generic way.

It's worth noting that resizing an array in this manner has very bad performance if you are working with large arrays, especially compared to alternate data structures that are widely available in other languages. ReDim Preserve will allocate an entirely new array with the new size, copy the old array to the new array, and finally discard the old array. Not only does this make resizing an array an O(n) operation, doing it several times generates a lot of garbage. Because of this, if there's a way to pre-identify the necessary size of an array, it's always preferable to size it to the correct length beforehand instead of adding to it one element at a time.

VBA does offer an efficient, growable collection type (aptly named Collection), but this class has several shortcomings. It has a strange API. It's 1-indexed. It's impossible to modify items that have been stored in a collection without removing and re-adding them. It only can store Variants; there's no type safety. And it's non-standard. Although this isn't so much of an issue if you're chiefly an Excel developer, there is no support for Collection objects in VBScript or VB6, and because of this, standard COM and ActiveX objects will only deal with arrays. Even APIs in Excel in general prefer to use arrays: for example, if you want to assign cell values to multiple cells in a range at once, you'll be doing so with an array.

Here is the AddToArray Sub. I'm not going to go into too much discussion of the details here, as I'm still in the middle of my series on ArraySearch.

Public Sub AddToArray(ByRef av As Variant, v As Variant)
    Dim array_size As Long
    Dim err_info As Variant
    If IsArray(av) Then
        ' Normally I hate using error handlers for an ordinary flow of
        ' execution, but there's no other option in VBA to test for an
        ' uninitialized, empty Array when it gets passed to a Sub as a
        ' Variant.
        ' If the parameter had been declared with "av() As Variant",
        ' "IsArray(av) And Not av" would detect uninitialized arrays.
        ' but then only variables explicitly declared as arrays would be
        ' passable without causing a compile-time error.
        On Error Resume Next
        array_size = UBound(av)
        err_info = Array(Err.Number, Err.Source, Err.Description)
        On Error GoTo 0
        If err_info(0) = 9 Then
            ReDim av(0)
            av(0) = v
        ElseIf err_info(0) <> 0 Then
            ' Re-raise other errors
            Err.Raise err_info(0), err_info(1), err_info(2)
            ReDim Preserve av(LBound(av) To array_size + 1)
            av(UBound(av)) = v
        End If
    ElseIf IsEmpty(av) Then
        av = Array(v)
        av = Array(av, v)
    End If
End Sub

And here are the tests for AddToArray.

#If False Then
' basTemp @UnitTests[start]
Private Sub TestAddToArray()
    ' add Long to empty non-array Variant
    Dim x As Variant
    AddToArray x, 1&
    AssertType x, "Variant()"
    AssertEqual Join(x, ","), "1"
    AssertType x(0), "Long"
    ' add String to singleton array
    AddToArray x, "Hello"
    AssertEqual Join(x, ","), "1,Hello"
    ' test adding problematic variant types to arrays
    AddToArray x, CVErr(1)
    AssertEqual x(2), CVErr(1)
    AddToArray x, Null
    AssertIsNull x(3)
    AddToArray x, Array(3, 6)
    AssertEqual UBound(x(4)), 1&
    AssertEqual x(4)(1), 6
    ' Adding to fixed-length arrays is not supported.
    Dim y(2) As String
    AddToArray y, -1
    AssertError 10 ' This array is fixed or temporarily locked

    ' Test that adding to staticly typed arrays retains the type.
    Dim z() As String
    AddToArray z, "Hello"
    AssertType z, "String()"
    AssertType z(0), "String"
    ' Adding to a non-array value should convert the value to a
    ' Variant array containing the original value and new value.
    Dim a As Variant
    a = "Non-array"
    AddToArray a, 3
    AssertType a, "Variant()"
    AssertEqual UBound(a), 1&
    AssertEqual a(0), "Non-array"
    AssertEqual a(1), 3
    ' Multidimensional arrays
    Dim b() As Variant
    ReDim b(1, 1)
    AddToArray b, 3
    AssertError 9 ' Subscript out of range
End Sub
' basTemp @UnitTests[end]
#End If

Thursday, April 23, 2015

In defense of ScreenUpdating = True

Most people who write a lot of Excel VBA code settle into a pattern of always including this line of code at the beginning of every macro of any significance.

Application.ScreenUpdating = False

This will cause Excel to stop painting the screen, which means that the Excel window will not repaint for a few seconds or minutes while the macro runs.

If you were to ask a VBA programmer why she or he does this, the response you would get would likely be along the lines of "it makes the macro run faster", or possibly "the screen jumps around too much if you let it paint". I don't think either of these reasons holds water, and to get to the real reason, you have to think back to what things were like in the early days of VBA.

VBA has been around for a very, very long time. The first version of Excel to have something resembling modern VBA was Excel 5.0, in 1993. 20 years ago, computer hardware was very different than what it is now. Even with the best optimization, most computers could not play sophisticated games or video in a window larger than a postage stamp. Windows was different back then, too: the video code prioritized flexibility and abstraction at the expense of performance. Combine these two factors, and it's easy to see why ScreenUpdating = False came into vogue: in 1993, repainting the screen was indeed an enormous performance hit, even if your macro wasn't making changes that caused big repaints.

Today, on the other hand, is very different. And I think there are good reasons to let the screen repaint: your macro won't look like it has crashed, your users will be able to see the macro doing its work, and, as a programmer, watching progress of your code on the screen can be an invaluable tool for finding out if something is going wrong or if an improvement can be made.

If you actually are hitting a performance bottleneck by leaving screen repaints on, 99% of the time, it's good indication that your macro is doing something inefficiently. Every action you take in Excel should affect as many cells as possible. You should avoid doing little changes "one at a time" and concentrate on making just a few big changes all at once.

For example, instead of setting the fill color of a column one cell at a time in a loop, get a Range object for all the cells that you need to change and set the fill color for the range, just once.

Or, instead of deleting one row at a time when the cell in a certain column matches a certain value, sort the table so that the rows to be deleted are all together, and delete them all at once. That will convert hundreds or even thousands of steps into just two. You will notice a huge speed boost, even if you compare the old macro with screen updating off to the new macro with it on.

In my opinion, the only reason to use ScreenUpdating = False is when you are manipulating a sheet with a lot of embedded shapes and charts and your macro is causing a distracting flicker. In every other situation, if you are running into a performance problem, take a careful look at your code and make sure that you're not doing something in a stupidly inefficient way.

Monday, April 20, 2015

ArraySearch for VBA: Part 2

Last time, we implemented a simple VBA function to search an array for a value, returning a boolean whether it was found or not, and the index where it was found in an optional parameter.
Public Function ArraySearch(a As Variant, x As Variant, _
                            Optional ByRef found_index As Variant) As Boolean
    Dim i As Long
    Dim result As Boolean
    result = False
    For i = LBound(a) To UBound(a)
        If a(i) = x Then
            result = True
            found_index = i
            Exit For
        End If
    ArraySearch = result
End Function

Unfortunately, this implementation failed two of our original unit tests.

Private Sub TestArraySearchSpecial()
    Dim avar As Variant
    ' ut6 searching non-array
    AssertFalse ArraySearch("two", 2)
    ' ut7 searching for array within array
    avar = Array(1, Array(2, 3), 4, 5)
    Assert ArraySearch(avar, Array(2, 3))
End Sub

Ut6 exposes a simple bug. We have to add an early test that the parameter a is really an array before attempting to index it.

' Add this code to the beginning of ArraySearch.
If Not IsArray(a) Then Err.Raise 13, , "Parameter a is not an array."
Ut7 is more complicated. The VBA equality operator = does not support comparing two arrays. In order to compare two arrays for equality, the array elements would need to be compared, one at a time. Such a comparison would also need to be recursive, since there's nothing stopping the arrays from containing sub-array elements themselves.

We have to step back and evaluate whether the benefit outweighs the complexity. The goal is to write a generalized array search algorithm, and if the function is to be as general as possible, supporting searching for a sub-array seems like it's something that we would want to do. On the other hand, in my experience, this is not a scenario that comes up very often. I rarely get to the point where I have a data structure complicated enough to involve arrays within arrays. Before things have gotten to that level of complexity, I either have moved the data into a SQL database, or have developed a properly encapsulated data model using user-defined types and classes. Arrays in VBA are best used to store a homogeneous list of data or (in a pinch) to define a small, limited-use data structure, similar to a tuple in Python. Implementing a complex data model using arrays, sub-arrays, and perhaps a Dictionary or two will quickly lead you to an unmaintainable mess of spaghetti code.

Using that reasoning, I've made the decision that I will not support searching for sub-arrays. We still should check for sub-arrays in the input so that this edge case is properly accounted for. This test can be combined with the earlier check.

' Add this code to the beginning of ArraySearch.
If Not IsArray(a) Then
    Err.Raise 13, , "Parameter a is not an array."
ElseIf IsArray(x) Then
    Err.Raise 13, , "Searching for a sub-array is not supported."
End If

While we're at it, let's update the unit tests to accommodate our revised design.

Private Sub TestArraySearchSpecial()
    Dim avar As Variant
    Dim result As Boolean
    ' ut6 searching non-array
    On Error Resume Next
    result = ArraySearch("two", 2)
    AssertEqual Err.Number, 13
    On Error GoTo 0
    ' ut7 searching for array within array
    avar = Array(1, Array(2, 3), 4, 5)
    On Error Resume Next
    result = ArraySearch(avar, Array(2, 3))
    AssertEqual Err.Number, 13
    On Error GoTo 0
End Sub

We're not done yet, though. ArraySearch still has problems. A final form will be presented in tomorrow's post, but before that, here are some more failing unit tests. As an exercise, see if you can guess what causes these tests to fail and how the code will need to be updated to deal with them.

Private Sub TestArraySearchTypes()
    ' ut8 Search for vbObject
    Assert ArraySearch(Array(1, 2, 3, ThisWorkbook), ThisWorkbook)

    ' ut9 Searches involving Empty, Null, and vbObjects
    Assert ArraySearch(Array(1, 2, Empty, Sheets(1), Null, 4), 4)
    AssertFalse ArraySearch(Array(1, 2, Empty, vbNullString), Null)
    ' ut10 ParamArray searches
    TestArraySearchParamArray 1, "a", Null, Sheets(1), "arg1"
End Sub

Private Sub TestArraySearchParamArray(ParamArray params() As Variant)
    Assert ArraySearch(params, "arg1")
    AssertFalse ArraySearch(params, vbNullString)
End Sub

Private Sub TestArraySearchCollection()
    Dim c As New Collection

    ' ut11 Support for Collections? Why not?
    Assert ArraySearch(c, "cdef")
    Assert ArraySearch(c, 1)
    AssertFalse ArraySearch(c, Empty)

Friday, April 17, 2015

A robust implementation of ArraySearch for VBA: Part 1

VBA doesn't include a built-in function for searching an Array. Because of this, after doing work in VBA for a while, you'll probably have written several loops like this.

Dim i As Long
Dim found As Boolean

For i = 0 To UBound(a)
    If a(i) = v Then
        found = True
        Exit For
    End If

If found Then
    Debug.Print "Value was found at index " & i
    Debug.Print "Value is missing"
End If

You might think that this is quite a bit of code for what should be a simple operation. A huge element of good programming style is to eliminate repeated logic and abstract it into separate procedures. VBA, with its out of date, weirdly designed standard library, provides us with several opportunities to practice this aspect of software design.

I think that most VBA programmers should put together their own "utility" module that they include in any non-trivial projects to eliminate the standard, boilerplate code that they find themselves writing often. As you might have guessed by now, this post is going to focus on writing a standard ArraySearch function that would fit perfectly in such a utility module.

First attempt at ArraySearch

This is our first attempt. We're going to try just converting the code in the snippet above into a Function.

' Search the array [a] for value [x], returning whether x was found.
' If [found_index] is provided and [x] is found, [found_index] receives the
' index where [x] was found.
Public Function ArraySearch(a As Variant, x As Variant, _
                            Optional ByRef found_index As Variant) As Boolean
    Dim i As Long
    Dim result As Boolean

    result = False
    For i = LBound(a) To UBound(a)
        If a(i) = x Then
            result = True
            found_index = i
            Exit For
        End If

    ArraySearch = result
End Function

I changed the indexing from the prior version to go from LBound to UBound in case someone passes in an array with a base other than zero, which is possible when declaring an array or by using the Option Base directive. Other than that, it's pretty similar.

As an exercise, try reading this function carefully and make a prediction of what kinds of input it will fail to process.

Let's do some unit testing, with a focus on VBA's many different possible kinds of arrays.

Private Sub TestArraySearchFixed()
    Dim afixed(4) As String
    Dim found As Long
    afixed(0) = "alpha"
    afixed(1) = "bravo"
    afixed(2) = "1"
    afixed(3) = "two"
    afixed(4) = "3"
    ' ut1 successful/unsuccessful searches on fixed array
    ' unsuccessful search should preserve found_index
    Assert ArraySearch(afixed, "two", found)
    AssertEqual found, 3&
    AssertFalse ArraySearch(afixed, 2, found)
    AssertEqual found, 3&
End Sub

Private Sub TestArraySearchDynamic()
    Dim adynamic() As Long

    ReDim adynamic(3)
    adynamic(0) = 78503
    adynamic(1) = 65075
    adynamic(2) = 21738
    adynamic(3) = 43044
    ' ut2 unsuccessful/sucessful search on dynamic array
    AssertFalse ArraySearch(adynamic, 44444)
    Assert ArraySearch(adynamic, 78503)
End Sub

Private Sub TestArraySearchSpecial()
    Dim abase(1 To 2) As Variant
    Dim asingleton(0) As Integer
    Dim aempty As Variant
    Dim avar As Variant

    abase(1) = "abc"
    abase(2) = 123&
    asingleton(0) = -1
    aempty = Array()
    ' ut3 search on a singleton array (found and not found)
    Assert ArraySearch(asingleton, -1)
    AssertFalse ArraySearch(asingleton, -2)
    ' ut4 search on empty array
    AssertFalse ArraySearch(aempty, vbNullString)
    ' ut5 search on homogeneous variant array (found and not found)
    avar = Array(1, 2, 3, 4)
    Assert ArraySearch2(avar, 3)
    AssertFalse ArraySearch(avar, 0)
    ' ut6 searching non-array
    AssertFalse ArraySearch("two", 2)
    ' ut7 searching for array within array
    avar = Array(1, Array(2, 3), 4, 5)
    Assert ArraySearch(avar, Array(2, 3))
End Sub

Unit tests 1 through 6 pass okay, but we run into problems on the last two.

Why did the tests ut6 and ut7 fail? How are we going to fix the function so that the tests pass? Could there be other problems with our ArraySearch function that we haven't even tested for yet? Find on Monday, in the next blog post!

Converting Variant types in Excel VBA

As part of an Excel project, I needed to write some code that converted Variants in a reliable way.

The project involved calling a stored procedure that took parameters of many different types which were provided via cells in a spreadsheet. An Excel spreadsheet cell can contain values in a variety of different data types. When you access a cell value with VBA code, it comes to wrapped in a Variant type. Essentially, I needed to be able to convert data from a very forgiving type system (spreadsheet cells) to a very strictly typed system (a stored procedure parameter in a SQL Server database). And since the data was being supplied by non-technical end-users, I had to be sure that unusual Variant types didn't cause issues with data integrity or summon the dreaded VBA error dialog box. For example, the Variant type vbError, which is used by error cell values such as #NA or #NAME, is infamous for causing problems with VBA macros. You'll see why this is in a future post when I discuss Variant comparisons.

To solve the problem, my approach was to write a function that takes a Variant type (that could be any sub-type) and returns a Variant of the desired sub-type, converting it if possible, choosing a logical substitution value if not, and only raising an error when there is no other option.

In preparation, I wrote some code to build a Variant type conversion chart, which shows whether a subtype can be converted to any other subtype, and any errors that may be generated when the conversion is attempted.

Most of the results in this chart are obvious. It's to be expected that just about every type can be converted to a string or that numbers will sometimes overflow when converting to narrower numeric types. Some results are surprising, such as the inability to convert a vbError to a vbDate or a vbDecimal, even though it can be converted to just about every other type.

The conversions for vbEmpty and vbNull are instructive. The Empty value in VBA represents the default value, so it makes sense that it can be converted into anything else. When you convert Empty to a subtype, it turns into the default value for that subtype: for example, an empty string or the integer 0. The Null value, on the other hand, is intended by VB's designers to be an "intentionally missing" value, and so it's a mistake to coerce it into a data containing type. Thankfully, a spreadsheet cell cannot contain a null value. Null values behave similarly as in SQL in comparisons, and Null is considered Falsey in a VBA If statement.

Here is the variant conversion function that I ended up writing. Remember, the intention was to convert a variant value from any subtype to a defined variant subtype, and to be as forgiving as possible in the process.

Const CustomError = 2742

' Convert Variant [x] to subtype [rtype], returning [if_null] if conversion is
' impossible, otherwise returns a converted Variant value.
' Additional parameters are values that should be treated as null.
Public Function VConvert(rtype As VbVarType, _
                         x As Variant, _
                         if_null As Variant, _
                         ParamArray nullable_values() As Variant) As Variant
    Dim result As Variant

    ' Raise an error if provided with an unsupported type.
    Select Case rtype
        Case vbEmpty, vbNull, vbInteger, vbLong, vbSingle, vbDouble, _
            vbCurrency, vbDate, vbString, vbError, vbBoolean, vbVariant, _
            vbDecimal, vbByte
            ' Guard against Case Else if a valid type is provided.

        Case vbObject, vbDataObject, vbLongLong, vbUserDefinedType, vbArray
            Err.Raise CustomError + 1, , "Invalid result type: " & rtype

        Case Else
            Err.Raise CustomError + 1, , "Unknown result type: " & rtype
    End Select
    If VarType(x) = vbString Then
        result = Trim(x)
        result = x
    End If
    If ArraySearch(nullable_values, x) Then
        result = if_null
    ElseIf IsEmpty(result) Or IsNull(result) Or result = vbNullString Then
        result = if_null
        ' Attempt conversion and return if_null if an error is raised.
        On Error GoTo ErrHandler
        Select Case rtype
            Case vbEmpty: result = Empty
            Case vbNull: result = Null
            Case vbInteger: result = CInt(result)
            Case vbLong: result = CLng(result)
            Case vbSingle: result = CSng(result)
            Case vbDouble: result = CDbl(result)
            Case vbCurrency: result = CCur(result)
            Case vbDate
                If IsDate(result) Then
                    result = DateValue(CDate(result))
                    result = if_null
                End If
            Case vbString: result = CStr(result)
            Case vbError: result = CVErr(result)
            Case vbBoolean: result = CBool(result)
            Case vbVariant: result = CVar(result)
            Case vbDecimal: result = CDec(result)
            Case vbByte: result = CByte(result)
        End Select
        On Error GoTo 0
    End If
    VConvert = result
    Exit Function

    VConvert = if_null
End Function

And here are the unit tests for this function.

Private Sub TestVConvert()
    Dim vDate As Variant
    Dim vResult As Variant
    vDate = DateSerial(1982, 5, 22) + TimeSerial(5, 30, 0)
    ' Test time truncation
    AssertEquals VConvert(vbDate, vDate, Null), #5/22/1982#
    ' Test conversion of vbDate to vbString
    AssertEquals VConvert(vbString, vDate, Null), "5/22/1982 5:30:00 AM"
    ' Test nullable ParamArray
    AssertIsNull VConvert(vbString, "Hello", Null, "a", "b", "c", "Hello", "d")
    ' Test conversion that causes overflow
    AssertEquals VConvert(vbInteger, 999999, -1), -1
    ' Test that the overflow Err was cleared
    AssertEquals Err.Number, 0&
    ' Test vbLong to vbInteger regular conversion
    AssertEquals VConvert(vbInteger, 100&, -1, 99.9), 100
    ' Test string trimming during conversion
    AssertEquals VConvert(vbString, "   abcd  ", Null), "abcd"
    ' Test number conversion with untrimmed string
    AssertEquals VConvert(vbDecimal, "   42.25 ", Null), CDec(42.25)
    ' Test that conversions retain type of is_null parameter
    AssertType VConvert(vbDate, "Not a date", 42&), "Long"
    ' Test that Nulls get converted to is_null
    AssertEquals VConvert(vbInteger, Null, -1), -1
    ' Test that Emptys get converted to is_null
    AssertEquals VConvert(vbLong, Empty, 50&), CLng(50)
    ' Test that vbNullStrings get converted to is_null
    AssertEquals VConvert(vbString, "", "[empty]"), "[empty]"
    ' Test that converting to invalid type raises error
    On Error Resume Next
    vResult = VConvert(vbObject, vStr, Null)
    AssertEquals Err.Number, CustomError + 1
    On Error GoTo 0
End Sub

You may have noticed a call to an unimplemented function called "ArraySearch". VBA provides no built-in function for searching an array, so a robust implementation should be a part of any serious VBA programmer's toolbox. Implementing ArraySearch in a way that intelligently handles heterogeneous arrays of multiple subtypes without producing errors or incorrect result is not trivial and will be a topic for a future post.

Thursday, April 16, 2015

New blog

I'm going to be posting here about my programming work, which is mostly in Excel VBA and T-SQL these days. I might occasionally write about Android development, but most likely this will become one of the abandoned blogs that you see that have three posts from 5 years ago and were never touched again.