Thursday, July 2, 2015

Removing nondigits from a column in a table using T-SQL

While T-SQL admittedly isn't the greatest language there is for text processing, you shouldn't dismiss it out of hand for simple tasks.

To remove non-digits from a column, you can use PATINDEX and STUFF in a loop. If you're careful to do this in a set-based way, T-SQL should be efficient enough in most situations.

WHILE 1 = 1 BEGIN
    WITH q AS
        (SELECT table_id, PATINDEX('%[^0-9.]%', column) AS n
        FROM table)
    UPDATE table
    SET column = STUFF(column, q.n, 1, '')
    FROM q
    WHERE table_id = q.table_id AND q.n != 0;

    IF @@ROWCOUNT = 0 BREAK;
END;

This code should be easy enough to follow. It removes the first nondigit in column in each row of the table. Once it goes through an iteration without finding any nondigits, it exits the loop. This will iterate the maximum number of nondigits in a single field in the table, but it's still inefficient because it runs PATINDEX against each row in the table at each iteration.

To solve that problem, we can maintain a column that specifies whether the column has been scrubbed yet. In my example, "NULL" represents unknown and should be the default value for this field.

DECLARE @done bit = 0;
WHILE @done = 0 BEGIN
    WITH q AS
        (SELECT table_id, PATINDEX('%[^0-9.]%', column) AS n
        FROM table
        WHERE COALESCE(column_scrubbed, 0) = 0)
    UPDATE table
    SET column = STUFF(column, q.n, 1, N''),
        column_scrubbed = 0
    FROM q
    WHERE table_id = q.table_id AND q.n != 0;

    IF @@ROWCOUNT = 0 SET @done = 1;

    -- if column_scrubbed is still NULL, then the PATINDEX
    -- must have given 0
    UPDATE table
    SET column_scrubbed = CASE
        WHEN column_scrubbed IS NULL THEN 1
        ELSE NULLIF(column_scrubbed, 0)
    END;
END;

It's a sloppy design to add a column to the schema just to make this operation more efficient. The best approach, in my opinion, is to store intermediate results in a table-valued variable and apply it to the actual table at the end.

-- Find nondigit character in a string
DECLARE @pattern varchar(30) = '%[^0-9.]%';

-- Maps [table_id]s to scrubbed values
DECLARE @t TABLE
    (table_id int PRIMARY KEY,
    column nvarchar(200) NOT NULL,
    scrubbed bit NOT NULL);

-- Initialize current value of [column] for each [table_id]
INSERT INTO @t (table_id, column, scrubbed)
SELECT table_id, column, 0 FROM table;

-- Iterate to remove non-digits from @t.[column], ending when [scrubbed]
-- is set for all rows.
WHILE @@ROWCOUNT > 0
MERGE INTO @t USING
    (SELECT table_id, PATINDEX(@pattern, column) AS n
    FROM @t
    WHERE scrubbed = 0) q
ON [@t].table_id = q.table_id
WHEN MATCHED AND n > 0 THEN
UPDATE SET column = STUFF(column, q.n, 1, N'')
WHEN MATCHED AND n = 0 THEN
UPDATE SET scrubbed = 1;

-- Apply the results to permanent table
UPDATE table
SET column =
    (SELECT [@t].column FROM @t
    WHERE [@t].table_id = table.table_id);

No comments:

Post a Comment