Thursday, September 17, 2020

Converting Manually-Entered Strings to Dates

Before computers, the British, US and Canadians all entered dates with the full month-name first, followed by the day number, followed by an ordinal suffix (1st, 2nd, 3rd and 4th etc) followed by the year. It was generally the full four-figure year. I know all this because I’ve just been converting many thousands of these dates from transcribed historical records.  They are the dates from a whole collection of the transcriptions of old newspaper stories going back over a couple of hundred years. At first glance, I reckoned that there would be a good chance that the Convert/try_convert function would deal with this. It does so up to a point. 

Select sum(case when try_convert(Date, [StoryDate],113) is null then 0 else 1 end ), count(*)
FROM dbo.NewspaperStories
-- 125 out of 24708 stories isn't exactly success. what's happening?

It doesn’t mind the month being spelled out in full, and is relaxed about the month preceding the day, but it shies like a frightened horse at the suffix. We can produce the format, we just can’t read it.

SELECT Replace(
                Format(GetDate(), 'MMMMM ~ yyyy', 'en-gb'),
                '~',
                Convert(VARCHAR(2), DatePart(DAY, GetDate()))
                + Substring('stndrdthththththththththththththththththstndrdthththththththst',
                             (DatePart(DAY, GetDate()) * 2) - 1, 2 )
              ) AS Date;

Aha! you think, just remove these ordinal suffixes. Not so fast. try to remove ‘st ‘  using the Replace() function and ‘August’ becomes ‘Augu’. You spot that some dates just don’t have a day, because they apply to the whole month. Also, you find that due to typing errors, and the randomness of human behaviour at times, stuff creeps in to manually-entered dates such as dots instead of spaces and appendages to the year. Suddenly, you are going for a drive down complexity avenue, a route with little hope of a U turn. You are faced with either correcting the source (sorry, too many) or producing a string created from the data elements surgically extracted from the slightly erratic version, that can then be passed to the Convert function to deal with. 

One of the joys of having  26,000 date strings to deal with is that you have plenty of test data for a function.  Curiously, the combination of PatIndex() and Substring() can be quite powerful. The trick is to provide defaults. You have to ensure that PatIndex always succeeds, by providing a default that passes the test attached to the end of the string. You need to add a space to the beginning to deal with untrimmed strings. 

It’s time to create a simple test harness and work away until  all 26,000 dates can be parsed. There were a very few that had to be corrected at source, but otherwise it went well.

SELECT top 100 TheDay+' '+The_Month+' '+The_Year as corrected_Date, fdate
from (Select 
         --look for, and extract, a number between 1 and 39, by looking for the number with either a space or a th,st, nd or rd after it
         --include a leading space just in case the date starts with the day number and add a default
         --so that if no suitable string is found, the default is the 1st of the month
         --occasionally you get a false positive for the date coming from the year so we out in
         --some trick code using '~' to prevent it happening.
        Substring(' '+[date]+' 1st',PatIndex('%[ ~1-3][0-9][ tsnr]%',' '+[date]+'~1st'),2) AS TheDay,
         --we look for a four number string from 1600 -2999 with a default of 2000
    Substring([date]+'2020',PatIndex('%[12][67890][0-9][0-9]%',[date]+'2020'),4) AS The_year,
        --we search for the month by looking for the the first three characters in the twelve names of the month 
        -- preceded by one of a limited number of characters
    Substring(' '+[date]+'Jan',PatIndex('%[ .TDH1-0][JFMASOND][AEPUCO][NBRYLGPTVC]%',' '+[date]+'Jan')+1,3) AS The_Month,
        [date] as fdate 
FROM dbo.NewspaperStories)f(TheDay,The_Year,The_Month,fdate)
WHERE The_Year ='2020'
or try_convert(date,TheDay+' '+The_Month+' '+The_Year,113) is null

Having go this far, we can create a function. I hate doing this until I’m confident that something works well because functions, I believe, can only be debugged in a test harness. Once you’ve glued it into a function, you’re more limited.

CREATE OR ALTER FUNCTION [dbo].[Try_DateFromString] (@DateString nvarchar(80))
/**
Summary: >
 This is a function for parsing a manually-entered date string into a date.
 we assume that somewhere in the string there is a day number, at least the 
 first three characters of a month (in English), and a four-figure year
Author: Phil Factor
Date: 17/09/2020

Examples:
  - Select dbo.Try_DateFromString('August 6th 1868.')
  - Select dbo.Try_DateFromString('about 1st September 1768.')
  - Select dbo.Try_DateFromString('it was, I think, 28 Feb 1768 or thereabouts')
  - Select dbo.Try_DateFromString('it could have been Februry 3 .1768 perhaps')
  - Select dbo.Try_DateFromString('Aug 6 1845')
  - Select dbo.Try_DateFromString('6thAug 1845')
  - Select dbo.Try_DateFromString('6th.Aug-1845')
Returns: >
 Date
**/
RETURNS date
AS
 BEGIN
  DECLARE @attempt DATE = Try_Convert(DATE, @DateString, 113);
  IF @attempt IS NOT NULL RETURN @attempt;
  SELECT @attempt =
   Try_Convert(DATE,
   --look for, and extract, a number between 1 and 39, by looking for the number with either 
   --a space or a th,st, nd or rd after it
   --include a leading space just in case the date starts with the day number and add a default
   --so that if no suitable string is found, the default is the 1st of the month
   --occasionally you get a false positive for the date coming from the year so we out in
   --some trick code using '~' to prevent it happening.
   Substring(
         ' ' + @DateString + ' 1st',
         PatIndex('%[ ~1-3][0-9][ tsnr]%', ' ' + @DateString + '~1st'),
         2
        ) + ' '
   +
   --we search for the month by looking for the the first three characters in the twelve names of the month 
   -- preceded by one of a limited number of characters
   Substring(
         ' ' + @DateString + 'Jan',
         PatIndex(
             '%[ .TDH1-0][JFMASOND][AEPUCO][NBRYLGPTVC]%',
             ' ' + @DateString + 'Jan'
             ) + 1,
         3
        ) + ' '
   +
   --we look for a four number string from 1600 -2999 with a default of 2000
   Substring(
         @DateString + '2020',
         PatIndex('%[12][67890][0-9][0-9]%', @DateString + '2020'),
         4
        ), 113);
  RETURN @attempt;
 END;

As you’ll see from the examples, this is pretty lenient about the format of the date. It had to be, because correcting all the dates at source wasn’t an option.

With this done, search results could be retrieved on date ranges, which is a great bonus. 

I’m much happier with regex queries than the SQL wildcard convention. However, with some ingenuity they can be made to work. By providing defaults, you can cut down the complexity of the logic but there is always a danger of introducing errors that in some circumstances would be damaging. In my case, this would only apply to a faulty reading of the year and a default year of 2020 would, in this case, retrieve the worst critters. 

Testing  of this has to be done by eyeball. I use random samples of 200 records. it could be done with a function that uses an entirely different algorithm to parse a date that is liable to manual errors,  but for the life of me I can’t think of one so liberal!

The post Converting Manually-Entered Strings to Dates appeared first on Simple Talk.



from Simple Talk https://ift.tt/3mwzmy8
via

No comments:

Post a Comment