soundex algorithm in C++

July 19, 2009

I couldn’t actually find a C++ version of the soundex algorithm (all the ones I found were C code with a .cpp extension), so I threw one together.
Here it is if anyone is interested.



static char     lookup[] = {
'0',    /* A */
'1',    /* B */
'2',    /* C */
'3',    /* D */
'0',    /* E */
'1',    /* F */
'2',    /* G */
'0',    /* H */
'0',    /* I */
'2',    /* J */
'2',    /* K */
'4',    /* L */
'5',    /* M */
'5',    /* N */
'0',    /* O */
'1',    /* P */
'0',    /* Q */
'6',    /* R */
'2',    /* S */
'3',    /* T */
'0',    /* U */
'1',    /* V */
'0',    /* W */
'2',    /* X */
'0',    /* Y */
'2',    /* Z */
};

std::string computeSoundex(const std::string &input, const int resultLength){

//keep the first character intact
std::string result = input.substr(0,1);

//compute value for each character thereafter
for(int i=1;i<input.length(); i++){
   //skip non-alpha characters
   if(!isalpha(input[i])){
   continue;
}

//uppercase the input value
const char lookupInput = islower(input[i]) ? toupper(input[i]) : input[i];
//lookup it's value
const char *lookupVal = &lookup[lookupInput-'A'];

//make sure this isn't a dupe value
if(result.find(lookupVal, 0) != 0 ){
   result.append(lookupVal);
}
}

//make sure we could actually encode something
if(result.length() >= resultLength){
return result.substr(0,resultLength-1);
}

//In cases of empty strings (or strings with no encodable
characters, return Z000
return "Z000";
}


Premature Optimization

March 12, 2009

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

-Knuth

This post can be summed up in the following sentance.  ‘Make sure you can justify each and every optimization of your code, before you make the change’.  Stated another way ‘If you haven’t profiled your code, then you don’t know what to optimize’.

Now for the justification…

In my career I’ve done everything from QA, to Development, to Developer Support at Borland.  In that time I’ve seen a lot of funky code.  By funky, I don’t mean strange indentation, or variable naming, but code that jumps through convoluted hoops in the name of ‘efficiency’.  I may be overly sensitive to this, as I saw a lot of this when I was working in support (people would contact us for help after they couldn’t fix the bugs in their own code, or sometime ours).

I will say that this type of problem seems to be much more prevalent with C/C++ programmers than Java programmers.  I think this stems from the fact that most Java programmers don’t really think about memory management and performance when coding, but in their defense, they don’t normally have to.  But I digress…

Most of the hackery that I’ve seen in regards to performance isn’t really effective.  Sure, optimizing that long running startup code brings up the program faster, but since that only happens once every N number of hours, it’s not making a big difference overall.  Programmers will see some area in their code where they can use a new and sexy data structure or technique to speed things up, and will do so, justifying it as optimizing.

In almost every case that I’ve seen, programs spend the vast majority of their time in a very small portion of their code.  Optimizing that portion of code, no matter how un-sexy, is where the big gains will come from.  Also, that portion of the code base should have the best set of unit tests, so you have that much more assurance that you haven’t inadvertently introduced a bug.

Also, a lot of times the bottleneck isn’t in your code directly, but in a library or downstream dependency.  If it takes 2 seconds for for your query to return results, then you need to make sure your tables have the correct indexes, add caching, etc.  *D0 Not* fork the hibernate code your using to try and eek out a little more performance.

So please, justify with an estimate of the performance gain of your proposed optimization before you start modifying the code. If you have no idea what the gain will be, then you probably don’t understand how/why the code is slow well enough to start making changes effectively.

Mat


Free books on programming

February 19, 2009

Every so often, I get asked about resources to learn various programming languages, so…

Here are some free books on learning C++ and Java.

For Java  http://www.mindview.net/Books/TIJ/

For C++ http://mindview.net/Books/TICPP/ThinkingInCPP2e.html

These are both pretty good books, and for the price you can’t beat them.  Of course, if you like them, then buy a copy and support the author.

Also, the local library has a surprisingly decent selection of technology books.


Fresh Eyes

February 10, 2009

I’ve been looking at migrating some C++ code from my group that’s from 2005.  In terms of computers, that’s a very long time and the code is definitely showing it’s age.  As I was looking at it, I saw a lot of opportunity to modify build files and get some things for free (like using Fortify for code analysis).  I also saw that some of the libraries the code is linking in are very very old.  All in all, updating the dependencies was quite easy, and testing the new build took the majority of my time.

This neglect of code (not just build systems, but general usage patterns) is something that software organization need to actively guard against.  There are a number of reasons that this happens.

One, the person or persons writing the code are using the same methodology and design that they believe is best, so they will have little incentive to go back and change work they’ve already completed.

Second, engineers like to work on new and shiny projects. No one likes doing maintenance work, and modifying build files is the worst case scenario for maintence work.  After a bit of work, you have absolutley nothing to show for it but a binary that works exactly like it did before.

Third, if it doesn’t appear to be broken, then why try to fix it?  This is a common argument against refactoring code, and I’ll address this more thoroughly in another post.

Fourth, fear of the unknown.  Engineers already have a toolkit of libraries, patterns, and common usage patterns that they know work (at least reasonably well).  ‘Stick with the devil you know’ is a phrase I’ve used more than once, and isn’t without merit.

In light of the above points, how can an organization keep code bases up to date?  I think a very effective method is to put the ‘new guy’ on the project.  Newer engineers are going to be eager, but they won’t know the systems they are working on.  Getting to know the build files and dependencies intimately will give them a good start at understanding larger codebases.  Also, new engineers (if they have any experience) will probably have some ‘best practices’ from their previous work that they can bring to the table.

Having them impliment those changes in your build process can demonstrate to the team their value, and increase adoption of those processes.  This is a win win situation for the engineer and your organization.  As an added bonus, you just might see some previous bugs disapear as you update dependent libraries.

Mat


Follow

Get every new post delivered to your Inbox.