I’ve started working through Dave Thomas’s Code Katas, as regular readers of this blog will be aware. I tackled the previous kata (Code Kata 2: Karate Chop) in Clojure as a way to help learn the language. I was planning to attempt the next kata in either Clojure or Erlang, but decided instead to implement a solution in my usual language: Python.
The kata is in three parts, and I had to resist the urge to read ahead and spoil the process. You can find the details of the kata at http://codekata.pragprog.com/2007/01/kata_four_data_.html
Part 1 involved reading values from a weather data file. The data in the file looked like a plain text conversion of an HTML page and had a bit of surrounding fluff that needed to be ignored in order to parse the tabular data. Through sheer laziness, I opted to use a regular expression to match lines with the data I was looking for and scoop the data out.
Part 2 was a similar task, but with football data. In this case, a regular expression would have been more of a pain, so I decided to split each line on whitespace and use only those rows that could be split into 12 items. All the rows I was looking for met this criterion, and it was quick to do. Unlike the first program, I ended up with more fields than I needed.
Now comes the interesting bit: part 3. The task was to take the two programs and factor out the common parts.
To what extent did the design decisions you made when writing the original programs make it easier or harder to factor out common code?
In both cases, the code consisted of two functions. The first parsed the data file, returning a list of valid rows. The second took that list and identified the row with the smallest difference between two values. There was definite duplication between the two programs, with differences in logic: the parsing functions iterated over the files in the same way, but the data extraction differed substantially, while the second function was much more similar between the two.
Take the two file parser functions. For part 1:
def parse_file( filename ):
    data_file = open( filename, "r" )
    data = []
    for line in data_file:
        match = DATA_LINE_RE.match( line )
        if match:
            dy = int(match.group(1))
            mxt = int(match.group(2))
            mnt = int(match.group(3))
            data.append( (dy, mxt, mnt) )
    return data
and part 2:
def parse_file( filename ):
    data_file = open( filename, "r" )
    data = []
    for line in data_file:
        columns = WHITESPACE_RE.split( line )
        if len(columns) == 12:
            data.append( columns )
    return data
The core of both functions can be distilled down to this:
def parse_file( filename, extraction_function ):
    data_file = open( filename, "r" )
    data = []
    for line in data_file:
        extracted_data = extraction_function( line )
        if extracted_data:
            data.append( extracted_data )
    return data
The domain-specific logic is now encapsulated in a function passed to parse_file. I can easily substitute new logic for future problems by passing in a different function. The extraction function should return either a tuple/list or a value that evaluates to false if no data could be extracted. I’ve used None in my code, but an empty list would also suffice if it made sense for the problem at hand.
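To illustrate the idea, here is a minimal Python 3 sketch of two extraction functions being plugged into the same parsing loop. The sample lines, the regular expression, and the function names are my own illustrations, not taken from the kata data files, and I’ve used a line-iterable variant of parse_file so the example needs no files on disk:

```python
import re

# Hypothetical pattern for weather rows: day, max temp, min temp.
DATA_LINE_RE = re.compile(r"\s*(\d+)\s+(\d+)\s+(\d+)")

def extract_weather(line):
    """Return (day, max, min) as ints, or None if the line doesn't match."""
    match = DATA_LINE_RE.match(line)
    if match:
        return tuple(int(g) for g in match.groups())
    return None

def extract_football(line):
    """Return the columns of a 12-field row, or None otherwise."""
    columns = line.split()
    if len(columns) == 12:
        return columns
    return None

def parse_lines(lines, extraction_function):
    """Same shape as parse_file, but over an iterable of lines."""
    data = []
    for line in lines:
        extracted = extraction_function(line)
        if extracted:
            data.append(extracted)
    return data

weather = parse_lines(["   1  88    59", "some header fluff"], extract_weather)
# weather == [(1, 88, 59)]
```

Swapping extract_weather for extract_football is the only change needed to handle the second data format.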
The second function in part 1 looks like this:
def print_day_with_smallest_spread( data ):
    smallest_day = None
    smallest_spread = None
    for dy, mxt, mnt in data:
        spread = mxt - mnt
        if smallest_spread > spread or smallest_spread is None:
            smallest_day = dy
            smallest_spread = spread
    print smallest_day
In part 2 it looks like this:
def print_team_with_smallest_goal_difference( data ):
    smallest_team = None
    smallest_goal_diff = None
    for row in data:
        goal_diff = abs( int(row[GOALS_FOR]) - int(row[GOALS_AGAINST]) )
        if goal_diff < smallest_goal_diff or smallest_goal_diff is None:
            smallest_team = row[TEAM_NAME]
            smallest_goal_diff = goal_diff
    print smallest_team
There are some small differences, but the logic is essentially the same in both. The use of an absolute value in the goal difference is a notable deviation from the first procedure. It goes beyond what the weather data requires, but it remains valid there: the maximum temperature is never below the minimum, so the spread is never negative and abs() leaves it unchanged. I toyed with leaving the two versions different, but making the change improved the commonality and made the factoring out easier without changing the result. That was the deciding point, so I made the change.
The extraction of values from the data list was the only other differing point, but both were doing the same thing. Some little tweaks resulted in a procedure suitable for both:
def get_item_with_smallest_difference( data, label_index, value1_index, value2_index ):
    smallest_label = None
    smallest_diff = None
    for row in data:
        diff = abs( int(row[value1_index]) - int(row[value2_index]) )
        if smallest_diff > diff or smallest_diff is None:
            smallest_label = row[label_index]
            smallest_diff = diff
    return smallest_label
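Here is how the generic function might be called for both problems. I’ve restated it in Python 3 (with the None check reordered, since Python 3 refuses to compare None against an int), and the index constants and in-memory sample rows are hypothetical illustrations rather than the kata’s real data:

```python
def get_item_with_smallest_difference(data, label_index, value1_index, value2_index):
    """Return the label of the row with the smallest absolute difference
    between the two value columns."""
    smallest_label = None
    smallest_diff = None
    for row in data:
        diff = abs(int(row[value1_index]) - int(row[value2_index]))
        if smallest_diff is None or diff < smallest_diff:
            smallest_label = row[label_index]
            smallest_diff = diff
    return smallest_label

# Weather: rows of (day, max temp, min temp); the label is the day.
weather = [("1", "88", "59"), ("2", "79", "63"), ("3", "77", "55")]
print(get_item_with_smallest_difference(weather, 0, 1, 2))  # "2" (spread 16)

# Football: hypothetical column positions and made-up figures.
TEAM_NAME, GOALS_FOR, GOALS_AGAINST = 1, 6, 8
football = [
    ("1.", "Arsenal",   "38", "26", "9", "3", "79", "-", "36", "43", "87"),
    ("2.", "Liverpool", "38", "24", "8", "6", "67", "-", "30", "37", "80"),
]
print(get_item_with_smallest_difference(football, TEAM_NAME, GOALS_FOR, GOALS_AGAINST))
# "Liverpool" (difference 37 vs Arsenal's 43)
```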
Was the way you wrote the second program influenced by writing the first?
Definitely. It’s difficult not to be influenced by code you’ve written, or seen, before – especially when you have implemented solutions to two similar problems in close proximity. I was about to approach the second part in the same way as the first, using a regular expression to extract the data I needed. However, I realised that this would yield a large and unwieldy expression, so I decided to go for an approach I had briefly considered for the first part: split each line on whitespace to yield a list of values. I ended up with values I wasn’t going to use, but it was a cleaner approach for that particular problem.
Is factoring out as much common code as possible always a good thing? Did the readability of the programs suffer because of this requirement? How about the maintainability?
It’s usually a good thing, but in software development there are no rigid rules. Readability of the refactored programs gained on some points and lost on others. The refactoring made the core of the problem more obvious, but the name of the function that finds the smallest difference became more generic. That generic name is perfectly fine, but it might have made more sense to wrap the call in procedures named for the specific problem domain. The idea of using constants to identify the fields of interest, introduced for readability in the second solution, carried over usefully to the first.
Overall, maintainability improved slightly. The duplicated code is factored out and can easily be reused for writing code to solve other problems of a similar nature. If a bug is found in one program, there’s a good chance it’s in the common code and thus can be fixed in both easily. Likewise, I could make performance improvements that would be applied to both easily. By splitting the domain-specific logic out from the generic core of the solution, I made the code easier to test by providing a discrete function for the logic, as well as a way to inject a fake or mock function into the parsing code.
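As a sketch of that testability point: once the extraction logic is injected, the parsing loop can be exercised with a stub function and no file at all. The parse_lines variant below, which takes an iterable of lines instead of a filename, is my own assumption for test purposes, since the original opens a file:

```python
def parse_lines(lines, extraction_function):
    """Line-oriented variant of parse_file, easier to drive from a test."""
    data = []
    for line in lines:
        extracted = extraction_function(line)
        if extracted:
            data.append(extracted)
    return data

def fake_extractor(line):
    """Stub: accept only lines starting with 'ok', tag what it returns."""
    if line.startswith("ok"):
        return ("row", line)
    return None

result = parse_lines(["ok 1", "skip me", "ok 2"], fake_extractor)
assert result == [("row", "ok 1"), ("row", "ok 2")]
```

The generic loop is verified in isolation; the real extraction functions can then be tested separately, one line at a time.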
Kata solutions for this and the previous kata are available under the MIT License from BitBucket: http://bitbucket.org/metaljoe/metaljoe_codekata/