Tuesday, November 1, 2011

Batch processing with myBatis

I'm currently working on a clearing application for a big German bank. It's the kind of system where you get hundreds of thousands or even millions of data sets per day and have to process them over night. For this kind of problem, you cannot just load all the data into memory. It may work for the moment, but what if the bank acquires new businesses and the data load suddenly explodes? That's not something you can compensate for with hardware.
Luckily, there is no reason to hold the whole result set in memory. Clearing usually involves some preprocessing on a per-transaction basis and computing lots of sums after that. For this you don't need more than one data set in memory at any given time. This is where myBatis' custom ResultHandlers come into play. The default implementation simply puts each row from the database into a list and returns that list to you once the result set has been read completely. But you can change this behaviour by supplying your own handler.
For our clearing example this would involve preprocessing each row by itself and adding some of its values to the sums we're computing. After this is done, we no longer need that row and thus we can let the garbage collector do its business.

So here is how one of those batch processing runs might look:

 private class TransactionHandler implements ResultHandler {
   long importantSum = 0;
   Date clearingDate = new Date();
   SqlSession session;

   public TransactionHandler(SqlSession session) {
     this.session = session;
   }

   public void handleResult(ResultContext context) {
     CreditCardTransaction transaction = (CreditCardTransaction) context.getResultObject();
     //preprocess the single transaction here, then add to the running sums
     importantSum += transaction.getAmount();
   }
 }

 public void doClearing() {
   //assuming you already have the factory somewhere (e.g. in a container)
   SqlSession session = factory.openSession();
   TransactionHandler handler = new TransactionHandler(session);
   try {
     //the first parameter is the name of the mapped sql statement
     //the second are your query parameters, e.g. "all transactions whose status is still 'New'"
     session.select("findTransactionsToClear", queryParameters, handler);
     long sum = handler.importantSum;
     //do something with that sum (probably exporting it to a file)
     //after everything went well, we can commit our changes
     session.commit();
   } catch (RuntimeException e) {
     //if something goes badly wrong, we can roll back the whole clearing run,
     //leaving the system in a consistent state
     logger.error("Clearing failed", e);
     session.rollback();
   } finally {
     session.close();
   }
 }

With this approach you can process millions of data sets with constant memory consumption on the Java side of things. Of course, your database will have to cache lots of updates, but that is what it was built and optimized for.

Handling updates within the result set

There are some subtle problems to consider when processing involves cross-updates between rows from your result set. This happens for instance when you need to find the original transaction for a reversal (A reversal is sent by another bank and means "Whoops, our fault, please give that poor guy his money back").
Take the following logic as an example:
"When a reversal is found, we need to find its original transaction in the database. Once we have found it, we set its status to 'reversed'."
Now there is a small probability that the original transaction is being processed in the same clearing run and thus is part of the result set we are iterating over. If it is further down in the result set, it will not "see" that we have reversed it. This can be a problem if its clearing logic depends on that ("Set status to 'processed', unless the status is already 'reversed'").
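The hazard can be reproduced in plain Java without myBatis. The `Tx` class and the SQL strings below are illustrative stand-ins for the real mapped statements, not myBatis API; the point is only that the handler decides based on a detached copy of the row:

```java
import java.util.ArrayList;
import java.util.List;

// A transaction row as the ResultHandler sees it: a detached copy
class Tx {
    final long id;
    final String status;     // status as it was when the result set was opened
    final Long reverses;     // id of the transaction this one reverses, or null

    Tx(long id, String status, Long reverses) {
        this.id = id;
        this.status = status;
        this.reverses = reverses;
    }
}

class StaleCopyDemo {
    // returns the update statements the run would issue
    static List<String> run(List<Tx> resultSet) {
        List<String> updates = new ArrayList<>();
        for (Tx tx : resultSet) {
            if (tx.reverses != null) {
                // flags the original in the database; the in-memory copy
                // further down the result set is NOT refreshed
                updates.add("UPDATE tx SET status='reversed' WHERE id=" + tx.reverses);
            }
            // decision is made on the stale copy, so a transaction reversed
            // earlier in the same run is still marked 'processed'
            if (!"reversed".equals(tx.status)) {
                updates.add("UPDATE tx SET status='processed' WHERE id=" + tx.id);
            }
        }
        return updates;
    }
}
```

With a reversal (id 2, reversing id 1) ahead of its original (id 1) in the result set, the run issues both a 'reversed' and a 'processed' update for id 1 -- exactly the lost update described above.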

There are several ways to make sure that such changes don't get lost like they did in the example above.

1. Use ORDER BY to sort the data in such a way that there are no forward references, only backward ones. In our case, this was the best solution. Our domain expert assured us that reversals will always arrive at our system at a later time than the original transaction, never at the same time. So we just sorted by incoming timestamp.
2. If the number of changes is small, you can just remember the IDs of the rows you changed. Then for each row, you look up whether it was changed and if so, do a select by id on it. This way you will get the most recent state from the database.
3. Use a scroll-sensitive result set with fetch size 1 (both can be set on a per-mapping basis). This solution is extreme. All changes will be visible immediately, so it can be used when 1. and 2. are not possible. But beware, this comes at the cost of major performance degradation. Use it as a last resort only!
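Option 2 can be sketched in plain Java. The select-by-id mapping is replaced by an in-memory map here, and the names `StaleRowGuard`, `markReversed` and `currentStatus` are made up for this sketch, not part of any API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class StaleRowGuard {
    // IDs of rows we have updated during this clearing run
    private final Set<Long> changedIds = new HashSet<>();
    // stands in for the database; in the real code this would be a
    // select-by-id statement executed through the SqlSession
    private final Map<Long, String> statusById = new HashMap<>();

    void markReversed(long id) {
        statusById.put(id, "reversed");
        changedIds.add(id);
    }

    // before relying on a row's status, re-read it if we touched it earlier
    String currentStatus(long id, String statusFromResultSet) {
        if (changedIds.contains(id)) {
            return statusById.get(id); // fresh select by id
        }
        return statusFromResultSet;    // the copy in the result set is still valid
    }
}
```

Since only the changed IDs are kept, memory consumption stays proportional to the number of updates, not to the size of the result set -- which is the whole point of the streaming approach.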