Tuesday, August 2, 2016

Efficient MySQL Date Verification in JavaScript?

I'm not the best judge of what is efficient in JavaScript (ECMAScript), but I'd like to think this could help someone.

/**
 * Make sure that the passed value is valid for the proposed condition. If
 * isRequired is true, dateString must not be blank or null as well as being
 * a valid date string. If isRequired is false, dateString may be blank or null,
 * but when it's not, it must be a valid date string. A valid date string looks
 * like YYYY-MM-DD.
 * @param dateString {String}
 * @param isRequired {Boolean}
 * @returns {Boolean}
 */
function isDateValid( dateString, isRequired ) {
    var regex = /^\d\d\d\d-\d\d-\d\d$/ ;
    var retVal = true ;

    if ( ! isRequired ) {
        if ( ( null == dateString ) || ( '' == dateString ) ) {
            return true ;
        }
    }
    else {
        retVal = ( ( null !== dateString ) && ( '' !== dateString ) ) ;
    }
    retVal = ( retVal && ( null !== dateString.match( regex ) ) ) ;
    if ( retVal ) {
        var daysInMonths = [ 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 ] ;
        var yr = parseInt( dateString.substring( 0, 4 ), 10 ) ;
        var mo = parseInt( dateString.substring( 5, 7 ), 10 ) ;
        var da = parseInt( dateString.substring( 8, 10 ), 10 ) ;
        // Leap years are divisible by 4 and either not divisible by 100
        // or divisible by 400.
        if ( ( 0 === ( yr % 4 ) ) && ( ( 0 !== ( yr % 100 ) ) || ( 0 === ( yr % 400 ) ) ) ) {
            daysInMonths[ 1 ]++ ; // Leap day!
        }
        if  ( ( yr < 2000 ) || ( yr > 2038 )
           || ( mo < 1 ) || ( mo > 12 )
           || ( da < 1 ) || ( da > daysInMonths[ mo - 1 ] ) // mo is 1-based
            ) {
            retVal = false ;
        }
    }
    return retVal ;
}
If you know of a more efficient way to handle a MySQL (YYYY-MM-DD) date validation, please reply to this post. :-)
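For comparison, here's an alternative sketch of my own (not from any library) that leans on the Date object instead of carrying a days-in-month table. It catches invalid day-of-month values like 2016-02-30 because the Date constructor "rolls over" out-of-range dates:

```javascript
// Alternative validator: parse the pieces, build a Date, and confirm
// the Date did not roll over (e.g. Feb 30 becomes Mar 1 or Mar 2).
function isDateValidAlt( dateString, isRequired ) {
    if ( ( null == dateString ) || ( '' === dateString ) ) {
        return ! isRequired ;
    }
    var m = /^(\d{4})-(\d{2})-(\d{2})$/.exec( dateString ) ;
    if ( null === m ) {
        return false ;
    }
    var yr = parseInt( m[ 1 ], 10 ) ;
    var mo = parseInt( m[ 2 ], 10 ) ;
    var da = parseInt( m[ 3 ], 10 ) ;
    if ( ( yr < 2000 ) || ( yr > 2038 ) ) {
        return false ;
    }
    // Months are 0-based in the Date constructor; an out-of-range day
    // or month makes the Date roll over, which the comparison catches.
    var d = new Date( yr, mo - 1, da ) ;
    return ( d.getFullYear() === yr )
        && ( d.getMonth() === ( mo - 1 ) )
        && ( d.getDate() === da ) ;
}
```

Whether this is faster than the table-based version probably depends on the JavaScript engine, but it is less code to get wrong.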

Thursday, March 26, 2015

Estimating MySQL rollback time in InnoDB

Estimating (computing) rollback time in MySQL can be a bit of a pain (art), but it can be done. MySQL provides data about rollbacks in InnoDB through SHOW ENGINE INNODB STATUS output. Here's a sample:

---TRANSACTION 2920ACF08, ACTIVE 12568 sec rollback
ROLLING BACK 2426027 lock struct(s), heap size 216037816, 52624206 row lock(s), undo log entries 3226386
MySQL thread id 5669944, OS thread handle 0x2b126bd21940, query id 2028903424 user41
# Query_time: 8736.352709  Lock_time: 0.000151 Rows_sent: 0  Rows_examined: 52624206
SET timestamp=1427378149; 
(query being rolled back)

For estimating rollback time, the important bits here are:
ROLLING BACK 2426027 lock struct(s), heap size 216037816, 52624206 row lock(s), undo log entries 3226386
The "undo log entries" value is the number of undo logs remaining to be rolled back. This value decreases over time. To get to the time remaining, we need at least two samples of this ROLLING BACK line along with timestamps of when those were taken. Put those together and you'll get the rollback rate. Here's rollback rate per second:
rbr = (starting log entries - ending log entries) / (end time in seconds - start time in seconds)
My experience has been that as the undo log entries value approaches zero, the rollback rate tends to increase. With that caveat in mind, if you have your $HOME/.my.cnf set up with your credentials in it, you can use something like this from a Unix (/bin/sh) shell prompt to get a reasonably up-to-date prediction of when the rollback will complete:

$ rbr=161 # Rollback rate per second 
$ while true ; do clear ; x="`mysql -h HOSTNAME -e 'show engine innodb status \G' | grep ROLLING\ BACK`" ; echo "$x" ; echo -n "Minutes to go: " ; expr `echo "$x" | cut -d, -f4- | cut -d' ' -f5-` / $rbr / 60 ; sleep 5 ; done

It's technically possible to dynamically adjust the value of $rbr, but I've found that it tends to lead to more frustration.
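Put in code, the arithmetic looks like this (the two samples below are invented for illustration; only the formula itself comes from the post):

```javascript
// Estimate rollback completion from two samples of "undo log entries".
// Each sample is (entries remaining, Unix timestamp in seconds).
function rollbackEta( startEntries, startTime, endEntries, endTime ) {
    // Rollback rate in undo log entries per second.
    var rbr = ( startEntries - endEntries ) / ( endTime - startTime ) ;
    return {
        ratePerSecond : rbr,
        minutesToGo : ( endEntries / rbr ) / 60
    } ;
}

// Two hypothetical samples taken 600 seconds apart:
var eta = rollbackEta( 3226386, 1427378149, 3106386, 1427378749 ) ;
// rate = 120000 / 600 = 200 entries/sec
// minutes to go = 3106386 / 200 / 60, roughly 259 minutes
```

Since the rate tends to speed up near the end, treat the result as a worst-case estimate.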

Saturday, March 2, 2013

GTD Done Wrong?

So I just saw this in the GTD blog and wondered if I agree. I'm pretty sure I don't agree.

Priorities and Goals... The GTD Achilles Heel? David Allen Company Forums by commmmodo on 3/2/2013 1:24 PM
After using GTD since 2007, I have found priorities and goals to be the system's Achilles heel.
I think there's a fair chance I'm doing something wrong, so I want to give the community a chance to correct me and defend GTD.
Time in life is short and finite. I have reached a conclusion that if you want to achieve big career and life goals, you have to cut out all of the unnecessary projects/tasks and focus exclusively on the absolutely best project that will advance you to that goal. There are lots of tasks and projects we could do, but 80% of our energy should be put into the 20% most important projects.
So I ran a little GTD experiment recently. At my weekly review, I started setting top priority projects for a 3-10 day span and timeboxing it. The idea is to find the project that is holding me back from the next level of success in life, and get it completed in a set number of days. So for example: until March 10th I am working on our fundraising documents for people to invest in our company, and after March 10th it's being marked DONE.
During my experiment, I replied to as few emails that don't deal with this project as possible, put off meetings on other projects, and anything that isn't directly achieving the goal I set. I went in my office and closed the door, metaphorically and literally. Because, really, I can do all of the medium-priority tasks I want... and they're not bad things to be working on... but if I really want to advance my career and my company to the next level as quickly as possible, this top-priority project is all I should be focusing on. It's a harsh reality. I guess an analogy would be, as Warren Buffet says, "Putting all of your eggs in 1 basket and watching it carefully." Instead of watering a thousand roses with my finite water bucket of time, I am watering 1 flower with a lot of water until it's bloomed big and strong.
I was a little upset at how well this experiment went, since I have trusted David Allen and GTD to tell me the best thing to do for 6+ years. The results? I got what would have taken 20 days done in about 4. I achieved my goal, and it moved the company and my life forward in a really big way.
GTD's answer to this, as I understand it, is pretty simple: set 50,000ft, 30,000ft, and 20,000ft altitudes (areas of responsibility and major goals) and review them at your weekly review. Then, as you go through your day, pick out next actions based on context, time, energy, and priority.
The problem with this GTD goal and priority system is: you're never picking out 1 30,000ft goal that should be done next, and systematizing it into your daily routine. There's context lists and project lists... but there's no "Do This Project and Nothing Else if You Want to Advance your Life And Career" list. There's no part of GTD that focuses you on that next most important goal. Instead, you're assessing goals and priorities every 5 minutes, and that creates a mental fatigue of sorts. That 3-10 day goal is never written down, making it easy to lose sight of what you really should be doing, even though you may identify this important project during those precious moments of weekly review zen.
Out of practicality, I've started doing a new activity during my weekly review: "What is the next most important project to complete that will advance my life and career more than anything else?" I write it down, open up Omnifocus, and hide all other projects except that one.
Therefore, I've started to see GTD as a sort of hamster on a wheel, a way to spend time on a lot of stuff that doesn't matter and avoid the harsh reality that I should be focused on the one project that actually matters, and saying "f*** everything else."
My question is: why aren't priorities and goals a part of GTD? Is GTD just that? Getting THINGS done. Don't we really want GTMITD? Getting THE MOST IMPORTANT THINGS done? Okay, okay, the acronym isn't as sexy. But life is short, time is finite, and priorities (as defined by your larger goals) need to be systematized. I need something where I can go on autopilot during the work day. That's the whole point of mind like water, is I don't need to be thinking about my task system all day long. I need a better answer than, "Set up your 20,000ft review, and then reanalyze your priorities every time you complete a task." It's not working for me.
I hope this explains the problem clearly. It's a complex situation, therefore I may not have explained everything you need to know to render a reply. Please feel free to ask followup questions and I'll respond to them promptly. Thank you.

Does anyone watching this blog have any comments on why this GTD practitioner should feel he wasn't using GTD while working on the fundraising documents given the information provided?

I don't consider myself a GTD expert, but I do think "commmmodo" was actually using GTD properly during the "experiment" because he/she elected to prioritize the fundraising document above nearly everything else for a limited time. Maybe I'm missing something.

Thoughts anyone?

Tuesday, November 27, 2012

NoSQL vs. SomeSQL

Linux Journal had a fantastic article (SQL vs. NoSQL) some time back. While I know this is a bit of hopping on the bandwagon, I like the point this video is trying to make: http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale. Caution: The language used in this "video" may not be appropriate for some viewers.

There are lots of folks out there who like to tout performance numbers and how fast things can be under "ideal" conditions, but as both the article and the video point out, the trick is knowing how to balance performance/scalability, reliability, and availability. /dev/null is extremely scalable and available, but it's completely unreliable. In MySQL, the Blackhole storage engine has many of the same performance characteristics, but used properly it can be a great way to "pass through" data in a replication ring.

Sunday, February 12, 2012

A basic shared-nothing data sharding system

There's a lot of buzz about sharding data. Today, I'll provide a very brief overview of how sharding helps systems I manage run more efficiently and how we're addressing keeping individual shards balanced.

The goals of sharding data in our environment are to: 1) make the structure of the data consistent across all the shards, 2) divide the data up so it can be found easily, 3) automatically and continuously re-balance the shards, and 4) allow for changes in scale (like adding a new shard or different shard sizes).

Item 1 is a snap - all we do there is to deploy the same data structures in each of the shards with all the supporting data required to answer questions related to a user. Some of this data is user-specific, some is globally replicated. In any case, this goal makes it easy to use one set of code to access data in any of the shards without having to cross to another shard or database to get the answer for a question. This reduces workload in the application and on other database servers.

Item 2 is done by hashing our key data. Let's say that we have a set of widgets that users are concerned with. Some users have a few widgets, some have a lot, but each user is very different from another. Widgets are pretty common and well defined. Each user has a user ID and any question we ask the system always involves a specific user ID. So - our key data we hash against in this case would be the user ID. Data about the widgets is replicated to all the shards, but data about each user is only kept on the shard where that user's data lives.
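A minimal sketch of item 2 in JavaScript. The hash function and names here are my own illustration, not the actual API; the only idea taken from the post is that the user ID is the key data being hashed:

```javascript
// Map a user ID to one of N shards by hashing the key data.
// Any stable hash works; a simple 32-bit string hash is shown here.
function hashKey( key ) {
    var h = 0 ;
    var s = String( key ) ;
    for ( var i = 0 ; i < s.length ; i++ ) {
        h = ( ( h * 31 ) + s.charCodeAt( i ) ) >>> 0 ; // keep it unsigned 32-bit
    }
    return h ;
}

function shardFor( userId, shardCount ) {
    return hashKey( userId ) % shardCount ;
}

// Every question the system asks involves a user ID, so the same
// user always routes to the same shard:
var shard = shardFor( 'user-12345', 4 ) ; // deterministic for this ID
```

Note that plain modulo hashing remaps many users when the shard count changes, which is one reason the balancer and location API described below earn their keep.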

Item 3 is handled by a separate process that utilizes the same API the application uses. Balancing the data between shards is simple - the balancer asks the API if there are any users that need to move. If yes, the balancer lets the API know to lock that user temporarily, moves the data, then unlocks the user for use on the new shard. What this means for applications is each time a location is returned for a specific user, that location is only guaranteed for a given window of time (30 seconds for example). So - when the balancer tells the API it's moving a user's records, any requests for that user's records are held up until the user's data is moved. The API is smart enough to only let the balancer move data that has not been accessed recently. This doesn't prevent all lock collisions, but it handles most of them.
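The lock-move-unlock flow in item 3 might be sketched like this. Everything here is hypothetical: "api" stands in for the shared API, and all of its method names are my own invention:

```javascript
// Sketch of the balancer's flow against the shared API.
var LOCATION_LEASE_MS = 30 * 1000 ; // locations are only guaranteed this long

function moveUser( api, userId, targetShard, now ) {
    // Only move data that has not been accessed recently; this avoids
    // most lock collisions with live application requests.
    if ( ( now - api.lastAccess( userId ) ) < LOCATION_LEASE_MS ) {
        return false ;
    }
    api.lock( userId ) ;                     // requests for this user now wait
    api.copyData( userId, targetShard ) ;    // move the rows to the new shard
    api.setLocation( userId, targetShard ) ; // future lookups resolve here
    api.unlock( userId ) ;                   // waiting requests resume, re-routed
    return true ;
}
```

The key design point is that the application never sees any of this; it just asks the API where a user lives and occasionally waits a moment longer than usual.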

Item 4 is handled through the configuration of the API. Because we use an API to tell the application where the data is for a given user, we've abstracted away where data actually lives. This makes it easy to add and remove servers from the sharding pool. We've extended this to include allowing a shard to be marked as in a draining state. When a shard is draining, the API will ask the balancer to move rows from the draining shard and redistribute that information onto other members of the sharding pool. This makes it possible to take a shard out of rotation for routine maintenance without the loss of data.
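Item 4 can be sketched as a routing table the API consults. The structure below is a guess at the idea, not the actual configuration format, and the host names are made up:

```javascript
// Shard pool configuration the API consults. Marking a shard as
// draining removes it from placement, so the balancer can empty it
// out while existing lookups still resolve.
var shardPool = [
    { name : 'shard-a', host : 'db1.example.com', draining : false },
    { name : 'shard-b', host : 'db2.example.com', draining : true  },
    { name : 'shard-c', host : 'db3.example.com', draining : false }
] ;

// Shards eligible to receive data: everything not draining.
function placementTargets( pool ) {
    return pool.filter( function ( s ) { return ! s.draining ; } ) ;
}

var targets = placementTargets( shardPool ) ; // shard-a and shard-c only
```

Once shard-b's users have all been redistributed, it can be pulled for maintenance without losing anything.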

Notice that I didn't mention any specific software here. I didn't tell you what language the application is written in, what language the API is written in, or what the actual data store was. The technique of sharding data is pretty simple and can be done with nearly any persistence layer using any programming language.

The beauty of this system is that once the API is written, the balancer can be a complete "black box" to the application. This type of system could be implemented with a single data store when just starting out and be expanded to multiple stores as the need grows. Also - if the sharding key needs to change, again, the application doesn't need to change - just the API and the balancer.

One other big benefit to sharding data like this - it's often a lot cheaper to buy several smaller systems than to buy and maintain one very large system. If one of the systems in the sharding pool goes off-line, the worst possible exposure in a shared-nothing sharding system is the data stored on the member that went down. In a monolithic system, you stand to lose a lot more.

While I wouldn't suggest trying to do this type of work on top of every data set out there, I do see that there is a lot of benefit when the types of questions being asked of a data set can be divided up easily while still making it relatively easy to answer the "question at hand" from a single source. The secret in the sauce is making sure that any common data is shared among all the systems in the pool.

Sunday, January 15, 2012

Managing incoming emails

Reading emails all day long tends to be very counter-productive for me. I usually end up responding faster than anyone else which generally gets me a lot more work than I need. At the same time, I have a responsibility during my times as primary and secondary on-call to respond within our service level agreement. So - how do I find balance? My team and I use mailing lists to help us manage those truly urgent issues versus those issues that can be handled as time allows. We have three lists:

 - Primary on-call: reaches whoever is currently primary on-call for urgent issues
 - Secondary on-call: backs up primary (on our two-man team, it reaches everyone)
 - Admin: everything else, handled as time allows
We've published these three lists to our operations center. Everyone else just gets the admin list. We don't tell others about the primary and secondary lists because anything we'd get on primary or secondary would need to come via the operations center anyway. We also don't want our over 600 co-workers (not on our team and not in the NOC) to email us willy-nilly using our on-call emails.

Next, on each of our team's smart phones, we've set them up to recognize emails going specifically to the primary and secondary emails so our phones will either go off like a pager or (in my case) read the sender and destination email (think "Inbound Primary email from the NOC"). That prevents me from having to look at my phone every time a new message comes in but lets me know when there's something that requires my attention.

The other thing we do is make it easy to change the destination for the primary address so that only the current primary gets notified. Secondary is handled the same way, but on my two-man team there are only two of us, so secondary always goes to the whole team (for now).

Finally, to help keep things sane, I do what I can to check the "other" emails only twice a day.

The net result of this process is that I am able to focus on getting project work done between routine email readings, and it lets others figure things out for themselves or wait a bit for an answer. If something is truly urgent, the sender can simply ask the NOC to reach out to the on-call person for a faster response.

How do you deal with your on-call processes and email?

Monday, November 28, 2011

Watch this video for instructions on how to use indexes better

This is the first video I've ever seen that visually represents how indexing works. I've seen good stuff before but this ... wow. Yes - it has some stuff about Tokutek in it, but that's not why I suggest watching it - it's because it makes you re-think how to define good indexes.