The Task: Take and load, using MySQL, a flatfile of over 450 colums and over 145,000 rows of data and put it into a staging table.
The Conditions: The data may not be altered prior to the staging load, or in the staging load. Only when the data is loaded from staging to final form may it be altered.
The Platform: MySQL on a Linux OS base.
The Tools used: CSVed, Notepad, Winzip, MySQL, XAMPP (myphpAdmin), Excel 2007, 1 brain.
Of course, one of the most important things was the understanding that though I am faster and more knowledgable with Windows systems, the end result must exist on Linux. Thankfully MySQL runs on both platforms, and it’s command line engine works the same (with the exception of file location syntax and directory access permissions, but these are solvable) so I can test and design on Windows, and then port to Linux with a minimum of alteration required. The data file given to me was, in the words of one of the project leads ‘a horrendous example of a file that is NOT 3NF’. This was an accurate assesment. Intially, I saw a very large file, that used a poor choice of field delimiter. Using CSVed to easily strip out the first row (header names) I was able to construct a table creation script, keeping it as close ot the original names as I could, only replacing spaces with underscores, removing potential reserved or special character conflicts and the sort. Notepad and Wordpad went toes up even attempting to load the main file, and even Word itself was very argumentative. So CSVed was a good thing to have, even though it too had some minor problems, mainly display issues, with the file given its size. Cross your fingers, and try an initial load based on first guesses of the fields. And the result is so bad, it surpassed the MySQL hardcoded counters for errors and warnings at load. A little digging and we find some major culprits. A novel here and there in the notes fields, date fields that are strings, and number fields that are not long enough. Now it’s perfectly possible in MySQL to alter a column’s type and size through the command line. However, XAMPP (via phpMyAdmin) provides a way to do it graphically, and affect multiple rows at once easily, so it makes the job much faster. This was a tool of immense value to this task. I could also use it to purge the table of data to help keep the memory footprint lower during the testing and working out of the problems. A few changes later and I got the errors lower, but not by much. So, using Excel (2007 is the only option since previous versions cannot process more than 65535 rows very well, if at all) and a file of notes, I looked at every single column and adjusted sizes. Given the column count and contents, I had to be concerned with the max size limit per record imposed by MySQL as well as max field size. The later I could work around by use of a different datatype, akin to BLOB or CLOB in Oracle, the former though was something to keep track of. After going through it all, I still had an immense number of errors. So I try something simple, I added a dummy column to the end of the table, and in a single step the error count dropped to a mere 6000 errors and warnings. Progress is good.
The problem still remianed where all these errors were and what was causing them. The solution was arduous, but the safest. Manually extract and test the records from the master file and separate the good from the bad. In small situations this isn’t a big deal, but we were talking over 145,000 records. Testing them 300 at a time still took a while. The relative small number was due to the fact that the MySQL warning display defaults to only showing the first 64 errors or warnings, so I had to reduce the number of attempted records to be able to see and affect the changes needed to find truly bad records. In the end I as able to isolate the roughly 220 bad records and set them aside. Reassembling the test files back together I tested the ‘good’ records, and got zero load warnings. Good. Beginning with the bad records, I was able to isolate a few dozen that were simply caused by incorrect field lengths, since after all, phone numbers are supposed to be 35 characters long, aren’t they? Fixing those, I got down to a mere 195 bad records, less than .2 percent of the total. Acceptable in most circles. The remianing bad records were attributed to either more columns than needed, fewer columns than needed, or data shifting (after a point all data is one column to the left or right of where it shoudl be). Sent the bad records back to the boss, and tested the results of the good data on the Linux test box, with success. Now to just await the next task while streamlining this one a bit more.
I return again to my APEX SQL command box, break open my books and guides, and prepare once more to do battle. In the time I spent away from it, I had many concepts and ideas floating in my head, and rather than do the smart thing and look them up in the book, I try them out in my own example system. I experiment with customizing column display headings, and data formatting in the retrieved data. Glancing only at the ORACLE guide to find exact format syntax, I manage to muddle through making a more customized display, retrieving and combining string data, as well as performing some time math against other data. Once I make sure I save hte query for alter use, I return to my books. Today it’s all about expanding the usability of the WHERE and building better, leaner, more intricate filters.
SQL gives us many things other languages don’t. While you still have to write them out, and provide the logical and precedential constructs to them, I find it easier to construct a multi-layered criteria model for testing data than with other languages. It is still however a good idea I find to actually write out the conditions for the test manually, then convert them into code. Hmm, sounds like design, must be a fluke. Keeping a running log of all the queries that have worked, and the various ways to test data coming back out, be it through a simple value comparison ( equals, greater than, less than) or delving into string math and string comparisons, or looking at date differentials, you can begin to see how versatile you can get. Then they spring the kicker on me. You could use a query inside the WHERE clause to further the filtering. Very nice. Now I could use another query to generate a list of good, or bad values to test against. Oh the power.
After a short power trip of making queries that were insanely ugly to test this new found skill, I returned to my more mundane ones, and sought out the next great answer. And the next page held the answer, how to re-order your results how you want. So now, I could use a normal sorting routine to sort my data, and by multiple rows as well, in differing directions if needed, instead of the default sort of the order the data was read from the table. Now my data was growing more readable and organized with each iteration and application of knowledge. I suppose this is how learning is supposed to be. The book then takes a break from the WHERE and ORDER and FROM of daily life to discuss a concept I had heard about, mainly normalization.
Basically at it’s core, normalization is merely a set of guidelines to better help you design (there’s that word again) your database tables and how they will interact. A simple set of guidelines that, as the book stops at third normal form, while the grizzled old veterans I know speak of the evils of fifth normal form and the battles to the death with the data demons of sixth form. Someday maybe I’ll get to that stage. The basic premise I took away from it was that it was a very good idea to actually pause and think about the design BEFORE making it, it’s easier to measure 3 times and cut once, than it is to try and be inside the database 2 years after it went live applying electronic nails and bailing wire to make it work better due to jumping in with no planning.
I go over the guidelines and apply them to various test ideas I had, and other databases I work with, and can see where the programmer did some design work initially, and where they didn’t. Making a few notes, I dive back into my SQL pool and press on to subqueries. A great little tool when used right, a horrid resource drain if not. Subqueries allow you to place a query inside another query, extending the scope of the overall statement, also allowing you to read from multiple data tables at the same time. I apply this new insight to my own work, and manage to create a set of nested subqueries inside of a main query and it seems to work ok. It appears to run fairly well, of course, since I’m the only one who can use it, I won’t see any major difference in the performance, at least not yet. I take a look at my examples, and try to add another subquery, and I hit a block. I double check all my work, and it seems fine, the subquery will work as a stand alone query, but not as a sub query. It is perplexing.
I enjoy problems. Part of the learning process is working through them. Figuring out what the true problem is, and fixing it, or if it’s out of your league, finding the help you need to fix it. I spend many many hours trying to fix the problem, but without results, I have a feeling where the problem is, and it is my thought that the problem lies in a misunderstanding of a somewhat poorly worded explanation in the book. I am reluctant to, but I go to an outside source for help, and with their help, I get a much easier, and simpler explanation of the problem, and the concept I was working on. Applying what I was told, I now can get the whole example to work. Not only that, but the statement is smaller, and runs faster now as well, prompting me to examine the differences between the original query, and the one I received help with.
By now I had touched on most of the basic fundamental aspects of SQL. Creation of items, retrieval of data, filtering of data, updating data, even removing data. There are things I still don’t understand fully (Joins…you are a complex beast) but I also know that poking around with sample data in a learning environment only will get you so far. There are still more things I need to cover, and concepts that are a part of SQL, and they are looming on the horizon.
So you want to learn SQL. Not a problem. You want to learn ORACLE SQL, not a problem. You have no knowledge of SQL, not a real problem. All you need is some patience, access to a computer and the internet, and a little bit a familiarity with how to use a computer. I was asked to begin reviewing SQL for work, and to focus on the fine points of the beast that is ORACLE SQL. I have done some SQL before, many moons ago, shortly after dinosaurs roamed the earth and a 720K 3 and a half inch floppy was a neat thing to have. After going over my old mundane SQL work that was written into an old Visual basic program against an Access database, I found myself floundering in the immensity of my new challenge. SQL has come a long way since the beginning, and is more powerful and robust than ever. So I go to ORACLE themselves and humbly acquire the ORACLE 11g SQL Reference guide, all 1500 pages of it in electronic format. I also get the PL/SQL (anticipating the need in the future to learn this as well) guide. Now don’t get me wrong, I consider myself fairly intelligent and logical in my process of learning. But a few pages into the guide and it is obvious the ORACLE guide is meant for the more advanced users of SQL. So, I retrace my steps, and believe it or not, find SQL for Dummies, and another third party book (which actually the most helpful) that make much more sense for the new user.
As most know, SQL is accepted as a de-facto standard for working with databases, with several major sub-flavors. Among these are IBM-DB2, SQL Server, MySQL, Access, and Oracle. Thankfully between ORACLE and the books, the major flavors are covered quite well, as are the differences and quirks about some of them. But onto the work at hand. As a beginner in this world of Queries, I knew a little about the basic SELECT, INSERT, and UPDATE structures. But I wasn’t fully aware of how they worked and the other aspects of SQL that were out there. Given the non-procedural nature of SQL, it’s difficult to learn in the traditional sense. Many of the more complicated things you can do are combinations of simpler concepts. Also, since it’s designed to work with data, you either have to have examples that generate the data tables for you to use (and explain the ‘don’t worry, we’ll cover how this works later’ system) or you need a preset environment with data already there.
Thankfully for me, and my own learning style I had such a platform, my APEX project was still installed, and had the sample data in it from when I was learning how that worked. APEX is a wonderful tool for this I found out, as it has a built in SQL command box that you can use to test your examples, write your own SQL statements, as well as look at the table structure and layout, AND the SQL that would generate those very tables, letting you test yourself in the age old practice of reading code and figuring out what it does.
So we start at basic table creation. Learning how to make a table, add a column, modify a column, and drop a column or a table. What we create we can also destroy. A good thing to know. A bit of conceptual theory behind what makes a good database design, tips for better data structures, and a lot of ideas that get formulated and churn in my mind as I read the books. Occasionally I look up from my books and try the concept being read about against the test data in my APEX install. For the faint of heart, I would say to do exactly what the books say, shows you very simply how it is done. My own method says take the concept and apply it, and hope it DOESN’T work. That way, I have to work at it, and figure out why it didn’t and where I goofed, leaving me with a better understanding of what I’m doing. Works for me quite well, and at the same time, since it’s on non-critical data, not much is lost if I really enter the wrong command. Were I working on live (or a copy of live) data for a client, the safe road becomes the method of choice.
After we create a few tables, and test the waters of our basic understanding, we start building a set of example data to be used throughout the rest of the text, entering it and updating it. We learn the basic structure of the SELECT FROM and how it works, what it does, and it’s basic syntax. From there we branch out, beginning to limit the fields we bring back, applying conditions and tests via a WHERE clause, learning how to limit and restrict the data coming back. And then problems crop up.
Although each major flavor of SQL supports most of the ISO standard for it, there are variances. In my learning, the example text was written against a previous version of ORACLE SQL (it has different code for the major flavors where needed) and some of the examples they gave no longer matched with the ORACLE guide and recommended practices. Pausing in the process of the siege, I take a few hours and track down the items that aren’t matching and essentially rewrite the SQL script files that came with the 3rd party books so they match the rules and guidelines of the ORACLE reference guide. This also impresses upon me the importance of cross-platform support, making sure the code I write will work in another SQL environment without any, or only minimal, modification.
Some of these problems are addressed in the datatypes used by the various flavors. The books covers the basic datatypes, and what they are meant to cover and store, and how they can be used. But I find this at odds with the ORACLE guide. The two sources don’t match. While there is legacy support in ORACLE for some of the standards, they recommend different datatypes than the book, and offer differences in how they are structured. For example, the book loves to use the Integer datatype. ORACLE says instead to use the NUMBER type and give it parameters. They accomplish the same thing, but go about it differently. There were have a dozen datatypes listed in the book depending on if you wanted shorts, longs, floats, double floats, root beer floats, wait, not that last one. In ORACLE land, they use a single NUMBER datatype primarily, and let you define it’s scale and precision to get the various data you need. At first a bit confusing, but pretty simple when you step back a bit and look at it. The book also loves to use the VARCHAR type, while ORACLE makes a rather large point to use VARCHAR2, citing future evolutions of the VARCHAR type that will most like cause issues with the systems if you try and use them interchangeably. In the process of going between my sources to figure out the data types I find a very convoluted thing that ORACLE calls a syntax flowchart….
I find out rather quickly to not look at the syntax flowcharts in the ORACLE guide for the commands. A fine product, but the charts are geared for highly trained individuals who are well versed in the ways of convoluted lines and tiny arrow marks. I decide to take a break for a bit, and resume later when the dust has settled and the current reading and learning has had a chance to sink in and make connections to itself I can’t see at the present.
The basics are firmly inside my head. I can find data, manipulate it, massage it, modify it, and put it back either as it was, or in a new state. But there was still a lot to go over.
Views make the world appear differently. The concept was familiar to me, although under a different concept. Virtual tables. Creating something you can query and use as if it were a table, even though it’s not. Great tool for smoothing things out, and easing the work load on the DBA. Also comes with warnings like everything else. Take a view (which is a named query) and you can manipulate the data used to create it, or manipulate data you read from the view itself. Use the view to help make another view. Layer the power on top of itself a hundred fold. Wait, no, that’s not good. Since they are queries, they also have to be run when called, so use too many of them, or use them inefficiently, and your performance drops hard.
This also touches on the next thing covered, tuning your SQL. Like a racing team, getting every last cycle of CPU, every last bit of RAM, and last millisecond you can out of the hardware to drive your database. The mundane first level of tuning was focused on hardware actually, noting that a simple increase of RAM or CPU, or separation of data table locations could increase performance of the system as a whole. Then you look at he statements themselves, either manually running through them, or using the built in tools to analyze the performance of a given query and let it try and tell you where it thinks you can improve it. Sometimes it’s a s simple as a minor re-wording of the query, other times it’s a rewrite of the query. I also covered the concept that for the most part, only a small percentage of the data is used most often, and the rest is mainly legacy support or such form long term reports. Just an interesting tidbit that makes sense, and was used to show that you don’t have to tune the entire system, but rather focus on the parts that get the most load.
A recurring concept that was brought out finally was data security and integrity. Protecting the data not only from itself, but from malicious users, and those who just don’t know what they are doing. The concept of granularity and scope was brought up, defining the level of security, from column level all the way to entire database protecting. Locking the data, and how the locks functioned, and how they are optimized and ways to consider optimizing them for better performance. The ability to grant and revoke permissions and privileges to users, and the problems of cascading permissions and how to avoid those issues was looked at. Assigning permissions to groups and then users to those groups to make for a more efficient and easier to manage security setup was discussed.
They also discussed the concept of transactions, committing changes and rollbacks. Discussing the problems that crop up if the data isn’t being monitored or watched properly in a multi user setup. Where one user makes a change to data that another user also is looking at, whether the changed data is propagated to all users properly and how to avoid problems with this situation. Rolling back changes in the data, and how to avoid problems with that process as well. Using a transaction control system to streamline and protect all of this and thus protect the users and the data itself from potential problems.
SQL is extremely versatile and powerful even at the basic level of use. The more one works with it the better they become at harnessing that power for the work they perform. My current knowledge is not enough. I want to know more, and be able to use more. While I am somewhere between novice and moderate user (by my own estimation) it doesn’t mean I cannot go further and enhance my own knowledge of the language. It is a tool, a tool of the business world. Couple it with other tools and it becomes a better tool. Tools help to build the world, and those who are willing to learn the tools can help build that world. I look forward to honing my understanding of this tool, and adding to it, with things like PL/SQL, APEX, even tools like Discoverer to use the data to further myself and those I work for.
It’s always enjoyable to be able to dust off old skills and use them in new ways, or on new projects. I find myself in this great position once again. I was handed a task to build an install guide from the higher ups. They asked for clear instructions, documented steps and screenshots to go with it all. I was already looking forward to it.
After taking three dry runs at the install process, using the manufacturers guides, I was able to distill the steps down to something a bit more friendly and straightforward, without the plethora of caveats and warning and cautions in the original guides. Wiped the computer clean and booted up my image tools, and away I went. Two hours later I had a once more running installation of the software, and a couple of dozen screenshots to go with them.
Taking my pictures to my main machine, I cataloged them, and associated them with the steps they went to, and made sure they were uniformly labeled. Down sampling them for inline sizes, I set about writing the framework instructions while the final look was distilling in my mind. Once I had it all together, fire up the HTML, and dust off the hard references and dive into the wonder world of markup languages. A few tweaks here and there, a table under here, and a horizontal line there, and I had what was shaping up to be a decent guide.
The fun part about writing HTML for me is the raw versatility you have with it, without ever having to touch any of the extended capabilities of it (Java, Perl, CGI, ASP, PHP, etc) and how decent and simple a page can look with just the basics. I find many places overlook the ability of raw HTML in favor of the flavor of the month add-on. While they are nice in some situations, in my world, the closer you are to pure HTML, the better.
So here I sit, with a guide that works on three different browsers, contains 76 images in total, and has a total footprint of just over 2 MB. Not too shabby given what I included in the page. Now I begin reviewing it, streamlining it and correcting syntax and little errors that are hard to weed out and find. Rewording sentences here, fixing spelling there, reducing the number of tags needed to make the page work, and applying some basic formatting changes to try and give it a better feel. The guide is now resting with my bosses, awaiting their trials to see if I did write a guide easy to follow, and free of major bugs. I look forward to their feedback and truly hope they find something broken.
Not that I doubt myself, but it’s always hardest to debug your own work. Let someone else use it, and you’ll almost always be able to make it better. I await the notes so I can continue to improve both the guide and my own skills.
Get around this with
Start > Run…
mstsc /console
Update: If this doesn’t work, you may be using XP SP3 (or Vista). Try “mstsc /admin”