.. title:: s5 DEMO!!!! - RandomCo Thoughts

.. footer:: $Id: s5demo.txt,v 1.1 2006/04/05 17:11:27 tundra Exp $

.. contents::

Meeting Mechanics
-----------------

- It would be best if we could get all the way through the
  presentation in "Read Only" mode.  This will allow you to understand
  the entire analysis in context.  This will be followed by an
  open-ended Q&A session.

- Ed Murphy will present our findings using interpretive dance.

Possible Responses
------------------

- As you hear us present today, you will have one of three responses:

  - "Oh, I/we already know that."

  - "Hmm, that's new to me - good point."

  - "I thoroughly disagree with what you just said."

  Each of these is an important way for you to validate and/or act
  upon our findings.

Assumptions
-----------

- A lot of baseline data such as machine inventories, levels of
  utilization, application response time, and arrival rate profiles
  was either missing, incomplete, or inconsistent across systems.
  Some data (such as pricing) was completely unavailable to us.  We
  have thus built the models using estimates for certain critical
  datapoints.  RandomCo can update these models at will to make use
  of "real" data and thereby get better model output.

Key Findings
------------

- The RandomCo IT culture is end-user/project focused to a fault.
  Infrastructure is (mostly) enhanced incrementally and is not
  treated as a common asset designed to serve the entire breadth of
  applications.

- This has led to an overprovisioning of some classes of servers
  (Windows) and a greater variety of system types (Unix) than is
  strictly necessary.

- Operational disciplines such as asset inventory control,
  measurement, and reporting vary greatly in depth and quality.  This
  makes it hard to manage what is not consistently measured.

- Business, Architecture, Development, Infrastructure, and Operations
  are not bound together with a common overarching view of IT at
  RandomCo.  There is a tendency for each of these to operate as
  silos.
  The Architecture team tends to have the broadest view of these
  issues.

- Customizations to key business subsystems such as SAP and
  Manugistics are creating a high degree of operational complexity
  that may not be justified.

- The strict commitment to a Windows/.NET-only development
  environment is unnecessarily constraining the organization's
  agility, time-to-market, and ability to control costs.

Core Themes
-----------

- Infrastructure provisioning needs to move from a project-centric
  model to an Enterprise-wide service model.

  1) This will maximize reusability of extant infrastructure assets.

  2) This will enable a systemic perspective for provisioning new
     infrastructure with attendant economies of scale.

  3) The current complexity, variety, and underutilization of systems
     at RandomCo is a direct consequence of making project-based
     infrastructure decisions.  No amount of migration/consolidation
     will make a permanent difference if the underlying root-cause
     practice that created the situation in the first place is not
     addressed.

- Measurement, Monitoring, and Management need to be made more
  consistent and reach more widely across the IT operational
  environment:

  1) Basic asset information such as server inventory, software
     revision levels, machine age, and so on varies widely by
     platform.  This makes business case cost calculations for new
     initiatives difficult, and in some cases, impossible.

  2) Today there is wide variation in the depth and quality of
     performance and capacity metrics available across the datacenter
     assets.  This makes tuning and capacity planning a vertical,
     per-server activity (if it happens at all), rather than a
     systemic infrastructure concern.

  3) In short, "You Cannot Manage What You Do Not Measure."

- RandomCo today is already making good use of virtualization in the
  "Big Box" Unix and Mainframe areas.  This needs to be extended to
  the Windows-class servers as well.

  1) This will allow more efficient use of existing server capacity.
  2) This will enable rapid resource (re)provisioning on a
     per-project, or perhaps even per-event, basis.

  3) This will decouple application software from the underlying
     operating environments by testing and certifying the application
     to the *virtual OS*, not the physical hardware.  This will
     materially reduce the retesting burden currently incurred when
     hardware is upgraded or changed.

- RandomCo should begin the necessary steps to reduce the number of
  different Unix variants within the IT organization and reduce its
  total dependence on Windows as a server platform.  Wherever
  possible, these should be migrated to SLES Linux across the
  required breadth of hardware.  RandomCo will benefit in doing so
  because:

  1) This creates a common operational platform, thereby reducing
     training cost and maximally leveraging the employee skill set.

  2) This makes the organization hardware-agnostic, thereby providing
     negotiation leverage with the hardware vendors.

  3) The net software licensing cost should drop significantly:

     a) RandomCo already has an Enterprise License for SLES.

     b) SLES will be bundled with XEN virtualization in future
        releases.  This should be considerably less expensive than
        the separate licensing of Windows and VMWare on today's
        servers.

  4) The first candidate for elimination is AIX.

- RandomCo needs to embrace Linux as a development platform for its
  own customized software:

  1) This will give it many more degrees of freedom in how it
     designs, deploys, and operates its own applications.

  2) This will open the door to cost reduction by replacing expensive
     enabling components (like IIS) with free or very inexpensive
     open source equivalents (like Apache).

  3) This will enable "scale" at the *organizational* level.  Today,
     there is a significant difference in worldview, skill set, and
     approach between the Windows developers and the rest of the
     RandomCo IT community.
     By moving to make Linux one of the common development platforms,
     RandomCo will open the door to having the in-house applications
     it develops run on everything from an entry-level machine
     through an Enterprise-class mainframe.  This will be done with a
     common set of development tools, technologies, and *people*
     across the organization, with a far stronger alignment between
     Architecture, Development, and Operations.

Specific Technical Recommendations
----------------------------------

- There are a number of areas for improvement that are "Quick Hits".
  These are relatively low-risk/low-complexity and can be acted upon
  fairly quickly:

  1) Audit all printers and replace any that are still using PC print
     server hosts with directly network-connected printers.

  2) Migrate the datacenter core LAN fabric from 100BaseT to Gigabit
     Ethernet everywhere.

  3) Build out the datacenter switch topology to accommodate more
     ports, be 1G-capable, and accommodate future growth.  Get rid of
     the daisy-chained switches used today.

  4) Build an IP-connected NAS in the datacenter and migrate all the
     corporate file servers away from locally attached storage to the
     NAS to provide consolidated storage, backup, management, &
     recovery.  (It may be the case that it is easier/more consistent
     to actually mount this on the existing SAN and expand the SAN
     capacity accordingly.)

  5) Continue/accelerate the path to virtualizing Dev/Test/QA images.
     BUT, place the provisioning of these images in the hands of the
     infrastructure organization, not each and every disparate
     development project.

- The QIP DNS infrastructure needs to be audited:

  1) Revisit the overall Enterprise DNS architecture and make sure it
     still makes sense.

  2) Ensure that the versions of 'bind' and 'dhcpd' deployed in QIP
     are new enough to overcome the known security holes of the older
     versions of these tools.

  3) The competitive landscape should be revisited here to see if
     there are better/newer/cheaper integrated DNS solutions.
  4) Examine the possibility of augmenting standard "bare" 'named'
     and 'dhcpd' with open source or commercial DNS/DHCP
     configuration tools.

- The Windows server farm provides a strong opportunity for
  *consolidation*:

  1) Many machines are lightly utilized and thus can be consolidated
     via virtualization.

  2) The data on Windows server utilization is spotty at best.
     Instead of attempting to analytically determine which servers to
     virtualize, do so *empirically*, as follows:

     a) Select the servers that today represent the least powerful
        20% of the Windows servers.

     b) Begin adding servers from that 20% virtually to a target
        machine *while monitoring utilization*.  When the machine
        hosting the virtual server images reaches some threshold
        average utilization (we suggest 75%), consider it "full", and
        start adding virtual servers to the next physical machine.

     c) Over time, you will discover what a reasonable average level
        of utilization is for the machines hosting the virtual
        servers and thus how many such virtual images a given class
        of hardware can support.  (The business case assumes
        consolidation ratios of 3:1, 2:1, and 1:1 for small, medium,
        and large class servers respectively.)

- The Unix server farm offers some opportunities for *migration*:

  1) SAP needs to be migrated to run on Linux instead of AIX.

  2) The various flavors of Oracle currently in use need to be
     migrated to a high-availability Oracle RAC environment, running
     on Linux on either the existing Z-Series mainframe or a new farm
     of purpose-built Linux servers.

- The Unix server farm offers some slight opportunity for
  *consolidation*.

- There is a meaningful opportunity for Linux/Open Source in the
  Retail Store environment.

- A detailed analysis/audit of the core FLEX pricing algorithms needs
  to be undertaken to determine whether more hardware or better
  algorithms (or both) can be brought to bear:

  1) We need to determine the nature of the computational constraint.

  2) We suspect the problem being solved is "NP-Complete".
     If so, there needs to be an investigation of improving or
     introducing bounding heuristics to improve computation speed.

Other Scenarios Considered
--------------------------

- Retail Store Server Consolidation

  1) We examined the possibility of collapsing the 4 servers
     currently used in each store into 2 larger servers.

  2) This scenario is currently a nonstarter because new hardware is
     still being rolled out to the stores this year.  The cost
     recovery thus isn't there to justify a store server
     consolidation.

Major Risks
-----------

- The absence of "before" SLA metrics means that any consolidations
  or other changes made to the system may get blamed for subsequently
  observed "poor" performance.  When this happens, there is no way to
  compare the "after" to the "before" conditions.  Senior management
  needs to understand this and be prepared to manage through it.
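The empirical Windows consolidation procedure recommended under
Specific Technical Recommendations (add virtual images to a host
until its average utilization reaches roughly 75%, then move on to
the next physical machine) amounts to a simple first-fit loop.  The
sketch below is illustrative only: the server names, the uniform 20%
per-server utilization, and the exact threshold are assumptions, not
RandomCo data.

```python
THRESHOLD = 0.75  # suggested "full" mark for a consolidation host


def pack_servers(candidates, threshold=THRESHOLD):
    """Assign (name, avg_utilization) pairs to hosts, first-fit style.

    `candidates` would be the least powerful ~20% of the Windows farm,
    each with its measured average utilization expressed as a fraction
    of the *target* host's capacity.
    """
    hosts = [[]]   # each host is the list of virtual images placed on it
    loads = [0.0]  # projected average utilization per host
    for name, util in candidates:
        # If adding this image would push the current host past the
        # threshold, declare it "full" and start filling the next one.
        if loads[-1] and loads[-1] + util > threshold:
            hosts.append([])
            loads.append(0.0)
        hosts[-1].append(name)
        loads[-1] += util
    return hosts, loads


# Illustrative run: ten lightly utilized servers at 20% load each.
candidates = [("win%02d" % i, 0.20) for i in range(10)]
hosts, loads = pack_servers(candidates)
print(len(hosts))  # physical hosts needed for the ten candidates
```

With the assumed uniform 20% load, the ten candidates fit on four
hosts (three images each plus one remainder), roughly a 2.5:1
consolidation ratio, in line with the 3:1/2:1/1:1 ratios the business
case assumes for small, medium, and large class servers.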