@tundra tundra on 5 Apr 2006 11 KB Initial revision
.. title::
  s5 DEMO!!!! - RandomCo Thoughts

.. footer::
  $Id: s5demo.txt,v 1.1 2006/04/05 17:11:27 tundra Exp $


.. contents::

Meeting Mechanics
-----------------

- It would be best if we could get all the way through the presentation
  in "Read Only" mode.  This will allow you to understand the entire
  analysis in context.  This will be followed by an open-ended Q&A session.

- Ed Murphy will present our findings using interpretive dance.


Possible Responses
------------------

- As you hear us present today, you will have one of three responses:

   - "Oh, I/we already know that."
   - "Hmm, that's new to me - good point."
   - "I thoroughly disagree with what you just said."

  Each of these is an important way for you to validate and/or act upon
  our findings.

Assumptions
-----------

- A lot of baseline data such as machine inventories, levels of utilization,
  application response time, and arrival rate profiles was either missing,
  incomplete, or inconsistent across systems.  Some data (such as pricing)
  was completely unavailable to us.  We've thus built the models using
  estimates for certain critical datapoints.  RandomCo can update these models
  at will to make use of "real" data and thereby get better model output.


Key Findings
------------

- The RandomCo IT culture is end-user/project focused to a fault.
  Infrastructure is (mostly) enhanced incrementally and is not treated
  as a common asset designed to serve the entire breadth of
  applications.

- This has led to an overprovisioning of some classes of servers
  (Windows) and a greater variety of system types (Unix) than is
  strictly necessary.

- Operational disciplines such as asset inventory control, measurement,
  and reporting vary greatly in depth and quality.  This makes it hard
  to manage what is not consistently measured.

- Business, Architecture, Development, Infrastructure, and
  Operations are not bound together with a common overarching view of
  IT at RandomCo.  Each of these tends to operate as a silo.
  The Architecture team tends to have the broadest view of these
  issues.

- Customizations to key business subsystems such as SAP and Manugistics
  are creating a high degree of operational complexity that may
  not be justified.

- The strict commitment to a Windows/.NET-only development environment
  is unnecessarily constraining the organization's agility, time-to-market,
  and ability to control costs.


Core Themes
-----------

- Infrastructure provisioning needs to move from a project-centric
  model to an Enterprise-wide service model.

  1) This will maximize reusability of extant infrastructure assets.

  2) This will enable a systemic perspective for provisioning new
     infrastructure with attendant economies of scale.

  3) The current complexity, variety, and underutilization of systems
     at RandomCo is a direct consequence of making project-based
     infrastructure decisions.  No amount of migration/consolidation
     will make a permanent difference if the underlying practice
     that caused the situation in the first place is not addressed.

- Measurement, Monitoring, and Management need to be made more
  consistent and reach more widely across the IT operational
  environment:

   1) Basic asset information such as server inventory, software
      revision levels, machine age, and so on varies widely by
      platform.  This makes business case cost calculations for new
      initiatives difficult, and in some cases, impossible.

   2) Today there is wide variation in the depth and quality of
      performance and capacity metrics available across all the
      datacenter assets.  This makes tuning and capacity planning a
      vertical, per-server activity (if at all), rather than a
      systemic infrastructure concern.

   3) In short, "You Cannot Manage What You Do Not Measure."

- RandomCo today is already making good use of virtualization
  in the "Big Box" Unix and Mainframe areas.  This needs to be 
  extended to the Windows-class servers as well.

  1) This will allow more efficient use of existing server capacity.

  2) This will enable rapid resource (re)provisioning on a per-project
     or perhaps even per-event basis.

  3) This will decouple applications software from underlying
     operating environments by testing and certifying the application
     to the *virtual OS*, not the physical hardware.  This will
     materially reduce the retesting burden currently incurred when
     hardware is upgraded or changed.

- RandomCo should begin the necessary steps to reduce the number of
  different Unix variants within the IT organization and reduce its
  total dependence on Windows as a server platform.  Wherever
  possible, these should be migrated to SLES Linux across the required
  breadth of hardware.  RandomCo will benefit in doing so because:

  1) This creates a common operational platform thereby reducing
     training cost and maximally leveraging the employee skill set.

  2) This makes the organization hardware-agnostic, thereby providing
     negotiation leverage with the hardware vendors.

  3) The net software licensing cost should drop significantly:

     a) RandomCo already has an Enterprise License for SLES.

     b) SLES will be bundled with Xen virtualization in future
        releases.  This should be considerably less expensive than the
        separate licensing of Windows and VMware on today's servers.

  4) The first candidate for elimination is AIX.


- RandomCo needs to embrace Linux as a development platform for its own
  customized software:

   1) This will give it many more degrees of freedom in how it designs,
      deploys, and operates its own applications.

   2) This will open the door to cost reduction by replacing 
      expensive enabling components (like IIS) with free or very
      inexpensive open source equivalents (like Apache).

   3) This will enable "scale" at the *organizational* level.  Today,
      there is a significant difference in worldview, skillset, and
      approach between the Windows developers and the rest of the
      RandomCo IT community.  By moving to make Linux one of the common
      development platforms, RandomCo will open the door to having the
      in-house applications it develops run on everything from an
      entry-level machine through an Enterprise-class mainframe.  This
      will be done with a common set of development tools,
      technologies, and *people* across the organization, with a far
      stronger alignment between Architecture, Development, and
      Operations.


Specific Technical Recommendations
----------------------------------

- There are a number of areas for improvement that are "Quick Hits".
  These are relatively low risk/low complexity and can be acted
  upon fairly quickly:

    1) Audit all printers and replace any that are still using PC
       print server hosts with direct network connected printers.

    2) Migrate the datacenter core LAN fabric from 100BASE-T to
       Gigabit Ethernet everywhere.

    3) Build out the datacenter switch topology to accommodate more
       ports, be 1G capable, and accommodate future growth.  Get rid
       of the daisy-chained switches used today.

    4) Build an IP-connected NAS in the datacenter and migrate all the
       corporate file servers away from locally attached storage to
       the NAS to provide consolidated storage, backup, management, &
       recovery.  (It may be the case that it is easier/more consistent
       to actually mount this on the existing SAN and expand the
       SAN capacity accordingly.)

    5) Continue/accelerate the path to virtualizing Dev/Test/QA
       images.  BUT, place the provisioning of these images in the
       hands of the infrastructure organization, not those of each
       disparate development project.

- The QIP DNS infrastructure needs to be audited:

    1) Revisit the overall Enterprise DNS architecture and make sure
       it still makes sense.

    2) Ensure that the versions of 'bind' and 'dhcpd' deployed in QIP
       are new enough to overcome the known security holes of the
       older versions of these tools.

    3) The competitive landscape should be revisited here to see if there
       are better/newer/cheaper integrated DNS solutions.

    4) Examine the possibility of augmenting standard "bare" 'named' and
       'dhcpd' with open source or commercial DNS/DHCP configuration tools.

- The Windows server farm provides a strong opportunity for *consolidation*:

    1) Many machines are lightly utilized and thus can be consolidated
       via virtualization.

    2) The data on Windows server utilization is spotty at best.  Instead
       of attempting to analytically determine which servers to virtualize,
       do so *empirically*, as follows:

       a) Select the servers that today represent the least powerful
          20% of Windows servers.

       b) Begin adding servers from that 20% virtually to a target
          machine *while monitoring utilization*.  When the machine
          hosting the virtual server images reaches some threshold
          average utilization (we suggest 75%), consider it "full",
          and start adding virtual servers to the next physical machine.

       c) Over time, you will discover what a reasonable average
          level of utilization is for the machines hosting the virtual
          servers and thus how many such virtual images a given class of
          hardware can support.  (The business case assumes consolidation
          ratios of 3:1, 2:1, and 1:1 for small, medium, and large class
          servers respectively.)
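
  The empirical procedure above can be sketched roughly as follows.
  This is only an illustration under stated assumptions: the ``Host``
  class, the ``consolidate`` function, and the ``measure_utilization``
  callback are hypothetical names, and how utilization is actually
  sampled on the hosting machine is left entirely to the monitoring
  tools in use.  Only the 75% threshold comes from the recommendation
  itself.

  .. code:: python

     from dataclasses import dataclass, field

     FULL_THRESHOLD = 0.75  # suggested "full" mark from step b)


     @dataclass
     class Host:
         """A physical machine that hosts virtual server images (hypothetical)."""
         name: str
         guests: list = field(default_factory=list)


     def consolidate(candidates, host_names, measure_utilization,
                     threshold=FULL_THRESHOLD):
         """Add candidates to hosts one at a time, watching utilization.

         measure_utilization(host) is a hypothetical callback that returns
         the observed average utilization (0.0 to 1.0) of the host after
         the most recent guest was added.
         """
         hosts = [Host(name) for name in host_names]
         current = 0
         for server in candidates:
             if current >= len(hosts):
                 raise RuntimeError("ran out of target hosts")
             hosts[current].guests.append(server)
             # Once the host measures "full", move on to the next machine.
             if measure_utilization(hosts[current]) >= threshold:
                 current += 1
         return hosts

  For a dry run, ``measure_utilization`` could be a stand-in such as
  ``lambda host: 0.2 * len(host.guests)``; in practice it would wrap
  whatever measurement the operations team trusts.  The number of
  guests each host ends up with is the empirically discovered
  consolidation ratio for that class of hardware.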

- The Unix server farm offers some opportunities for *migration*:

    1) SAP needs to be migrated to run on Linux instead of AIX.

    2) The various flavors of Oracle currently in use need to be
       migrated to a high-availability Oracle RAC environment, running
       on Linux on either the existing Z-Series mainframe or a new
       farm of purpose-built Linux servers.

- The Unix server farm offers some slight opportunity for *consolidation*:

    <Niel needs to fill this in as regards to the non-SAP AIX servers
     and their consolidation.>

- There is a meaningful opportunity for Linux/Open Source in the Retail
  Store environment:

     <Niel/Tom need to fill this in here>

- A detailed analysis/audit of the core FLEX pricing algorithms needs
  to be undertaken to determine whether more hardware or better
  algorithms (or both) can be brought to bear:

    1) Need to determine the nature of the computational constraint.

    2) We suspect the problem being solved is "NP-Complete".  If so,
       there needs to be an investigation of improving/introducing
       bounding heuristics to improve computation speed.
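
  To illustrate what a bounding heuristic buys, here is a minimal
  branch-and-bound sketch.  It does not model the FLEX pricing
  algorithms, whose details were not available to us; it uses the
  0/1 knapsack problem purely as a stand-in for a hard combinatorial
  search.  The function name and the data layout are assumptions made
  for this example.

  .. code:: python

     def knapsack_branch_and_bound(items, capacity):
         """Return the best total value for a 0/1 knapsack.

         items is a list of (value, weight) pairs with positive weights.
         """
         # Sort by value density so the optimistic bound is tight.
         items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
         best = 0

         def bound(i, value, room):
             # Optimistic estimate: fill the remaining room greedily and
             # allow a fraction of the first item that does not fit.
             for v, w in items[i:]:
                 if w <= room:
                     value += v
                     room -= w
                 else:
                     return value + v * room / w
             return value

         def search(i, value, room):
             nonlocal best
             if value > best:
                 best = value
             # Prune: this branch cannot beat the best solution found so far.
             if i == len(items) or bound(i, value, room) <= best:
                 return
             v, w = items[i]
             if w <= room:
                 search(i + 1, value + v, room - w)  # take item i
             search(i + 1, value, room)              # skip item i

         search(0, 0, capacity)
         return best

  The pruning test is the bounding heuristic: whole subtrees are
  skipped whenever an optimistic estimate shows they cannot improve on
  the best answer already in hand.  Whether a comparable bound exists
  for the pricing problem is exactly what the proposed audit would
  need to establish.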


Other Scenarios Considered
--------------------------

- Retail Store Server Consolidation

  1) We examined the possibility of collapsing the 4 servers currently
     used in each store into 2 larger servers.

  2) This scenario is currently a nonstarter because new hardware is
     still being rolled out to the stores this year.  The cost recovery
     thus isn't there to justify a store server consolidation.


Major Risks
-----------

- The absence of "before" SLA metrics means that any consolidations or
  other changes made to the system may get blamed for subsequently
  seen "poor" performance.  When this happens there is no way to
  compare the "after" to the "before" conditions.  Senior management
  needs to understand this and be prepared to manage through it.
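
  One way to reduce this risk is to capture even a rough baseline
  before any change is made.  The sketch below assumes response-time
  samples in milliseconds are available from some existing source.
  The ``summarize`` and ``compare`` functions are hypothetical helpers
  written for this example, not part of any tool in use at RandomCo.

  .. code:: python

     import statistics


     def summarize(samples_ms):
         """Return mean, median, and 95th percentile of response times (ms)."""
         # n=20 yields 19 cut points; the last one is the 95th percentile.
         cuts = statistics.quantiles(samples_ms, n=20)
         return {
             "mean": statistics.mean(samples_ms),
             "median": statistics.median(samples_ms),
             "p95": cuts[-1],
         }


     def compare(before_ms, after_ms):
         """Return the percentage change of each metric from before to after."""
         before = summarize(before_ms)
         after = summarize(after_ms)
         return {key: 100.0 * (after[key] - before[key]) / before[key]
                 for key in before}

  Recording the output of ``summarize`` for each key application
  before a consolidation gives a concrete "before" figure to set
  against later complaints of "poor" performance.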