s5 DEMO!!!! - RandomCo Thoughts
$Id: s5demo.txt,v 1.1 2006/04/05 17:11:27 tundra Exp $
- It would be best if we could get all the way through the presentation
in "Read Only" mode. This will allow you to understand the entire
analysis in context. This will be followed by an open-ended QA session.
- Ed Murphy will present our findings using interpretive dance.
- As you hear us present today, you will have one of three responses:
- "Oh, I/we already know that."
- "Hmm, that's new to me - good point."
- "I thoroughly disagree with what you just said."
Each of these is an important way that you will validate and/or act upon
- A lot of baseline data such as machine inventories, levels of utilization,
application response time, and arrival rate profiles was either missing,
incomplete, or inconsistent across systems. Some data (such as pricing)
was completely unavailable to us. We've thus built the models using
estimates for certain critical datapoints. RandomCo can update these models
at will to make use of "real" data and thereby get better model output.
- The RandomCo IT culture is end-user/project focused to a fault.
Infrastructure is (mostly) enhanced incrementally and is not treated
as a common asset designed to serve the entire breadth of
- This has led to an overprovisioning of some classes of servers
(Windows) and a greater variety of system types (Unix) than is
- Operational disciplines such as asset inventory control, measurement,
and reporting vary greatly in depth and quality. This makes it hard
to manage what is not consistently measured.
- Business, Architecture, Development, Infrastructure, and
Operations are not bound together with a common overarching view of
IT at RandomCo. There is a tendency for each of these to operate
more as a silo. The Architecture team tends to have the broadest
view of these issues.
- Customizations to key business subsystems such as SAP and Manugistics
are creating a high degree of operational complexity that may
not be justified.
- The strict commitment to a Windows/.NET-only development environment
is unnecessarily constraining the organization's agility, time-to-market,
and ability to control costs.
- Infrastructure provisioning needs to move from a project-centric
model to an Enterprise-wide service model.
1) This will maximize reusability of extant infrastructure assets.
2) This will enable a systemic perspective for provisioning new
infrastructure with attendant economies of scale.
3) The current complexity, variety, and underutilization of systems
at RandomCo is a direct consequence of making project-based
infrastructure decisions. No amount of migration/consolidation
will make a permanent difference if the underlying root cause
practice that caused the situation in the first place is not
- Measurement, Monitoring, and Management need to be made more
consistent and reach more widely across the IT operational
1) Basic asset information such as server inventory, software
revision levels, machine age, and so on varies widely by
platform. This makes business case cost calculations for new
initiatives difficult, and in some cases, impossible.
2) Today there is wide variation in the depth and quality of
performance and capacity metrics available across all the
datacenter assets. This makes tuning and capacity planning a
vertical, per-server activity (if at all), rather than a
systemic infrastructure concern.
3) In short, "You Cannot Manage What You Do Not Measure."
- RandomCo today is already making good use of virtualization
in the "Big Box" Unix and Mainframe areas. This needs to be
extended to the Windows-class servers as well.
1) This will allow more efficient use of existing server capacity.
2) This will enable rapid resource (re)provisioning on a per-project,
or perhaps even per-event, basis.
3) This will decouple applications software from underlying
operating environments by testing and certifying the application
to the *virtual OS*, not the physical hardware. This will
materially reduce the retesting burden currently incurred when
hardware is upgraded or changed.
- RandomCo should begin the necessary steps to reduce the number of
different Unix variants within the IT organization and reduce its
total dependence on Windows as a server platform. Wherever
possible, these should be migrated to SLES Linux across the required
breadth of hardware. RandomCo will benefit in doing so because:
1) This creates a common operational platform thereby reducing
training cost and maximally leveraging the employee skill set.
2) This makes the organization hardware-agnostic, thereby providing
negotiation leverage with the hardware vendors.
3) The net software licensing cost should drop significantly:
a) RandomCo already has an Enterprise License for SLES.
b) SLES will be bundled with Xen virtualization in future
releases. This should be considerably less expensive than the
separate licensing of Windows and VMware on today's servers.
4) The first candidate for elimination is AIX.
- RandomCo needs to embrace Linux as a development platform for its own
1) This will give it many more degrees of freedom in how it designs,
deploys, and operates its own applications.
2) This will open the door to cost reduction by replacing
expensive enabling components (like IIS) with free or very
inexpensive open source equivalents (like Apache).
3) This will enable "scale" at the *organizational* level. Today,
there is a significant difference in worldview, skillset, and
approach between the Windows developers and the rest of the
RandomCo IT community. By moving to make Linux one of the common
development platforms, RandomCo will open the door to having the
in-house applications it develops run on everything from an
entry-level machine through an Enterprise-class mainframe. This
will be done with a common set of development tools,
technologies, and *people* across the organization, with a far
stronger alignment between Architecture, Development, and
Specific Technical Recommendations
- There are a number of areas for improvement that are "Quick Hits".
These are relatively low risk/low complexity and can be acted
upon fairly quickly:
1) Audit all printers and replace any that are still using PC
print server hosts with direct network connected printers.
2) Migrate the datacenter core LAN fabric from 100BASE-T to
Gigabit Ethernet everywhere.
3) Build out the datacenter switch topology to accommodate more
ports, be 1G capable, and accommodate future growth. Get rid
of the daisy-chained switches used today.
4) Build an IP-connected NAS in the datacenter and migrate all the
corporate file servers away from locally attached storage to
the NAS to provide consolidated storage, backup, management, &
recovery. (It may be the case that it is easier/more consistent
to actually mount this on the existing SAN and expand the
SAN capacity accordingly.)
5) Continue/accelerate the path to virtualizing Dev/Test/QA
images. BUT, place the provisioning of these images into the
hands of the infrastructure organization, not each and every
disparate development project.
- The QIP DNS infrastructure needs to be audited:
1) Revisit the overall Enterprise DNS architecture and make sure
it still makes sense.
2) Ensure that the versions of 'bind' and 'dhcpd' deployed in QIP
are new enough to overcome the known security holes of the
older versions of these tools.
3) The competitive landscape should be revisited here to see if there
are better/newer/cheaper integrated DNS solutions.
4) Examine the possibility of augmenting standard "bare" 'named' and
'dhcpd' with open source or commercial DNS/DHCP configuration tools.
- The Windows server farm provides a strong opportunity for *consolidation*:
1) Many machines are lightly utilized and thus can be consolidated
2) The data on Windows server utilization is spotty at best. Instead
of attempting to analytically determine which servers to virtualize,
do so *empirically*, as follows:
a) Select the servers that today represent the least powerful
20% of Windows servers.
b) Begin adding servers from that 20% virtually to a target
machine *while monitoring utilization*. When the machine
hosting the virtual server images reaches some threshold
average utilization (we suggest 75%), consider it "full",
and start adding virtual servers to the next physical machine.
c) Over time, you will discover what a reasonable average
level of utilization is for the machines hosting the virtual
servers and thus how many such virtual images a given class of
hardware can support. (The business case assumes consolidation
ratios of 3:1, 2:1, and 1:1 for small, medium, and large class
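The empirical fill-until-full procedure in steps a) through c) can be sketched in code. This is only an illustration under stated assumptions: the `Host` class, the `consolidate` function, the `vmhostN` naming, and the `measure_utilization` callback are all hypothetical, and in practice the utilization figure would come from whatever monitoring RandomCo actually deploys, not from a formula. Only the 75% threshold comes from the text above.

```python
from dataclasses import dataclass, field

FULL_THRESHOLD = 0.75  # the suggested "full" mark from the text


@dataclass
class Host:
    """A physical machine hosting virtual server images (hypothetical model)."""
    name: str
    guests: list = field(default_factory=list)
    full: bool = False


def consolidate(candidates, measure_utilization, threshold=FULL_THRESHOLD):
    """Add guests one at a time, measuring the host after each addition.

    candidates: names of servers to virtualize, least powerful first.
    measure_utilization: callable(host) -> average utilization in [0, 1].
        Assumed to be backed by real monitoring data in practice.
    """
    hosts = []
    for guest in candidates:
        # Start a new physical machine when there is none or the last is full.
        if not hosts or hosts[-1].full:
            hosts.append(Host(name=f"vmhost{len(hosts) + 1}"))
        host = hosts[-1]
        host.guests.append(guest)
        # Mark the host "full" once it reaches the threshold.
        if measure_utilization(host) >= threshold:
            host.full = True
    return hosts


if __name__ == "__main__":
    # Stand-in measurement: pretend each guest adds 20% utilization.
    servers = [f"win{i:02d}" for i in range(10)]
    for h in consolidate(servers, lambda host: 0.2 * len(host.guests)):
        print(h.name, len(h.guests), "full" if h.full else "open")
```

The guests-per-host counts that come out of a real run are the empirical consolidation ratios, which can then be compared with the ratios assumed in the business case.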
- The Unix server farm offers some opportunities for *migration*:
1) SAP needs to be migrated to run on Linux instead of AIX.
2) The various flavors of Oracle currently in use need to be
migrated to a high-availability Oracle RAC environment, running
on Linux on either the existing Z-Series mainframe or a new
farm of purpose-built Linux servers.
- The Unix server farm offers some slight opportunity for *consolidation*:
<Niel needs to fill this in as regards to the non-SAP AIX servers
and their consolidation.>
- There is a meaningful opportunity for Linux/Open Source in the Retail
<Niel/Tom need to fill this in here>
- A detailed analysis/audit of the core FLEX pricing algorithms needs
to be undertaken to determine whether more hardware or better
algorithms (or both) can be brought to bear:
1) Need to determine the nature of the computational constraint.
2) We suspect the problem being solved is "NP-Complete". If so,
there needs to be an investigation of improving/introducing
bounding heuristics to improve computation speed.
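To make "bounding heuristics" concrete, here is a minimal sketch of branch-and-bound pruning. We have not seen the FLEX algorithms, so this uses the 0/1 knapsack problem (a classic NP-hard optimization problem) purely as a stand-in; the function name and the sample data are illustrative assumptions and say nothing about how FLEX pricing actually works.

```python
def knapsack_branch_and_bound(items, capacity):
    """Toy 0/1 knapsack solver that prunes with an optimistic bound.

    items: list of (value, weight) pairs with weight > 0.
    Returns (best_value, nodes_visited).
    """
    # Sort by value density so the greedy fractional bound is valid.
    items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0
    nodes = 0

    def bound(i, value, room):
        # Optimistic estimate: fill the remaining room greedily,
        # allowing a fraction of the first item that does not fit.
        for v, w in items[i:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    def search(i, value, room):
        nonlocal best, nodes
        nodes += 1
        best = max(best, value)
        # Prune: even the optimistic estimate cannot beat the best so far.
        if i == len(items) or bound(i, value, room) <= best:
            return
        v, w = items[i]
        if w <= room:
            search(i + 1, value + v, room - w)  # take item i
        search(i + 1, value, room)              # skip item i

    search(0, 0, capacity)
    return best, nodes


if __name__ == "__main__":
    print(knapsack_branch_and_bound([(60, 10), (100, 20), (120, 30)], 50))
```

The point of the sketch is the `bound` check: a cheap optimistic estimate lets the search discard whole subtrees without exploring them. Whether a comparable bound exists for the FLEX problem is exactly what the proposed audit would need to establish.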
Other Scenarios Considered
- Retail Store Server Consolidation
1) We examined the possibility of collapsing the 4 servers currently
used in each store into 2 larger servers.
2) This scenario is currently a nonstarter because new hardware is
still being rolled out to the stores this year. The cost recovery
thus isn't there to justify a store server consolidation.
- The absence of "before" SLA metrics means that any consolidations or
other changes made to the system may get blamed for subsequently
seen "poor" performance. When this happens there is no way to
compare the "after" to the "before" conditions. Senior management
needs to understand this and be prepared to manage through it.
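One way to reduce this exposure is to capture even a crude baseline before any change is made. The sketch below assumes response-time samples in milliseconds are available from some source; the function names, the choice of median and 95th percentile, and the 10% tolerance are all illustrative assumptions, not agreed SLA definitions.

```python
import statistics


def summarize(samples_ms):
    """Reduce raw response-time samples (ms) to a small baseline record.

    Needs at least two samples, as required by statistics.quantiles.
    """
    ordered = sorted(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100)[94]
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": p95,
    }


def compare(before, after, tolerance=0.10):
    """Return metrics where 'after' is worse than 'before' by more than
    the tolerance (10% here is an arbitrary placeholder)."""
    regressions = {}
    for key in ("median_ms", "p95_ms"):
        if after[key] > before[key] * (1 + tolerance):
            regressions[key] = (before[key], after[key])
    return regressions


if __name__ == "__main__":
    baseline = summarize([120, 130, 125, 140, 135, 128, 132, 138, 127, 131])
    current = summarize([150, 160, 155, 170, 165, 158, 162, 168, 157, 161])
    print(compare(baseline, current))
```

With a stored "before" record like this, a complaint about "poor" performance after a consolidation can be checked against numbers instead of recollection.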