Service Level Agreement (SLA) and Support Policy
Policy Status: Draft
This policy is currently in draft.
Purpose
This policy defines the service levels and support standards that the IT department provides to internal stakeholders. It sets clear expectations for system availability, performance, response times, and support services, and it supports accountability and continuous service improvement.
Scope
This policy applies to:
- All IT services and systems provided by Acme Corp
- Internal support functions for employees and contractors
- System availability and performance commitments
- Incident response and problem resolution
- Support request handling and prioritization
- Service desk operations

This policy covers support for:
- End-user computing (laptops, desktops, mobile devices)
- Business applications and systems
- Network and connectivity
- Email and collaboration tools
- Cloud services and infrastructure
- Security and access issues
Policy Statement
Service Availability Commitments
Acme Corp commits to the following system availability targets:
Critical Systems (Student Portal, Authentication, Core Database):
- Availability Target: 99.9% uptime (maximum 43 minutes downtime per month)
- Measurement Period: Monthly
- Exclusions: Scheduled maintenance windows, force majeure events

Essential Systems (Email, Collaboration Tools, CRM):
- Availability Target: 99.5% uptime (maximum 3.6 hours downtime per month)
- Measurement Period: Monthly
- Exclusions: Scheduled maintenance windows

Standard Systems (Internal tools, reporting systems):
- Availability Target: 99% uptime (maximum 7.2 hours downtime per month)
- Measurement Period: Monthly
- Exclusions: Scheduled maintenance windows

Scheduled Maintenance:
- Does not count against availability targets
- Performed during designated maintenance windows
- Communicated 48 hours in advance minimum
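
The downtime allowances above follow from the availability percentages. The sketch below shows the arithmetic, assuming a 30-day measurement month (the policy does not state the month length); the helper name `downtime_budget_minutes` is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Return the maximum unplanned downtime, in minutes, for one month.

    Assumes a 30-day month by default. Scheduled maintenance is not
    modeled here because the policy excludes it from the targets.
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for tier, target in [("Critical", 99.9), ("Essential", 99.5), ("Standard", 99.0)]:
        minutes = downtime_budget_minutes(target)
        print(f"{tier}: {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Under the 30-day assumption the budgets are about 43.2 minutes, 216 minutes (3.6 hours), and 432 minutes (7.2 hours), which agree with the figures stated above. A 31-day month would give slightly larger allowances.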
Performance Standards
System performance targets:
Response Time:
- Critical Systems: < 2 seconds for 95% of requests
- Essential Systems: < 3 seconds for 95% of requests
- Standard Systems: < 5 seconds for 95% of requests

Transaction Processing:
- Batch Jobs: Complete within scheduled windows
- Real-time Processing: < 1 second for 95% of transactions
- Report Generation: < 30 seconds for standard reports

Data Synchronization:
- Real-time Systems: < 5 second delay
- Near Real-time Systems: < 5 minute delay
- Batch Synchronization: Within scheduled timeframes
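
The "95% of requests" response-time targets can be checked against measured latencies. The sketch below is one possible way to do that; the function name, tier keys, and sample handling are hypothetical, and the policy does not prescribe how the percentage is calculated.

```python
from typing import Sequence

# Response-time thresholds in seconds, taken from the targets above.
RESPONSE_THRESHOLDS = {"critical": 2.0, "essential": 3.0, "standard": 5.0}


def meets_response_target(latencies: Sequence[float], tier: str,
                          required_fraction: float = 0.95) -> bool:
    """Return True if enough requests finished under the tier's threshold."""
    if not latencies:
        raise ValueError("no latency samples provided")
    threshold = RESPONSE_THRESHOLDS[tier]
    # Count requests strictly faster than the threshold ("< 2 seconds").
    within = sum(1 for value in latencies if value < threshold)
    return within / len(latencies) >= required_fraction
```

For example, 19 of 20 requests under 2 seconds meets the Critical Systems target, while 18 of 20 does not.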
Incident Priority Levels
Incidents are classified by business impact and urgency:
Priority 1 - Critical:
- Definition: Complete system outage or critical security breach affecting all users or compromising sensitive data
- Examples: Student portal down, authentication failure, data breach, complete network outage
- Response Time: 15 minutes
- Resolution Target: 4 hours
- Communication: Hourly updates

Priority 2 - High:
- Definition: Major functionality impaired affecting multiple users or departments
- Examples: Partial system outage, performance degradation, integration failures
- Response Time: 1 hour
- Resolution Target: 8 hours (within same business day)
- Communication: Updates every 4 hours

Priority 3 - Medium:
- Definition: Moderate impact to individual users or non-critical systems
- Examples: Single user access issues, minor feature malfunction, non-critical system issues
- Response Time: 4 hours
- Resolution Target: 24 hours (1 business day)
- Communication: Daily updates

Priority 4 - Low:
- Definition: Minor issues with workarounds available or enhancement requests
- Examples: Cosmetic issues, feature requests, minor inconveniences
- Response Time: 24 hours (1 business day)
- Resolution Target: 5 business days
- Communication: Updates as needed
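
The response and resolution targets can be expressed as a lookup table from which ticket deadlines are derived. The following is a minimal sketch: the names `PriorityTarget`, `PRIORITY_TARGETS`, and `sla_deadlines` are hypothetical, and business-day wording is simplified to elapsed clock time (an assumption, since the policy does not define a business-hours calendar for SLA clocks).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PriorityTarget:
    response: timedelta
    resolution: timedelta


# Targets copied from the priority definitions above. "5 business days"
# is modeled as 5 x 24 elapsed hours, which is a simplification.
PRIORITY_TARGETS = {
    1: PriorityTarget(timedelta(minutes=15), timedelta(hours=4)),
    2: PriorityTarget(timedelta(hours=1), timedelta(hours=8)),
    3: PriorityTarget(timedelta(hours=4), timedelta(hours=24)),
    4: PriorityTarget(timedelta(hours=24), timedelta(days=5)),
}


def sla_deadlines(priority: int, opened_at: datetime) -> tuple[datetime, datetime]:
    """Return (response_due, resolution_due) for a ticket opened at opened_at."""
    target = PRIORITY_TARGETS[priority]
    return opened_at + target.response, opened_at + target.resolution
```

A Priority 1 ticket opened at 09:00 would, under these assumptions, need a response by 09:15 and a resolution by 13:00 the same day.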
Support Hours and Availability
Business Hours Support (Primary):
- Hours: Monday - Friday, 8:00 AM - 6:00 PM EST
- Coverage: Full IT support team available
- Channels: Email, phone, Slack, help desk portal
- Response: Per SLA response times

Extended Hours Support:
- Hours: Monday - Friday, 6:00 PM - 10:00 PM EST
- Coverage: Limited on-call support for Priority 1 and 2 incidents
- Channels: Phone, email (monitored hourly), Slack
- Response: Priority 1 within 30 minutes, Priority 2 within 2 hours

Weekend and Holiday Support:
- Hours: Saturday, Sunday, and company holidays
- Coverage: On-call engineer for Priority 1 incidents only
- Channels: Emergency phone line
- Response: Priority 1 within 1 hour

24/7 Critical System Monitoring:
- Automated monitoring and alerting active 24/7
- On-call rotation for critical system issues
- Escalation procedures for after-hours incidents
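
The support windows above can be turned into a simple classifier. This is a sketch under stated assumptions: timestamps are already expressed in the policy's time zone, the holiday flag is supplied by the caller, and the label `overnight_monitoring` for weekday hours outside 8:00 AM to 10:00 PM is a hypothetical name for the period covered only by 24/7 monitoring and the on-call rotation.

```python
from datetime import datetime


def support_window(moment: datetime, is_holiday: bool = False) -> str:
    """Classify a local timestamp into one of the support windows.

    Returns "business", "extended", "weekend_holiday",
    or "overnight_monitoring".
    """
    if is_holiday or moment.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend_holiday"
    if 8 <= moment.hour < 18:   # 8:00 AM up to 6:00 PM
        return "business"
    if 18 <= moment.hour < 22:  # 6:00 PM up to 10:00 PM
        return "extended"
    return "overnight_monitoring"
```

A ticketing integration could use the returned label to pick the applicable response target, for example 30 minutes for a Priority 1 incident raised during extended hours.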
Escalation Procedures
Incidents are escalated when they are not resolved within SLA targets:
Level 1 - Help Desk:
- Initial point of contact for all support requests
- Handle common issues and requests
- Escalate to Level 2 after 50% of resolution target time

Level 2 - Technical Specialists:
- Subject matter experts for specific systems
- Handle complex technical issues
- Escalate to Level 3 after 75% of resolution target time

Level 3 - Senior Engineers/Architects:
- Senior technical staff and system architects
- Handle critical issues and complex problems
- Escalate to management for resource decisions

Management Escalation:
- IT Operations Manager: For Priority 1 and 2 incidents not resolved within SLA
- CTO: For Priority 1 incidents exceeding 4 hours or requiring executive decisions

Automated Escalation:
- Automatic escalation when SLA targets are at risk
- Notifications sent to relevant parties
- Escalation tracking in ticketing system
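
The time-based handoffs between support levels can be sketched as a function of how much of the resolution target has elapsed. This illustration considers only elapsed time, treats "after 50%" and "after 75%" as inclusive thresholds (an assumption), and uses the hypothetical name `responsible_level`; analysts may still escalate earlier on technical grounds.

```python
def responsible_level(elapsed_hours: float, resolution_target_hours: float) -> int:
    """Return the support level (1, 2, or 3) that should own the ticket."""
    if resolution_target_hours <= 0:
        raise ValueError("resolution target must be positive")
    fraction = elapsed_hours / resolution_target_hours
    if fraction >= 0.75:
        return 3  # Senior Engineers/Architects
    if fraction >= 0.50:
        return 2  # Technical Specialists
    return 1      # Help Desk
```

For a Priority 2 incident with an 8-hour target, this places the ticket with Level 2 from the 4-hour mark and with Level 3 from the 6-hour mark.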
User Responsibilities
Users are expected to:
- Accurate Information: Provide complete and accurate information when reporting issues
- Timely Reporting: Report issues promptly when discovered
- Cooperation: Cooperate with IT staff during troubleshooting
- Priority Honesty: Accurately represent incident priority (not inflate urgency)
- Testing Assistance: Participate in testing and validation when requested
- Documentation Review: Review and follow available documentation before requesting support
- Authorized Requests: Only submit requests for systems/services they're authorized to use
- Feedback: Provide feedback on support quality through surveys
Service Request Management
Request Types:
- Incident: Something is broken or not working as expected
- Service Request: Request for standard service (access, equipment, software)
- Change Request: Request to modify systems or configurations
- Problem: Recurring incidents requiring root cause investigation
- Enhancement: Request for new features or capabilities

Request Submission:
- Primary: Help desk portal (preferred method)
- Email: help@acmecorp.com
- Phone: (555) 123-4567
- Slack: #it-support channel
- In-person: IT office (non-urgent requests)

Required Information:
- Clear description of issue or request
- Business impact and urgency
- Affected users or systems
- Steps to reproduce (for incidents)
- Screenshots or error messages when applicable
- Contact information for follow-up
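
A help desk form or intake script could check the required information before a ticket is created. The sketch below is illustrative only: the field names and the function `missing_fields` are hypothetical, and screenshots are left out because the policy asks for them only when applicable.

```python
# Hypothetical field names standing in for the required information above.
ALWAYS_REQUIRED = ("description", "business_impact", "affected", "contact")


def missing_fields(request: dict[str, str], request_type: str) -> list[str]:
    """Return the required fields that are absent or blank in the request."""
    required = list(ALWAYS_REQUIRED)
    if request_type == "incident":
        # Steps to reproduce are required for incidents only.
        required.append("steps_to_reproduce")
    return [name for name in required if not request.get(name, "").strip()]
```

An empty result means the submission carries everything the policy lists; otherwise the returned names can be shown to the user as prompts.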
Roles and Responsibilities
| Role | Responsibility |
|---|---|
| Chief Technology Officer | Overall accountability for service levels, approve SLA targets, review performance |
| IT Operations Manager | Manage support operations, ensure SLA compliance, oversee escalations |
| Help Desk Manager | Lead help desk team, monitor ticket queue, ensure response time compliance |
| Help Desk Analysts | First-line support, ticket triage, resolution of common issues |
| Technical Specialists | Second-line support, resolve complex technical issues |
| On-Call Engineers | After-hours support for critical incidents, escalation point |
| System Administrators | Maintain systems to meet availability targets, participate in incident resolution |
| Users/Employees | Report issues accurately and promptly, cooperate with IT during resolution |
| Department Managers | Communicate department needs, participate in service reviews |
Procedures
1. Incident Response Process
1.1 Incident Report
- User submits incident through approved channel
- Help desk creates ticket in ticketing system
- Ticket automatically assigned unique ID
1.2 Initial Triage
Within 15 minutes:
- Help desk analyst reviews incident
- Assigns priority level based on impact and urgency
- Acknowledges receipt to user
- SLA clock starts
1.3 Investigation and Diagnosis
- Gather additional information from user if needed
- Review system logs and monitoring data
- Attempt to reproduce issue
- Check knowledge base for similar incidents
- Consult with technical specialists if needed
1.4 Resolution
- Implement fix or workaround
- Test resolution to verify issue resolved
- Document resolution steps in ticket
- Update knowledge base if new solution
1.5 User Verification
- Contact user to verify resolution
- Confirm user can access/use system normally
- Ensure no related issues
1.6 Ticket Closure
- Document final resolution
- Update ticket status to resolved
- Request user feedback via survey
- Close ticket
1.7 Follow-up
If applicable:
- Schedule follow-up for temporary workarounds
- Create problem ticket for recurring issues
- Update documentation or training materials
2. Service Request Fulfillment
2.1 Request Submission
- User submits service request
- Request logged in ticketing system
2.2 Request Validation
- Verify user authorized to make request
- Confirm request aligns with policies
- Check for required approvals
2.3 Approval Process
If needed:
- Route to appropriate approver (manager, IT leadership)
- Await approval decision
- Notify user of approval status
2.4 Fulfillment
- Provision access, equipment, or service
- Configure according to standards
- Test functionality before delivery
2.5 Delivery
- Deliver service to user
- Provide necessary documentation or training
- Obtain user acknowledgment
2.6 Closure
- Confirm user satisfaction
- Update asset/access records
- Close request ticket
3. Priority 1 Critical Incident Response
Special procedures for critical incidents:
3.1 Immediate Response
Within 15 minutes:
- Acknowledge incident
- Notify IT Operations Manager immediately
- Assemble incident response team
- Begin initial assessment
3.2 Communication
- Post incident notification to status page
- Send initial communication to affected users
- Establish communication channel (Slack #incident-response)
- Assign communication coordinator
3.3 Investigation and Containment
- Identify root cause
- Implement containment measures
- Engage vendor support if needed
- Document all actions taken
3.4 Resolution
- Implement fix or failover to backup systems
- Verify system functionality restored
- Monitor for stability
3.5 Recovery
- Gradually restore normal operations
- Validate data integrity
- Confirm all integrations working
3.6 Communication Updates
- Hourly status updates to stakeholders
- Update status page with current status
- Final communication when resolved
3.7 Post-Incident Review
Within 48 hours:
- Conduct incident retrospective
- Document timeline and actions taken
- Identify root cause
- Create action items to prevent recurrence
- Update procedures and documentation
4. SLA Performance Monitoring
Monthly SLA performance review:
4.1 Data Collection
- Export ticket data from ticketing system
- Gather uptime data from monitoring systems
- Collect performance metrics
- Compile user satisfaction scores
4.2 Metric Calculation
- System availability percentages
- Response time compliance by priority
- Resolution time compliance by priority
- Average resolution time
- First contact resolution rate
- User satisfaction scores
- Escalation frequency
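
Several of these metrics can be derived directly from exported ticket data. The sketch below assumes a hypothetical `Ticket` record; real ticketing exports will have different field names, and the policy does not define the exact formulas.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Ticket:
    # Hypothetical export record, not a real ticketing system schema.
    priority: int
    time_to_resolve: timedelta
    resolution_target: timedelta
    resolved_on_first_contact: bool
    escalated: bool


def sla_metrics(tickets: list[Ticket]) -> dict[str, float]:
    """Compute monthly rates (fractions from 0 to 1) and the average resolution time in hours."""
    if not tickets:
        raise ValueError("no tickets to evaluate")
    total = len(tickets)
    resolved_in_time = sum(t.time_to_resolve <= t.resolution_target for t in tickets)
    first_contact = sum(t.resolved_on_first_contact for t in tickets)
    escalated = sum(t.escalated for t in tickets)
    total_seconds = sum(t.time_to_resolve.total_seconds() for t in tickets)
    return {
        "resolution_compliance": resolved_in_time / total,
        "first_contact_resolution_rate": first_contact / total,
        "escalation_rate": escalated / total,
        "average_resolution_hours": total_seconds / total / 3600,
    }
```

Grouping the tickets by `priority` before calling `sla_metrics` would give the per-priority compliance figures listed above.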
4.3 Analysis
- Compare actual vs. target performance
- Identify trends and patterns
- Highlight areas of concern
- Recognize areas of excellence
4.4 Reporting
- Generate monthly SLA report
- Present to IT leadership
- Share summary with department heads
- Post metrics to internal dashboard
4.5 Improvement Actions
- Create action plans for underperforming areas
- Adjust resources if needed
- Update procedures or training
- Implement process improvements
5. Escalation Management
When an incident requires escalation:
5.1 Automated Escalation Trigger
- Ticketing system monitors time elapsed
- Sends escalation notification at defined thresholds:
  - Priority 1: 50% of target (2 hours)
  - Priority 2: 60% of target (4.8 hours)
  - Priority 3: 75% of target (18 hours)
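
The automated thresholds are each a fraction of the resolution target for that priority. A minimal sketch of the calculation follows; the constant and function names are hypothetical, and Priority 4 is omitted because no automated threshold is listed for it.

```python
from datetime import timedelta

# Resolution targets and escalation fractions as listed in this policy.
RESOLUTION_TARGETS = {1: timedelta(hours=4), 2: timedelta(hours=8), 3: timedelta(hours=24)}
ESCALATION_FRACTIONS = {1: 0.50, 2: 0.60, 3: 0.75}


def escalation_due_after(priority: int) -> timedelta:
    """Elapsed time after which the automated escalation notice is sent."""
    return RESOLUTION_TARGETS[priority] * ESCALATION_FRACTIONS[priority]


def should_escalate(priority: int, elapsed: timedelta) -> bool:
    """Return True once the ticket age reaches the escalation threshold."""
    return elapsed >= escalation_due_after(priority)
```

For Priority 2, 60% of the 8-hour target works out to 4 hours 48 minutes, or roughly 5 hours.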
5.2 Manual Escalation
- Analyst determines issue requires higher-level support
- Reassigns ticket to Level 2 or Level 3 team
- Updates ticket with escalation reason
- Notifies receiving team
5.3 Management Escalation
- IT Operations Manager notified for SLA-at-risk incidents
- Manager assesses need for additional resources
- CTO notified for Priority 1 incidents exceeding 4 hours
- Executive decision-making for major incidents
5.4 Escalation Communication
- User notified of escalation
- Stakeholders updated on status
- Ticket notes updated with escalation details
Exceptions
SLA exceptions may apply for:
- Force Majeure: Natural disasters, utility failures, internet provider outages
- Security Incidents: Response to active security threats may override normal SLAs
- Third-Party Dependencies: Issues caused by external service providers
- User Error: Issues resulting from user actions outside policy
- Scheduled Maintenance: Planned maintenance windows
- End of Life Systems: Systems scheduled for decommissioning (with advance notice)
Exception process:
- Document exception circumstances in incident ticket
- IT Operations Manager approval required
- User notification of exception and revised timeline
- Exceptions reviewed in monthly SLA report
Compliance and Enforcement
- Real-Time Monitoring: Ticketing system tracks SLA compliance in real time
- Automated Alerts: Notifications when SLAs are at risk
- Monthly Reporting: Detailed SLA compliance reports
- Key Metrics Tracked:
  - System availability percentage (target: meet SLA commitments)
  - Response time compliance (target: >95%)
  - Resolution time compliance (target: >90%)
  - User satisfaction score (target: >4.0/5.0)
  - First contact resolution rate (target: >70%)
  - Escalation rate (target: <10%)
- Quarterly Reviews: Comprehensive service review with stakeholders
- Annual SLA Review: Evaluate and adjust SLA targets based on performance and needs
- Continuous Improvement: Regular process improvements based on metrics and feedback
- User Feedback: Post-resolution surveys to measure satisfaction
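
The numeric targets under Key Metrics Tracked can be checked mechanically in the monthly report. The sketch below is an illustration with hypothetical metric keys; it treats each target as a strict comparison, as written in the list, and skips system availability because that target varies by system tier.

```python
import operator

# (comparison, threshold) pairs from the Key Metrics Tracked list.
METRIC_TARGETS = {
    "response_time_compliance": (operator.gt, 0.95),
    "resolution_time_compliance": (operator.gt, 0.90),
    "user_satisfaction": (operator.gt, 4.0),
    "first_contact_resolution_rate": (operator.gt, 0.70),
    "escalation_rate": (operator.lt, 0.10),
}


def missed_targets(measured: dict[str, float]) -> list[str]:
    """Return the names of measured metrics that do not meet their target."""
    return [
        name
        for name, (compare, threshold) in METRIC_TARGETS.items()
        if name in measured and not compare(measured[name], threshold)
    ]
```

Metrics missing from the input are ignored rather than reported, so a partial data export does not produce false alarms.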
References
- ITIL Service Operation - Incident Management
- ITIL Service Operation - Service Level Management
- ISO/IEC 20000: IT Service Management
- SOC 2 Trust Service Criteria: Availability
- HIPAA Security Rule - Contingency Planning and Response
Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-08 | IT Team | Initial version migrated from Notion |
Document Control
- Classification: Internal
- Distribution: All employees, IT team, department heads
- Storage: GitHub repository - policy-repository