What we are doing

Data is being generated at an unprecedented rate. The ability to process data in a scalable manner is a competitive differentiator for service providers and enterprises alike.

The Scalable Data Processing Activity at Bell Labs is concerned with exploring, understanding, and experimenting with the principles for the construction of scalable data processing systems for functional and analytic purposes. The Scalable Data Processing Activity is led by Thomas Woo. Its members are spread over multiple international locations.

Our research goal is to develop and apply techniques that allow us to construct systems with unlimited scalability. Departing from traditional system design, a system with unlimited scalability has no pre-defined engineering limits, such as the number of subscribers or the amount of data per subscriber. When provided with sufficient underlying resources, a system with unlimited scalability can "expand" to meet the incoming load without re-design. A key to constructing systems with unlimited scalability is an elastic system architecture, which allows system components to grow and shrink on demand. To build truly elastic systems, attention must be paid to both compute and data elasticity.
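
As an illustration of data elasticity, the sketch below shows consistent hashing, one common technique for letting a data tier grow and shrink while only a small fraction of keys move. The node names, virtual-node count, and key names are purely illustrative and not drawn from any of our systems.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes.

    Adding or removing a node only remaps the keys in the affected
    arcs, which is what lets a data tier expand or shrink on demand
    without a full re-partitioning.
    """

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self._ring = []      # sorted list of (hash, node) points
        self._hashes = []    # parallel list of hashes for bisect

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._ring.insert(idx, (h, node))

    def remove_node(self, node: str) -> None:
        keep = [(h, n) for h, n in self._ring if n != node]
        self._ring = keep
        self._hashes = [h for h, _ in keep]

    def lookup(self, key: str) -> str:
        """Return the node responsible for a key (clockwise successor)."""
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing()
    for n in ("node-a", "node-b", "node-c"):
        ring.add_node(n)
    keys = [f"subscriber-{i}" for i in range(10000)]
    before = {k: ring.lookup(k) for k in keys}
    ring.add_node("node-d")                      # elastically grow the data tier
    moved = sum(before[k] != ring.lookup(k) for k in keys)
    print(f"{moved / len(keys):.1%} of keys moved after adding one node")
```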

Our research approach consists of a mix of systems and theory. We strongly believe systems and theory go hand in hand: good system design is often rooted in a rigorous foundation.

Some example projects in our activity are:

Future Communications

Some of the current communication tools have been around for a long time. Voice is over 100 years old. Even the Web is more than 25 years old. In this project, we ask the question: what does the future of communication look like in a post voice/email/Web world?

Actually, we don't need to look very far. We are already witnessing a generational shift in our communications. Today’s teenagers don’t call, and they don’t email. Instead, they chat. Chatting has emerged as a significant form of communication with the rise of mobile. More and more people are “living” in chat. We believe this trend will continue and chat will play a big part in the future of communications.

However, the chat of today is like the Web in 1991, when the first web page was created. We believe significant innovations are possible in the chat space, and a key goal of the FutureComm project is to bring some of these innovations to chat. For example, most of today's chat is static content. Can we make it more dynamic? Also, the context of communications will become as important as its content. How can we extract the relevant context?

As a communication paradigm, chat has some powerful characteristics: it is instantaneous, asynchronous, and stateful. There is significant potential in leveraging these characteristics to create a new communication platform for the post voice/email/Web world.
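
As a rough illustration of what "asynchronous" and "stateful" mean here, the following minimal sketch models a conversation whose history and per-participant read positions persist between interactions. The class and field names are hypothetical and not part of any FutureComm design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Message:
    sender: str
    body: str
    sent_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Conversation:
    """Stateful: the full history and each participant's read position
    persist between interactions, unlike a voice call."""
    participants: List[str]
    history: List[Message] = field(default_factory=list)
    read_position: Dict[str, int] = field(default_factory=dict)

    def post(self, sender: str, body: str) -> None:
        # Asynchronous: the message is appended immediately; recipients
        # pick it up whenever they next read the conversation.
        self.history.append(Message(sender, body))

    def unread(self, participant: str) -> List[Message]:
        start = self.read_position.get(participant, 0)
        self.read_position[participant] = len(self.history)
        return self.history[start:]


if __name__ == "__main__":
    chat = Conversation(participants=["alice", "bob"])
    chat.post("alice", "lunch at 12?")
    chat.post("alice", "or 12:30?")
    print([m.body for m in chat.unread("bob")])   # bob catches up later
    print([m.body for m in chat.unread("bob")])   # nothing new -> []
```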

Stream Processing: Continuous Hive (CHive)

Bandwidth-efficient execution of online big data analytics in telecommunication networks demands tailored solutions. Existing streaming analytics systems are designed to operate in large data centers, assuming unlimited bandwidth between data center nodes. Applying these solutions unmodified to distributed telecommunication clouds overlooks the fact that the available bandwidth is a scarce and costly resource.

Continuous Hive (CHive) is a streaming analytics platform tailored for distributed telecommunication clouds. The fundamental contribution of CHive is that it optimizes query plans to minimize their overall bandwidth consumption when deployed in a distributed cloud.
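
The sketch below illustrates the general idea of bandwidth-aware operator placement with a toy cost model: operators pushed to the edge shrink the stream before it crosses the WAN, and the cheapest placement is chosen by enumeration. The operator names, selectivities, and rates are made up for illustration; this is not CHive's actual optimizer.

```python
from itertools import product

# Each operator reduces the data rate by a selectivity factor.
# Illustrative numbers only; a real cost model would be calibrated
# against measured stream rates.
OPERATORS = [
    ("filter_signalling", 0.20),   # keeps 20% of records
    ("pre_aggregate",     0.05),   # 5% of its input survives aggregation
    ("join_subscribers",  0.50),
]
EDGE_SITES = 10          # edge clouds producing raw streams
RAW_RATE_MBPS = 100.0    # raw stream rate per edge site


def wan_cost(placement):
    """Mbit/s shipped from the edge sites to the central site.

    Operators placed at the 'edge' run before the WAN hop and shrink
    the stream; once an operator runs at the 'core', everything
    downstream of it runs there too and no longer saves bandwidth.
    """
    rate = RAW_RATE_MBPS
    for (_, selectivity), where in zip(OPERATORS, placement):
        if where != "edge":
            break
        rate *= selectivity
    return rate * EDGE_SITES


def best_plan():
    """Enumerate all placements and return the cheapest one."""
    return min(product(["edge", "core"], repeat=len(OPERATORS)), key=wan_cost)


if __name__ == "__main__":
    plan = best_plan()
    for (name, _), where in zip(OPERATORS, plan):
        print(f"{name:>18}: {where}")
    print(f"WAN bandwidth: {wan_cost(plan):.1f} Mbit/s "
          f"(vs {RAW_RATE_MBPS * EDGE_SITES:.1f} Mbit/s with no push-down)")
```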

Subscriber Data Management

Subscriber data is a key asset of a service provider. Traditionally, the key system concern has been to serve this data efficiently for operational purposes. That by itself has become a challenge because of the amount of data and the need for low-latency operations.

Increasingly, there is significant interest in leveraging this data for analytic purposes, either for system improvements or for new revenue opportunities, and such analytics are becoming more and more real-time. The need for low latency and high reliability becomes even more challenging as these systems move from dedicated hardware to virtual machines and containers in the cloud.

In this project, we experiment with an integrated data layer design that can simultaneously address the operational, real-time analytics, and offline analytics requirements for scalability, elasticity, reliability, and response-time predictability.
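
To make the idea concrete, here is a minimal sketch of such an integrated layer: an in-memory key-value store serves the operational path, while a change stream feeds analytics without touching operational reads and writes. All names, fields, and the example real-time aggregate are hypothetical; a production design would of course add persistence, replication, and elasticity.

```python
import threading
from collections import defaultdict, deque


class SubscriberDataLayer:
    """Toy integrated data layer: a key-value store for operational
    reads/writes, plus a change stream that real-time and offline
    analytics consume without slowing the operational path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}                    # operational state: id -> profile
        self._changes = deque(maxlen=100000)  # bounded in-memory change stream

    # ---- operational path: low-latency reads and writes ----
    def put(self, subscriber_id, profile):
        with self._lock:
            self._records[subscriber_id] = profile
            self._changes.append(("put", subscriber_id, profile))

    def get(self, subscriber_id):
        with self._lock:
            return self._records.get(subscriber_id)

    # ---- analytics path: drain changes in batches ----
    def drain_changes(self):
        with self._lock:
            batch = list(self._changes)
            self._changes.clear()
        return batch


def realtime_aggregate(changes):
    """Example real-time analytic: subscribers updated per home region."""
    counts = defaultdict(int)
    for op, _, profile in changes:
        if op == "put":
            counts[profile.get("region", "unknown")] += 1
    return dict(counts)


if __name__ == "__main__":
    layer = SubscriberDataLayer()
    layer.put("sub-1", {"region": "EMEA", "plan": "prepaid"})
    layer.put("sub-2", {"region": "APAC", "plan": "postpaid"})
    print(layer.get("sub-1"))
    print(realtime_aggregate(layer.drain_changes()))
```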

Virtual Radio Access Network (VRAN)

This project investigates the use of Linux containers, a lightweight virtualization technology suitable for real-time applications, in the wireless space.

Unlike most academic efforts, we experiment with a commercial eNodeB stack and have created multiple instances using containers based on Linux Containers (LXC) and Docker. As part of the case study, we connected the virtualized eNodeBs to virtualized EPCs, creating multiple end-to-end LTE access networks.

A major focus of the project is the orchestration and configuration layer, which allows the dynamic creation, deployment, and configuration of the wireless access network elements. In particular, the mix of containers, which trend towards micro-service architectures, and virtual machines poses new challenges for the orchestration and configuration layer.
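
As a flavor of what such an orchestration layer automates, the sketch below starts and tears down containerized eNodeB instances through the plain Docker CLI. The image name and the MME_ADDR environment variable are hypothetical, and a real deployment additionally needs radio device access, real-time scheduling, and network configuration that this sketch omits.

```python
import subprocess

# Hypothetical image name; substitute whatever image packages the eNodeB stack.
ENODEB_IMAGE = "example/enodeb:latest"


def launch_enodeb(instance: str, mme_addr: str) -> str:
    """Start one containerized eNodeB instance via the Docker CLI and
    return its container id."""
    cmd = [
        "docker", "run", "-d",
        "--name", instance,
        "--env", f"MME_ADDR={mme_addr}",   # hypothetical S1 endpoint of the EPC
        ENODEB_IMAGE,
    ]
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()


def teardown(instance: str) -> None:
    """Stop and remove an instance when scaling the access network down."""
    subprocess.run(["docker", "rm", "-f", instance], check=True)


if __name__ == "__main__":
    for i, mme in enumerate(["10.0.0.10", "10.0.0.11"]):
        cid = launch_enodeb(f"enodeb-{i}", mme)
        print(f"enodeb-{i} -> {cid[:12]}")
```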

Root Cause Analysis for Cloud

Root cause analysis (RCA) is critical to correct and reliable system operations. RCA for distributed systems is a hard problem because of the local and remote dependencies. While a software or hardware failure in one node can propagate across these dependencies and result in severe degradation of overall system performance, localizing the original failure is not straightforward and requires adequate monitoring, alarming, and diagnosis tools.

With many applications now virtualized and running in the cloud, RCA for the cloud is an important topic. However, RCA for the cloud can be even harder than RCA for traditional distributed systems. For example, resource sharing is the norm in the cloud and can be an unpredictable source of performance degradation. Furthermore, many cloud applications are elastic in nature, so just understanding their correct behavior and detecting anomalies is a challenge. Lastly, there can be new failure modes: for example, inappropriate placement or improper scaling can significantly affect overall system operation.

We address the above challenges through the complete chain of RCA functions, which includes Monitoring, Anomaly Detection & Alarm Generation, Alarm Correlation & Fault Diagnosis, and Recovery Actions & Notifications.
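
The toy sketch below walks through a slice of that chain: a simple per-component anomaly detector raises alarms, and a dependency graph is then used to correlate them down to a likely root cause. The dependency graph, metrics, and threshold are illustrative only and do not reflect the project's actual detectors or models.

```python
import statistics

# Illustrative dependency graph: an alarm on a component may be explained
# by an alarm on something it depends on (e.g. the host running its VM).
DEPENDS_ON = {
    "app-frontend": ["vm-3"],
    "vm-3": ["host-B"],
    "host-B": [],
}


def detect_anomalies(metrics, threshold=3.0):
    """Monitoring + anomaly detection: flag components whose latest sample
    deviates from their own history by more than `threshold` standard
    deviations (a deliberately simple detector)."""
    alarms = []
    for component, samples in metrics.items():
        history, latest = samples[:-1], samples[-1]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1e-9
        if abs(latest - mean) / stdev > threshold:
            alarms.append(component)
    return alarms


def correlate(alarms):
    """Alarm correlation: an alarm is a likely root cause if none of the
    components it depends on is also alarming."""
    alarming = set(alarms)
    return [a for a in alarms
            if not any(dep in alarming for dep in DEPENDS_ON.get(a, []))]


if __name__ == "__main__":
    metrics = {
        "app-frontend": [10, 11, 10, 12, 11, 95],   # latency in ms; last sample spikes
        "vm-3":         [5, 6, 5, 5, 6, 48],
        "host-B":       [1, 1, 2, 1, 1, 1],
    }
    alarms = detect_anomalies(metrics)
    print("alarms:", alarms)                        # ['app-frontend', 'vm-3']
    print("likely root cause:", correlate(alarms))  # ['vm-3']
```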