The status quo in visual recognition is to learn from batches of unrelated Web photos labeled by human annotators. Yet cognitive science tells us that perception develops in the context of acting and moving in the world---and without intensive supervision. Meanwhile, many realistic vision tasks require not just categorizing a well-composed, human-taken photo, but also intelligently deciding where to look in order to get a meaningful observation in the first place. In the context of these challenges, we are exploring ways to learn visual representations from unlabeled video accompanied by multi-modal sensory data such as egomotion and sound. Moving from passively captured video to agents that control their own cameras, we investigate how an agent should move in order to acquire informative visual observations. In particular, we introduce policy learning approaches for active look-around behavior---both in the service of a specific recognition task and for generic exploratory behavior.
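
To make the look-around setting concrete, the sketch below shows one standard way such a policy could be trained: a small network observes the current glimpse, samples a discrete camera motion, and a REINFORCE-style update favors motion sequences that earn higher task reward. This is an illustrative sketch only; the architecture, the dimensions, and the helpers get_glimpse and task_reward are assumed stand-ins rather than the specific models and rewards used in this work.

\begin{verbatim}
# Minimal sketch of policy learning for look-around behavior (illustrative).
# The policy observes an encoded glimpse and picks the next camera motion;
# REINFORCE reinforces motions in proportion to the episode's task reward.
import torch
import torch.nn as nn

N_MOTIONS = 8          # discrete camera motions (e.g., pan/tilt steps)
GLIMPSE_DIM = 64       # feature size of an encoded glimpse
EPISODE_LEN = 4        # number of glimpses the agent may take

policy = nn.Sequential(
    nn.Linear(GLIMPSE_DIM, 128), nn.ReLU(),
    nn.Linear(128, N_MOTIONS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def get_glimpse(motion_idx):
    # Stand-in for rendering and encoding the view reached by the motion.
    return torch.randn(GLIMPSE_DIM)

def task_reward(glimpses):
    # Stand-in for a task-driven reward, e.g., recognition accuracy or how
    # well the collected glimpses let the agent complete the scene.
    return torch.randn(()).item()

for episode in range(100):
    glimpse = torch.randn(GLIMPSE_DIM)     # initial random viewpoint
    log_probs, glimpses = [], [glimpse]
    for t in range(EPISODE_LEN):
        logits = policy(glimpse)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        glimpse = get_glimpse(action.item())
        glimpses.append(glimpse)
    reward = task_reward(glimpses)
    # REINFORCE: raise the log-probability of the chosen motions in
    # proportion to the reward the resulting glimpses earned.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
\end{verbatim}

In practice the reward would come from the downstream objective (for a specific recognition task) or from a generic measure of how informative the accumulated glimpses are (for task-agnostic exploration), which is exactly the distinction drawn above.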