Pandas Python For Mac



The pandas library installs on Windows, Mac, and Linux via pip. Mac and Windows users who prefer binaries can download them from the pandas website. Most Linux distributions, including Ubuntu and Debian, also have native packages pre-built and available in their repositories. As machine learning becomes more prevalent, Python has emerged as a leading language for scientific work. Within Python, NumPy and pandas are essential for most scientific computation, and understanding how they work together is critical for the aspiring data scientist. Pandas is a data analysis module for the Python programming language. It is open source and BSD-licensed, and it is used in a wide range of fields, including academia, finance, economics, statistics, and analytics.

Author

Data Analysis with Pandas and Python is one of the best courses for learning data analysis with pandas and Python. It is both hands-on and interactive. The course material is well organized, and the delivery is fantastic. However, it would have been even better with a project that gives a sense of real-life exercises.

Bob Savage <bobsavage@mac.com>

Python on a Macintosh running Mac OS X is in principle very similar to Python on any other Unix platform, but there are a number of additional features such as the IDE and the Package Manager that are worth pointing out.

4.1. Getting and Installing MacPython

Mac OS X 10.8 comes with Python 2.7 pre-installed by Apple. If you wish, you are invited to install the most recent version of Python 3 from the Python website (https://www.python.org). A current “universal binary” build of Python, which runs natively on the Mac’s new Intel and legacy PPC CPUs, is available there.

What you get after installing is a number of things:

  • A Python 3.9 folder in your Applications folder. In here you find IDLE, the development environment that is a standard part of official Python distributions; and PythonLauncher, which handles double-clicking Python scripts from the Finder.

  • A framework /Library/Frameworks/Python.framework, which includes the Python executable and libraries. The installer adds this location to your shell path. To uninstall MacPython, you can simply remove these three things. A symlink to the Python executable is placed in /usr/local/bin/.

The Apple-provided build of Python is installed in /System/Library/Frameworks/Python.framework and /usr/bin/python, respectively. You should never modify or delete these, as they are Apple-controlled and are used by Apple- or third-party software. Remember that if you choose to install a newer Python version from python.org, you will have two different but functional Python installations on your computer, so it will be important that your paths and usages are consistent with what you want to do.
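With two installations side by side, it helps to confirm which interpreter is actually running your code. The following is a minimal sketch using only the standard library; the variable names are illustrative, and the paths it prints depend entirely on your machine.

```python
import shutil
import sys

# Path of the interpreter that is executing this code.
current_interpreter = sys.executable

# Version of that interpreter as a "major.minor.micro" string.
current_version = ".".join(str(part) for part in sys.version_info[:3])

# First "python3" found on the shell search path, or None if there is none.
python3_on_path = shutil.which("python3")

print("running:", current_interpreter)
print("version:", current_version)
print("python3 on PATH:", python3_on_path)
```

If the first and third lines point at different locations, the shell and the running program are using different installations.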

IDLE includes a help menu that allows you to access Python documentation. If you are completely new to Python you should start reading the tutorial introduction in that document.

If you are familiar with Python on other Unix platforms you should read the section on running Python scripts from the Unix shell.

4.1.1. How to run a Python script

Your best way to get started with Python on Mac OS X is through the IDLE integrated development environment; see section The IDE and use the Help menu when the IDE is running.

If you want to run Python scripts from the Terminal window command line or from the Finder you first need an editor to create your script. Mac OS X comes with a number of standard Unix command line editors, vim and emacs among them. If you want a more Mac-like editor, BBEdit or TextWrangler from Bare Bones Software (see http://www.barebones.com/products/bbedit/index.html) are good choices, as is TextMate (see https://macromates.com/). Other editors include Gvim (http://macvim-dev.github.io/macvim/) and Aquamacs (http://aquamacs.org/).

To run your script from the Terminal window you must make sure that /usr/local/bin is in your shell search path.
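One way to check the search path is from Python itself. This is a small sketch, not part of MacPython; path_entries and is_on_path are hypothetical helper names introduced here for illustration.

```python
import os


def path_entries(path_value):
    """Split a PATH-style string into its directories, skipping empty entries."""
    return [entry for entry in path_value.split(os.pathsep) if entry]


def is_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of `path_value`."""
    wanted = os.path.normpath(directory)
    return any(os.path.normpath(entry) == wanted for entry in path_entries(path_value))


# Check the PATH of the current process.
print(is_on_path("/usr/local/bin", os.environ.get("PATH", "")))
```

Note that this reports the path seen by the running Python process, which matches your shell only when Python was started from that shell.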

To run your script from the Finder you have two options:

  • Drag it to PythonLauncher

  • Select PythonLauncher as the default application to open your script (or any .py script) through the Finder Info window and double-click it. PythonLauncher has various preferences to control how your script is launched. Option-dragging allows you to change these for one invocation, or use its Preferences menu to change things globally.

4.1.2. Running scripts with a GUI

With older versions of Python, there is one Mac OS X quirk that you need to be aware of: programs that talk to the Aqua window manager (in other words, anything that has a GUI) need to be run in a special way. Use pythonw instead of python to start such scripts.

With Python 3.9, you can use either python or pythonw.

4.1.3. Configuration

Python on OS X honors all standard Unix environment variables such as PYTHONPATH, but setting these variables for programs started from the Finder is non-standard as the Finder does not read your .profile or .cshrc at startup. You need to create a file ~/.MacOSX/environment.plist. See Apple’s Technical Document QA1067 for details.
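To see whether a variable such as PYTHONPATH actually reached a program, you can inspect it from inside that program. A minimal sketch using only the standard library:

```python
import os
import sys

# Value of PYTHONPATH as seen by this process, or None if it is not set.
pythonpath = os.environ.get("PYTHONPATH")
print("PYTHONPATH:", pythonpath)

# sys.path lists the directories Python searches for modules;
# directories from PYTHONPATH, when present, are included in it.
for entry in sys.path:
    print(entry)
```

Running the same script once from the Terminal and once from the Finder is a quick way to compare the two environments.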

For more information on installing Python packages in MacPython, see section Installing Additional Python Packages.

4.2. The IDE

MacPython ships with the standard IDLE development environment. A good introduction to using IDLE can be found at http://www.hashcollision.org/hkn/python/idle_intro/index.html.

4.3. Installing Additional Python Packages

There are several methods to install additional Python packages:

  • Packages can be installed via the standard Python distutils mode (python setup.py install).

  • Many packages can also be installed via the setuptools extension or pip wrapper, see https://pip.pypa.io/.
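After installing a package (for pandas, typically with python3 -m pip install pandas), you may want to confirm that the interpreter you are using can see it. The sketch below relies on the standard importlib.metadata module, available from Python 3.8; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages this document relies on.
for name in ("numpy", "pandas"):
    print(name, installed_version(name))
```

A result of None for a package you just installed usually means it went into a different Python installation than the one running the check.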

4.4. GUI Programming on the Mac

There are several options for building GUI applications on the Mac with Python.

PyObjC is a Python binding to Apple’s Objective-C/Cocoa framework, which is the foundation of most modern Mac development. Information on PyObjC is available from https://pypi.org/project/pyobjc/.

The standard Python GUI toolkit is tkinter, based on the cross-platform Tk toolkit (https://www.tcl.tk). An Aqua-native version of Tk is bundled with OS X by Apple, and the latest version can be downloaded and installed from https://www.activestate.com; it can also be built from source.

wxPython is another popular cross-platform GUI toolkit that runs natively on Mac OS X. Packages and documentation are available from https://www.wxpython.org.

PyQt is another popular cross-platform GUI toolkit that runs natively on Mac OS X. More information can be found at https://riverbankcomputing.com/software/pyqt/intro.

4.5. Distributing Python Applications on the Mac

The standard tool for deploying standalone Python applications on the Mac is py2app. More information on installing and using py2app can be found at http://undefined.org/python/#py2app.

4.6. Other Resources

The MacPython mailing list is an excellent support resource for Python users and developers on the Mac:

Another useful resource is the MacPython wiki:

Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say it’s the “SQL of Python.” Why? Because pandas helps you to manage two-dimensional data tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you the most important (that is, the most often used) things that you have to know as an Analyst or a Data Scientist. This is the first episode and we will start from the basics!

Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!

Before we start

If you haven’t done so yet, I recommend going through these articles first:

To follow this pandas tutorial…

  1. You will need a fully functioning data server with Python3, numpy and pandas on it.
    Note 1: Again, with this tutorial you can set up your data server and Python3. And with this article you can set up numpy and pandas, too.
    Note 2: or take this step-by-step data server set up video course.
  2. Next step: log in to your server and fire up Jupyter. Then open a new Jupyter Notebook in your favorite browser. (If you don’t know how to do that, I really do recommend going through the articles I linked in the “Before we start” section.)
    Note: I’ll also rename my Jupyter Notebook to “pandas_tutorial_1”.
  3. Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:

    import numpy as np
    import pandas as pd

    Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the end of your import statement, your Jupyter Notebook understands that from this point on every time you type pd, you are actually referring to the pandas library.

Okay, now we have everything! Let’s start with this pandas tutorial!
The first question is:

How to open data files in pandas

You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or something else. But the goal is the same in all cases. If you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas.

Pandas data structures

There are two types of data structures in pandas: Series and DataFrames.

Series: a pandas Series is a one-dimensional data structure (“a one-dimensional ndarray”) that can store values, and every value has an index label, too.

Pandas Series example

DataFrame: a pandas DataFrame is a two-dimensional data structure – basically a table with rows and columns. The columns have names and the rows have indexes.

In this pandas tutorial, I’ll focus mostly on DataFrames. The reason is simple: most of the analytical methods I will talk about will make more sense in a 2D datatable than in a 1D array.
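To make the two structures concrete, here is a minimal sketch that builds one of each by hand. The values are made up for illustration only, and the variable names are my own.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
water_need = pd.Series([500, 300, 200], index=["a", "b", "c"], name="water_need")

# A DataFrame: a table with named columns and indexed rows.
zoo = pd.DataFrame({
    "animal": ["elephant", "tiger", "zebra"],
    "water_need": [500, 300, 200],
})

print(water_need)
print(zoo)

# A single column taken out of a DataFrame is itself a Series.
print(type(zoo["animal"]).__name__)
```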

Loading a .csv file into a pandas DataFrame

Okay, time to put things into practice! Let’s load a .csv data file into pandas!
There is a function for it, called read_csv().

Start with a simple demo data set, called zoo! This time – for the sake of practicing – you will create a .csv file for yourself! Here’s the raw data:

Go back to your Jupyter Home tab and create a new text file…

…then copy-paste the above zoo data into this text file…

… and then rename this text file to zoo.csv!

Okay, this is our .csv file.
Now, go back to your Jupyter Notebook (that I named ‘pandas_tutorial_1’) and open this freshly created .csv file in it!

Again, the function that you have to use is: read_csv()
Type this to a new cell:

pd.read_csv('zoo.csv', delimiter = ',')

And there you go! This is the zoo.csv data file, brought to pandas. This nice 2D table? Well, this is a pandas dataframe. The numbers on the left are the indexes. And the column names on the top are picked up from the first row of our zoo.csv file.
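If you want to try read_csv() without creating a file first, the sketch below parses a string instead. The rows are made up for illustration and are not the tutorial's zoo data; io.StringIO simply makes the string behave like an open file.

```python
import io

import pandas as pd

# Made-up rows standing in for the contents of a small comma-separated file.
csv_text = """animal,uniq_id,water_need
elephant,1001,500
tiger,1002,300
zebra,1003,200
"""

# read_csv parses the file-like object exactly as it would a file on disk:
# the first row becomes the column names, and rows get indexes 0, 1, 2.
zoo = pd.read_csv(io.StringIO(csv_text), delimiter=",")
print(zoo)
```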

To be honest, though, you will probably never create a .csv data file for yourself, like we just did… you will use pre-existing data files. So you have to learn how to download .csv files to your server!

If you are here from the Junior Data Scientist’s First Month video course then you have already dealt with downloading your .txt or .csv data files to your data server, so you must be pretty proficient in it… But if you are not here from the course (or if you want to learn another way to download a .csv file to your server and to get another exciting dataset), follow these steps:

I’ve uploaded a small sample dataset here: DATASET

(Link: 46.101.230.157/dilan/pandas_tutorial_read.csv)

If you click the link, the data file will be downloaded to your computer. But you don’t want to download this data file to your computer, right? You want to download it to your server and then load it to your Jupyter Notebook. It only takes two steps.

STEP 1) Go back to your Jupyter Notebook and type this command:

!wget 46.101.230.157/dilan/pandas_tutorial_read.csv

This downloaded the pandas_tutorial_read.csv file to your server. Just check it out:

See? It’s there.

If you click it…

…you can even check out the data in it.

STEP 2) Now, go back again to your Jupyter Notebook and use the same read_csv function that we have used before (but don’t forget to change the file name and the delimiter value):

pd.read_csv('pandas_tutorial_read.csv', delimiter=';')

The data is loaded into pandas!

Does something feel off? Yes, this time we didn’t have a header in our csv file, so we have to set it up manually! Add the names parameter to your function!

pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])

Better!
And with that, we finally loaded our .csv data into a pandas dataframe!
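The effect of the names parameter is easy to see on a tiny header-less sample. This is a sketch with made-up rows in the same semicolon-separated layout, not the real log file.

```python
import io

import pandas as pd

# Made-up rows without a header line, for illustration only.
raw_text = """2018-01-01 00:01:01;read;country_7;1001;SEO;North America
2018-01-01 00:03:20;read;country_7;1002;SEO;South America
2018-01-01 00:04:01;read;country_7;1003;AdWords;Africa
"""
column_names = ["my_datetime", "event", "country", "user_id", "source", "topic"]

# Without names, pandas treats the first data row as the header and loses it as data.
without_names = pd.read_csv(io.StringIO(raw_text), delimiter=";")

# With names, every row is kept as data and the columns get proper labels.
with_names = pd.read_csv(io.StringIO(raw_text), delimiter=";", names=column_names)

print(without_names.shape)
print(with_names.shape)
```

The first shape has one row fewer than the second, which is exactly the row that was swallowed as a header.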

Note 1: Just so you know, there is an alternative method. (I don’t prefer it though.) You can load the .csv data using the URL directly. In this case the data won’t be downloaded to your data server.

read the .csv directly from the server (using its URL)
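A minimal sketch of that alternative: read_csv() accepts a URL in place of a file name. The link above is given without a scheme, so the http:// prefix below is an assumption, and read_article_log is a hypothetical helper name.

```python
import pandas as pd

COLUMN_NAMES = ["my_datetime", "event", "country", "user_id", "source", "topic"]


def read_article_log(source):
    """Load the semicolon-separated, header-less log from a file path or a URL."""
    return pd.read_csv(source, delimiter=";", names=COLUMN_NAMES)


# Example call (needs network access; the http:// scheme is an assumption):
# article_read = read_article_log("http://46.101.230.157/dilan/pandas_tutorial_read.csv")
```

The same helper also works with the local file name, so you can switch between the two approaches by changing only the argument.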

Note 2: If you are wondering what’s in this data set – this is the data log of a travel blog. This is a log of one day only (if you are a JDS course participant, you will get much more of this data set in the last week of the course ;-)). I guess the names of the columns are fairly self-explanatory.

Selecting data from a dataframe in pandas

This is the first episode of this pandas tutorial series, so let’s start with a few very basic data selection methods – and in the next episodes we will go deeper!

1) Print the whole dataframe

The most basic method is to print your whole data frame to your screen. Of course, you don’t have to run the pd.read_csv() function again and again and again. Just store its output the first time you run it!

article_read = pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])

After that, you can call this article_read value anytime to print your DataFrame!

2) Print a sample of your dataframe

Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a few lines are enough, you can print only the first 5 lines – by typing:

article_read.head()

Or the last few lines by typing:

article_read.tail()

Or a few random lines by typing:

article_read.sample(5)
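Here is a small sketch of the three methods on a made-up 20-row table, so you can see exactly which rows each one returns. The random_state argument is only there to make the random sample repeatable.

```python
import pandas as pd

# A made-up table with the numbers 1 to 20 in a single column.
numbers = pd.DataFrame({"n": range(1, 21)})

first_rows = numbers.head()                       # first 5 rows by default
last_rows = numbers.tail(3)                       # last 3 rows
random_rows = numbers.sample(5, random_state=0)   # 5 random rows, repeatable

print(first_rows)
print(last_rows)
print(random_rows)
```

Both head() and tail() take an optional number of rows, so head(10) or tail(2) work the same way.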

3) Select specific columns of your dataframe

This one is a bit tricky! Let’s say you want to print the ‘country’ and the ‘user_id’ columns only.
You should use this syntax:

article_read[['country', 'user_id']]

Any guesses why we have to use double square brackets? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer brackets tell pandas that you want to select columns, and the inner brackets are for the list (remember? Python lists go between square brackets) of the column names.

By the way, if you change the order of the column names, the order of the returned columns will change, too:

article_read[['user_id', 'country']]

This is the DataFrame of your selected columns.

Note: Sometimes (especially in predictive analytics projects), you want to get Series objects instead of DataFrames. You can get a Series using either of these two syntaxes (and selecting only one column):

article_read.user_id
article_read['user_id']

output is a Series object and not a DataFrame object
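The difference between single and double brackets is easiest to see by checking the type of each result. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, for illustration only.
article_read = pd.DataFrame({
    "country": ["country_7", "country_2", "country_5"],
    "user_id": [1001, 1002, 1003],
    "source": ["SEO", "AdWords", "SEO"],
})

two_columns = article_read[["user_id", "country"]]   # list of names -> DataFrame
one_column_frame = article_read[["user_id"]]         # one-item list -> DataFrame
one_column_series = article_read["user_id"]          # bare name -> Series

print(type(two_columns).__name__)
print(type(one_column_frame).__name__)
print(type(one_column_series).__name__)
```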

4) Filter for specific values in your dataframe

If the previous one was a bit tricky, this one will be really tricky!

Let’s say, you want to see a list of only the users who came from the ‘SEO’ source. In this case you have to filter for the ‘SEO’ value in the ‘source’ column:

article_read[article_read.source == 'SEO']

It’s worth it to understand how pandas thinks about data filtering:

STEP 1) First, inside the square brackets it evaluates every row: is the article_read.source column’s value 'SEO' or not? The results are boolean values (True or False).

STEP 2) Then from the article_read table, it prints every row where this value is True and doesn’t print any row where it’s False.

Does it look over-complicated? Maybe. But this is the way it is, so let’s just learn it because you will use this a lot! 😉
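The two steps can be run separately to see what happens in between. This is a sketch on made-up rows; is_seo and seo_rows are names I picked for the illustration.

```python
import pandas as pd

# Made-up rows, for illustration only.
article_read = pd.DataFrame({
    "user_id": [1001, 1002, 1003, 1004],
    "source": ["SEO", "AdWords", "SEO", "Reddit"],
})

# STEP 1: compare every value in the column; the result is a Series of booleans.
is_seo = article_read.source == "SEO"

# STEP 2: use that boolean Series to keep only the rows where it is True.
seo_rows = article_read[is_seo]

print(is_seo)
print(seo_rows)
```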

Functions can be used after each other

It’s very important to understand that pandas’s logic is very linear (compared to SQL, for instance). So if you apply a function, you can always apply another one on it. In this case, the input of the latter function will always be the output of the previous function.

E.g. combine these two selection methods:

article_read.head()[['country', 'user_id']]

This line first selects the first 5 rows of our data set. And then it takes only the ‘country’ and the ‘user_id’ columns.

Could you get the same result with a different chain of functions? Of course you can:

article_read[['country', 'user_id']].head()

In this version, you select the columns first, then take the first five rows. The result is the same – the order of the functions (and the execution) is different.
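You can check that both orders give the same table with the equals() method. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows, for illustration only.
article_read = pd.DataFrame({
    "country": [f"country_{n}" for n in (7, 2, 5, 2, 8, 6, 2, 4)],
    "user_id": list(range(1001, 1009)),
    "source": ["SEO", "AdWords", "SEO", "Reddit", "SEO", "AdWords", "SEO", "Reddit"],
})

rows_then_columns = article_read.head()[["country", "user_id"]]
columns_then_rows = article_read[["country", "user_id"]].head()

# equals() compares shape, labels and values of the two results.
print(rows_then_columns.equals(columns_then_rows))
```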

One more thing. What happens if you replace the ‘article_read’ value with the original read_csv() function:

pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])[['country', 'user_id']].head()

This will work, too – only it’s ugly (and inefficient). But it’s really important that you understand that working with pandas is nothing but applying the right functions and methods, one by one.

Test yourself!

As always, here’s a short assignment to test yourself! Solve it, so the content of this article can sink in better!

Select the user_id, the country and the topic columns for the users who are from country_2! Print the first five rows only!

Okay, go ahead and solve it!

.
.
.

And here’s my solution!
It can be a one-liner:

article_read[article_read.country == 'country_2'][['user_id','topic', 'country']].head()

Or, to be more transparent, you can break this into more lines:

Either way, the logic is the same. First you take your original dataframe (article_read), then you filter for the rows where the country value is country_2 ([article_read.country == 'country_2']), then you take the three columns that were required ([['user_id','topic', 'country']]) and eventually you take the first five rows only (.head()).
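One possible way to split the solution into separate lines is sketched below, run on made-up rows so it works without the data file. The intermediate variable names are my own.

```python
import pandas as pd

# Made-up rows, for illustration only.
countries = [2, 7, 2, 2, 5, 2, 2, 8, 2, 2]
article_read = pd.DataFrame({
    "country": [f"country_{n}" for n in countries],
    "user_id": list(range(1001, 1011)),
    "topic": ["Asia", "Europe", "Africa", "Asia", "Europe",
              "Africa", "Asia", "Europe", "Africa", "Asia"],
})

# Filter the rows, then pick the columns, then keep the first five rows.
from_country_2 = article_read[article_read.country == "country_2"]
selected_columns = from_country_2[["user_id", "topic", "country"]]
first_five = selected_columns.head()

# The one-liner version gives the same result.
one_liner = article_read[article_read.country == "country_2"][["user_id", "topic", "country"]].head()

print(first_five)
print(first_five.equals(one_liner))
```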

Conclusion

You are done with the first episode of my pandas tutorial series! Great job! In the next article, you can learn more about the different aggregation methods (e.g. sum, mean, max, min) and about grouping (so basically about segmentation). Stay with me: Pandas Tutorial, Episode 2!

  • If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
  • Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.

Cheers,
Tomi Mester